CrystalEye42 / OpenChemIE

MIT License

Cropped images from figures #16

Closed: SGenheden closed this issue 5 days ago

SGenheden commented 3 weeks ago

I often find that the routine that extracts images from figures crops them too tightly, which of course makes it harder to extract molecules and reactions from these images.

I cloned the GitHub repo from here and followed the installation instructions.

Example from: https://pubs.acs.org/doi/10.1021/acs.jmedchem.8b00644

I download the PDF and do something like this:

import torch
from openchemie import OpenChemIE

# Run the model on the CPU
model = OpenChemIE(device=torch.device('cpu'))
pdf_path = "odagiri-et-al-2018-design-synthesis-and-biological-evaluation-of-novel-7-(3as-7as)-3a-aminohexahydropyrano-3-4-c-pyrrol.pdf"
# Extract the figures together with their bounding boxes
figures = model.extract_figures_from_pdf(pdf_path, output_bbox=True)
# Show the cropped image of the second extracted figure
figures[1]["figure"]["image"]

gives

[image: the extracted figure, cropped too much]

Is there an easy way to fix this? Any suggestion is much appreciated.

Ozymandias314 commented 2 weeks ago

Hello, unfortunately the initial PDF processing is done by off-the-shelf tools. Specifically, image segmentation is implemented through LayoutParser. We had previously noticed this problem too, with cropped images coming out bad at the boundaries, but addressing it was outside the scope of this project. One suggestion I have is to use a newer tool for PDF preprocessing. I recommend looking into marker, which may be more robust. Please let us know if you find any alternative solutions!

SGenheden commented 2 weeks ago

Thanks for the quick reply. And thanks for confirming that this is a known limitation of the approach. I must say it was a bit surprising to me, and it basically means that currently OpenChemIE cannot really be used in a black-box fashion to extract reaction data. I will look into "marker" and other tools.
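
Since the snippet above already requests bounding boxes with output_bbox=True, one possible stopgap (not something proposed in this thread) is to pad each detected box by a margin and re-crop the figure from the full page image. The sketch below only does the padding arithmetic. It assumes the box is given as (left, top, right, bottom) in pixel coordinates of the page image, which is an assumption and not a documented OpenChemIE format; the function name expand_bbox and the default margin are hypothetical.

```python
from typing import Tuple

# Assumed box layout: (left, top, right, bottom) in pixels of the page image.
BBox = Tuple[float, float, float, float]


def expand_bbox(bbox: BBox, page_width: float, page_height: float,
                margin: float = 20.0) -> BBox:
    """Grow a box by `margin` pixels on every side, clamped to the page."""
    left, top, right, bottom = map(float, bbox)
    return (
        max(0.0, left - margin),                 # do not go past the left edge
        max(0.0, top - margin),                  # do not go past the top edge
        min(float(page_width), right + margin),  # do not go past the right edge
        min(float(page_height), bottom + margin),  # do not go past the bottom edge
    )


# Example: a box close to the left edge of a 1000 x 1400 pixel page.
padded = expand_bbox((10, 50, 400, 300), page_width=1000, page_height=1400)
print(padded)
```

If the page has been rendered to a Pillow image, the padded box (rounded to integers) could be passed to that image's crop method to get a looser crop. This only helps when the detected box is slightly too tight; it does not fix a figure that the layout model segmented incorrectly.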