Closed DevinBayly closed 1 week ago
After spending time with pymupdf, I think the issue here is that many figures are considered graphics, not images, so they end up not getting picked up.
I've spent a bunch of time on this, and although I think pymupdf is our best alternative, it is still a complex program. I think the main benefit it provides is that it will let us pull graphics
from the pages, but I feel these are mostly diagrams.
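To check the graphics-versus-images theory on a given pdf, a rough sketch along these lines could count both kinds of content per page. It assumes PyMuPDF is installed (imported as fitz). The two function names are made up for illustration; only Page.get_images and Page.get_drawings are PyMuPDF calls.

```python
def count_page_content(pdf_path):
    """Return (page_number, n_embedded_images, n_vector_drawings) per page."""
    import fitz  # PyMuPDF, imported lazily so the helper below works without it

    counts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            n_images = len(page.get_images(full=True))  # embedded raster images
            n_drawings = len(page.get_drawings())       # vector graphics paths
            counts.append((page.number, n_images, n_drawings))
    return counts


def pages_with_only_graphics(counts):
    """Pages that have vector drawings but no embedded images (hypothetical helper)."""
    return [page for page, n_images, n_drawings in counts
            if n_images == 0 and n_drawings > 0]
```

Something like pages_with_only_graphics(count_page_content("2187855029.pdf")) would then list the pages where an image-only dump is likely to miss the figures.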
It looks like Ben's pdf grabber code actually goes through and deletes almost all the ppm files that come from the pdfimages tool. Sometimes this is the right move, but in other cases it tosses out results that we want to keep.
See this screenshot of the result of running pdfimages -j 2187855029.pdf pdf_im.
In here we see many black images that are ppm, but there are a number of ppm files that aren't black. In some cases the -j flag also produces a jpg for these images, but in this case it didn't do so for all of them.
I think that means that we need to be careful when running the pdf image dump code as it is currently written.
https://github.com/VisSieve/main/blob/1b09d0bcc7851eeb63524c4e730499eba59cb7ef/openalex_code/grabbers.py#L35 is the line that I think we need to update.
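One possible direction, sketched under assumptions: instead of deleting ppm files wholesale, only delete the ones that are entirely black. This is not a drop-in replacement for the linked line, since I'm not assuming anything about what grabbers.py does there. The function names are hypothetical, and it only handles binary PGM/PPM files (magic P5 or P6).

```python
from pathlib import Path


def read_pnm_pixels(path):
    """Return the raw pixel bytes of a binary PGM (P5) or PPM (P6) file."""
    data = Path(path).read_bytes()
    if data[:2] not in (b"P5", b"P6"):
        raise ValueError(f"{path} is not a binary PGM/PPM file")
    pos = 2
    fields = []  # width, height, maxval
    while len(fields) < 3:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":  # header comment runs to end of line
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        fields.append(int(data[start:pos]))
    # Exactly one whitespace byte separates the header from the pixel data.
    return data[pos + 1:]


def is_all_black(path):
    """True if the file has pixel data and every sample is zero."""
    pixels = read_pnm_pixels(path)
    return len(pixels) > 0 and max(pixels) == 0


def ppm_files_safe_to_delete(dump_dir):
    """List only the .ppm files in dump_dir that are completely black."""
    return sorted(p for p in Path(dump_dir).glob("*.ppm") if is_all_black(p))
```

With this kind of check, the non-black ppm files from the dump would survive, whether or not -j also produced a jpg for them.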
We are really pushing on this now. The plan is to use a text detector that finds figure sections, and then use logic to get the region of the pdf page that corresponds to the image.
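A minimal sketch of what the "logic" step could look like, with plenty of assumptions: captions start with "Figure N" or "Fig. N", captions sit below their figure, and the figure spans the caption's width. The text blocks are plain (x0, y0, x1, y1, text) tuples with y growing downward. The function name and the regex are made up for illustration and are not the detector we'd actually end up using.

```python
import re

# Assumed caption pattern; real papers will need more variants.
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.?)\s*\d+", re.IGNORECASE)


def figure_regions(blocks):
    """Guess one figure rectangle per caption block.

    blocks: (x0, y0, x1, y1, text) tuples with y growing downward.
    Returns a list of (x0, y0, x1, y1) rectangles.
    """
    regions = []
    for cx0, cy0, cx1, cy1, text in blocks:
        if not CAPTION_RE.match(text):
            continue
        # The figure is assumed to start where the nearest text block
        # above the caption ends, or at the page top if there is none.
        top = 0.0
        for bx0, by0, bx1, by1, _ in blocks:
            overlaps = bx0 < cx1 and bx1 > cx0
            if overlaps and by1 <= cy0:
                top = max(top, by1)
        if cy0 > top:
            regions.append((cx0, top, cx1, cy0))
    return regions
```

With PyMuPDF, the blocks could come from page.get_text("blocks"), which yields (x0, y0, x1, y1, text, block_no, block_type) tuples, so taking the first five fields of the text blocks should fit. Each region could then be rendered with page.get_pixmap(clip=fitz.Rect(*region)).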
Check whether there are other systems for the pdf image dumping.