VisSieve / main

https://vissieve.github.io/main/documentation/site

investigate alternatives for pdf image parsing #6

Closed DevinBayly closed 1 week ago

DevinBayly commented 1 month ago

Check whether there are other systems for dumping images from PDFs.

DevinBayly commented 1 month ago

After spending time with pymupdf, I think the issue here is that many figures are treated as graphics rather than images, so they don't get picked up.
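One way to check this on a given PDF is to count both kinds of content per page. The snippet below is a minimal sketch, assuming PyMuPDF's page methods `get_images` and `get_drawings`; the helper names `count_page_content` and `summarize_pdf` are hypothetical and not part of the VisSieve code.

```python
from typing import Any, Dict, List


def count_page_content(page: Any) -> Dict[str, int]:
    """Count embedded raster images and vector drawing paths on one page.

    `page` is expected to behave like a PyMuPDF page object.
    """
    return {
        "images": len(page.get_images(full=True)),
        "drawings": len(page.get_drawings()),
    }


def summarize_pdf(path: str) -> List[Dict[str, int]]:
    """Hypothetical helper: per-page counts for a PDF on disk."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no hard dependency

    doc = fitz.open(path)
    try:
        return [count_page_content(page) for page in doc]
    finally:
        doc.close()
```

A page that reports zero images but many drawings would be consistent with a figure that an image-only dump never sees.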

DevinBayly commented 1 month ago

I've spent a bunch of time on this. I think pymupdf is our best alternative, but it is still a complex program. Its main benefit is that it will let us pull graphics from the pages, though I feel these are mostly diagrams.

It looks like Ben's pdf grabber code goes through and deletes almost all of the ppm files that come from the pdfimages tool. Sometimes this is the right move, but in other cases it tosses out results that we want to keep.

See this screenshot of the result of running pdfimages -j 2187855029.pdf pdf_im.

Here we see many black images that are ppm files, but also a number that aren't black. In some cases the -j flag also produces a jpg for these images, but in this case it didn't do so for all of them.

I think that means that we need to be careful when running the pdf image dump code as it is currently written.
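A more careful cleanup could inspect each ppm file before deleting it. The sketch below is not the existing grabber code; it assumes the files are binary PGM/PPM (magic `P5` or `P6`), and the helper names are hypothetical. It only reports all-black files and deletes nothing.

```python
from pathlib import Path
from typing import List, Tuple


def _read_header(raw: bytes) -> Tuple[bytes, int, int, int, int]:
    """Return (magic, width, height, maxval, offset of the pixel data)."""
    tokens = []
    pos = 0
    while len(tokens) < 4:
        # Skip whitespace between header tokens.
        while pos < len(raw) and raw[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(raw):
            raise ValueError("truncated header")
        if raw[pos:pos + 1] == b"#":
            # Header comments run to the end of the line.
            while pos < len(raw) and raw[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < len(raw) and not raw[pos:pos + 1].isspace():
            pos += 1
        tokens.append(raw[start:pos])
    # A single whitespace byte separates the header from the pixel data.
    return tokens[0], int(tokens[1]), int(tokens[2]), int(tokens[3]), pos + 1


def is_all_black(path) -> bool:
    """True if every sample in a binary PGM/PPM file is zero."""
    raw = Path(path).read_bytes()
    magic, width, height, maxval, offset = _read_header(raw)
    if magic not in (b"P5", b"P6"):
        raise ValueError(f"unsupported format: {magic!r}")
    channels = 3 if magic == b"P6" else 1
    bytes_per_sample = 1 if maxval < 256 else 2
    expected = width * height * channels * bytes_per_sample
    data = raw[offset:offset + expected]
    if len(data) < expected:
        raise ValueError("truncated pixel data")
    return not any(data)


def black_ppm_files(directory) -> List[Path]:
    """List the .ppm files in `directory` that are entirely black."""
    return sorted(p for p in Path(directory).glob("*.ppm") if is_all_black(p))
```

This only flags candidates. Whether an all-black file is really disposable is still an assumption that should be checked against a few PDFs by hand.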

DevinBayly commented 1 month ago

https://github.com/VisSieve/main/blob/1b09d0bcc7851eeb63524c4e730499eba59cb7ef/openalex_code/grabbers.py#L35 is the line that I think we need to update.

DevinBayly commented 1 week ago

We are really pushing on this now, using a text detector that finds figure sections and then logic to get the region of the pdf page that corresponds to the image.
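The region logic could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual detector. It assumes text blocks arrive as (x0, y0, x1, y1, text) tuples with y increasing downward (PyMuPDF's `page.get_text("blocks")` returns tuples that start with those five fields). It also assumes captions begin with "Figure N" or "Fig. N" and sit directly below their figure. The name `figure_regions` is hypothetical.

```python
import re
from typing import List, Tuple

Block = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text
Box = Tuple[float, float, float, float]         # x0, y0, x1, y1

# Assumed caption pattern; real captions may need a broader expression.
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.?)\s*\d+", re.IGNORECASE)


def figure_regions(blocks: List[Block], page_top: float = 0.0) -> List[Box]:
    """Guess the page region above each figure caption.

    The region spans the caption's width and runs from the bottom of the
    nearest overlapping text block above the caption down to the caption top.
    """
    regions = []
    for cap in blocks:
        cx0, cy0, cx1, _cy1, text = cap
        if not CAPTION_RE.match(text):
            continue
        top = page_top
        for other in blocks:
            if other is cap:
                continue
            ox0, _oy0, ox1, oy1, _ = other
            overlaps_horizontally = ox0 < cx1 and ox1 > cx0
            if overlaps_horizontally and oy1 <= cy0:
                top = max(top, oy1)
        if top < cy0:
            regions.append((cx0, top, cx1, cy0))
    return regions
```

With PyMuPDF, each returned box could then be rendered with `page.get_pixmap(clip=...)`. Text inside the figure itself, such as axis labels, would cut the region short in this simple version, so a real implementation would need to filter those blocks.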