HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License
428 stars, 90 forks

Embed Base64-Encoded Images Inline #99

Closed HiromuHota closed 3 years ago

HiromuHota commented 3 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #88

Does your pull request fix any issue?

Close #88

Description of the proposed changes

Embed base64-encoded images inline. Initial support covers JPEG and BMP.

Test plan

Apply pdftotree to PDFs and check that JPEG and BMP images are extracted.


codecov-io commented 3 years ago

Codecov Report

Merging #99 into master will increase coverage by 0.42%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
+ Coverage   66.40%   66.83%   +0.42%     
==========================================
  Files          23       24       +1     
  Lines        2566     2593      +27     
==========================================
+ Hits         1704     1733      +29     
+ Misses        862      860       -2     
Flag          Coverage Δ
#unittests    66.83% <100.00%> (+0.42%) :arrow_up:


Impacted Files                        Coverage Δ
pdftotree/TreeExtract.py              89.53% <100.00%> (+0.79%) :arrow_up:
pdftotree/ml/features.py              64.58% <100.00%> (ø)
pdftotree/utils/bbox_utils.py         21.91% <100.00%> (+4.52%) :arrow_up:
pdftotree/utils/pdf/node.py           45.39% <100.00%> (+0.39%) :arrow_up:
pdftotree/utils/pdf/pdf_parsers.py    92.61% <100.00%> (+0.11%) :arrow_up:
tests/test_figures.py                 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d4e5dc8...1cfa226.

lukehsiao commented 3 years ago

What happens today for a PNG?

Adobe goes with the alternative you mention in the issue (saving it as a file, and pointing to that instead), which seems to work fairly well.

lukehsiao commented 3 years ago

Ah, I see now that it looks like you can just specify the MIME type, so we should be able to support a wide variety of image formats this way, right? Basically, I want to check that this approach will support a wide variety of images.

HiromuHota commented 3 years ago

> Ah, I see now that it looks like you can just specify the MIME type, so we should be able to support a wide variety of image formats this way, right? Basically, I want to check that this approach will support a wide variety of images.

Yes, this approach can support a wide variety of image types (see https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs). JPEG and BMP are just the initial set of supported types. We will add PNG and other types once we have example PDFs containing those image types.

I like the fact that hOCR with embedded images is more portable than hOCR with multiple images. Please let me know your thoughts and experiences.
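For illustration, here is a minimal sketch of how an image can be embedded inline by specifying its MIME type in a data URI. It is not the code in this PR: the names `MIME_TYPES`, `to_data_uri`, and `to_img_tag` are hypothetical, and returning `None` for an unmapped image type is an assumption made for the sketch.

```python
import base64

# Hypothetical mapping from a detected image type to its MIME type.
# JPEG and BMP mirror the initial set discussed here; supporting another
# format would amount to adding an entry.
MIME_TYPES = {
    "jpeg": "image/jpeg",
    "jpg": "image/jpeg",
    "bmp": "image/bmp",
}


def to_data_uri(image_bytes, image_type):
    """Return a base64 data URI, or None if the image type is not mapped."""
    mime = MIME_TYPES.get(image_type.lower()) if image_type else None
    if mime is None:
        return None
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def to_img_tag(image_bytes, image_type):
    """Return an <img> element with the image embedded inline, or None."""
    uri = to_data_uri(image_bytes, image_type)
    if uri is None:
        return None
    return f'<img src="{uri}" />'
```

Because the image format is carried entirely by the MIME type prefix, the same encoding path works for any format a browser can render.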

lukehsiao commented 3 years ago

> I like the fact that hOCR with embedded images is more portable than hOCR with multiple images. Please let me know your thoughts and experiences.

I don't know about portable vs not portable, but I think embedding them seems just fine. I assume I could download the embedded image as a file anyways, right?

HiromuHota commented 3 years ago

@lukehsiao Yes, you can open an exported hOCR file and save an embedded image as a separate file. In the future, pdftotree might offer an option like --embed-image <0|1> (default: 1), as in pdf2htmlEX (https://github.com/coolwanglu/pdf2htmlEX/wiki/Command-Line-Options). With such an option, you could control whether images are embedded or saved as separate files.
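Besides saving from a browser, an embedded image can also be recovered programmatically. The following is a rough sketch, assuming the images appear in the hOCR output as base64 data URIs. The regular expression and the names `extract_embedded_images` and `save_images` are hypothetical and not part of pdftotree.

```python
import base64
import re

# Matches base64 data URIs such as data:image/jpeg;base64,....
DATA_URI_RE = re.compile(
    r"data:(?P<mime>[\w.+-]+/[\w.+-]+);base64,(?P<data>[A-Za-z0-9+/=]+)"
)

# Hypothetical file extensions for the image types discussed in this thread.
EXTENSIONS = {"image/jpeg": "jpg", "image/bmp": "bmp"}


def extract_embedded_images(hocr_text):
    """Return a list of (mime_type, raw_bytes) for each embedded image."""
    return [
        (m.group("mime"), base64.b64decode(m.group("data")))
        for m in DATA_URI_RE.finditer(hocr_text)
    ]


def save_images(images, prefix="figure"):
    """Write each image to its own file and return the file paths."""
    paths = []
    for index, (mime, data) in enumerate(images):
        path = f"{prefix}_{index}.{EXTENSIONS.get(mime, 'bin')}"
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    return paths
```

For example, `save_images(extract_embedded_images(open("out.hocr").read()))` would write one file per embedded image, where `out.hocr` stands for any exported file.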

lukehsiao commented 3 years ago

Sounds great. Feel free to merge whenever you'd like.

HiromuHota commented 3 years ago

For future reference, this PR relies on pdfminer.six for image type detection, which sometimes cannot detect an image type. This PR embeds an image only when its type is detected and does not embed images of unknown type. To detect image types in a broader range of cases, we have to