VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

get_text_blocks bbox does not match the reality #142

Closed: homeant closed this issue 1 month ago

homeant commented 1 month ago

page bbox: [0, 0, 596, 842]

blocks:
[0.08315436472028694, 0.02482144679706057, 0.4591443362652055, 0.04149600502430968]
[0.7733221374102087, 0.9284518824054057, 0.9277516819486682, 0.9649548677820491]
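
The block values reported above all fall between 0 and 1, while the page bbox is in page units. Assuming (the thread does not confirm this) that the block values are fractions of the page width and height, a minimal sketch of mapping them back to page coordinates could look like the following. The helper name `to_page_coords` is hypothetical and not part of pdftext or marker.

```python
def to_page_coords(norm_bbox, page_bbox):
    """Scale a bbox given as page fractions into page coordinates.

    Assumption: norm_bbox is [x0, y0, x1, y1] with each value in 0..1,
    and page_bbox is [x0, y0, x1, y1] in page units.
    """
    px0, py0, px1, py1 = page_bbox
    width, height = px1 - px0, py1 - py0
    x0, y0, x1, y1 = norm_bbox
    return [
        px0 + x0 * width,
        py0 + y0 * height,
        px0 + x1 * width,
        py0 + y1 * height,
    ]


# Values copied from the report above.
page_bbox = [0, 0, 596, 842]
blocks = [
    [0.08315436472028694, 0.02482144679706057, 0.4591443362652055, 0.04149600502430968],
    [0.7733221374102087, 0.9284518824054057, 0.9277516819486682, 0.9649548677820491],
]

scaled = [to_page_coords(b, page_bbox) for b in blocks]
for bbox in scaled:
    print([round(v, 2) for v in bbox])
```

Under that assumption, the first block would start roughly 50 units from the left edge and 21 units from the top of the page.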

homeant commented 1 month ago

Resolved. The bounding box (bbox) could not be modified because a new dictionary was being recreated for the block, so changes to the bbox were lost.

fix: https://github.com/VikParuchuri/pdftext/pull/4
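
As a rough illustration of the kind of problem described above, the sketch below shows how writing a rescaled bbox into a newly created dictionary leaves the original block untouched, while writing into the existing dictionary does not. This is a hypothetical reconstruction for illustration only: the function names and the block layout (`{"bbox": [...]}`) are assumptions, not pdftext's actual code or either of the actual fixes.

```python
def rescale_into_copy(blocks, width, height):
    """Buggy pattern: the update goes to a fresh dict that is then discarded."""
    for block in blocks:
        fresh = dict(block)  # new dictionary, separate from `block`
        x0, y0, x1, y1 = fresh["bbox"]
        fresh["bbox"] = [x0 * width, y0 * height, x1 * width, y1 * height]
    return blocks  # still holds the original, unscaled bboxes


def rescale_in_place(blocks, width, height):
    """Working pattern: the update is written into the existing dict."""
    for block in blocks:
        x0, y0, x1, y1 = block["bbox"]
        block["bbox"] = [x0 * width, y0 * height, x1 * width, y1 * height]
    return blocks


blocks = [{"bbox": [0.5, 0.5, 1.0, 1.0]}]

rescale_into_copy(blocks, 596, 842)
print(blocks[0]["bbox"])  # unchanged: [0.5, 0.5, 1.0, 1.0]

rescale_in_place(blocks, 596, 842)
print(blocks[0]["bbox"])  # scaled: [298.0, 421.0, 596.0, 842.0]
```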

VikParuchuri commented 1 month ago

Thank you for raising this issue. I've fixed it in pdftext and marker, and will release the marker update soon. (Thanks for the pdftext fix, but I ended up fixing it in a slightly different way.)