coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

--correct-text-visibility hiding too much text? #405

Open davidhedley opened 10 years ago

davidhedley commented 10 years ago

In the following test case, using "--correct-text-visibility 1" pushes text to the background layer even though that text is not obscured.

Test case here: http://download.vistair.com/pdf2htmlEX/Page-241fromCFE-E190SR-B4.pdf

Version info:
Copyright 2012-2014 Lu Wang coolwanglu@gmail.com and other contributors
Libraries:
  poppler 0.26.3
  libfontforge 20140801
  cairo 1.13.1
Default data-dir: /usr/local/share/pdf2htmlEX
Supported image format: png jpg svg

duanyao commented 10 years ago

This is a known limitation: we use the bounding boxes of chars, images, and paths to evaluate chars' visibility. This is simple and fast, but sometimes too conservative. In your PDF, there are some non-rectangular paths. They don't overlap any text, but their bounding boxes do, so the overlapped text is treated as if it were obscured.
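
To illustrate the false positive described above, here is a minimal sketch in Python. It is not pdf2htmlEX's actual code: the `Box` class, the helper names, and the coordinates are all hypothetical. It shows a triangular path whose bounding box overlaps a glyph's bounding box even though the triangle itself does not touch the glyph. The corner test used for the "real" overlap happens to be enough for this example, but it is not a general polygon-overlap test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical helper, not from pdf2htmlEX)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Two boxes overlap when they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)

    def corners(self):
        return [(self.x0, self.y0), (self.x1, self.y0),
                (self.x1, self.y1), (self.x0, self.y1)]


def bounding_box(points) -> Box:
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


def point_in_triangle(p, a, b, c) -> bool:
    # Sign test: p is inside (or on the edge) when the three cross
    # products do not have mixed signs.
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    d = [cross(a, b, p), cross(b, c, p), cross(c, a, p)]
    return not (any(v < 0 for v in d) and any(v > 0 for v in d))


# A filled right triangle occupying the lower-left half of a 10x10 square.
triangle = [(0, 0), (10, 0), (0, 10)]
# A glyph sitting in the upper-right corner, outside the triangle.
glyph = Box(7, 7, 9, 9)

# Bounding-box test: fast, but reports the glyph as covered.
hidden_by_bbox = bounding_box(triangle).intersects(glyph)

# Shape-aware test (corner check only, sufficient for this example).
hidden_by_shape = any(point_in_triangle(c, *triangle) for c in glyph.corners())

print(hidden_by_bbox, hidden_by_shape)
```

In this sketch the bounding-box test flags the glyph while the shape-aware test does not, which is the kind of mismatch that would send unobscured text to the background layer.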