bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Regression -- mangled text #3

Closed wwaites closed 5 years ago

wwaites commented 5 years ago

When I pass my PhD thesis through pdf-extract, the text comes out mangled, with words dropped:

    <p id="page14c1p15" style="top:313.61838pt;left:113.386pt;width:412.0202pt;height:11.687958pt;">
            <span id="page14c1p15l1" class="line h15" style="top:313.61838pt;left:113.386pt;width:411.0202pt;height:11.687958pt;">First synthesised 140 years ago, paracetamol has a relatively simple</span>
    </p>
    <p id="page14c1p16" style="top:332.8314pt;left:113.386pt;width:412.01855pt;height:11.687958pt;">
            <span id="page14c1p16l1" class="line h15" style="top:332.8314pt;left:113.386pt;width:411.01855pt;height:11.687958pt;">structure and has been the subject of innumerable scientific studies, yet its</span>
    </p>
    <p id="page14c1p17" style="top:352.0444pt;left:113.386pt;width:412.22757pt;height:11.687958pt;">
            <span id="page14c1p17l1" class="line h15" style="top:352.0444pt;left:113.386pt;width:411.22757pt;height:11.687958pt;">of action, the nature of the causal relationship between ingesting the drug</span>
    </p>
    <p id="page14c1p18" style="top:371.2574pt;left:113.386pt;width:414.01184pt;height:11.687958pt;">
            <span id="page14c1p18l1" class="line h26" style="top:371.2574pt;left:113.386pt;width:413.01184pt;height:11.687958pt;">is not well understood. There is consensus that it has an anti-inflammatory</span>
    </p>

Missing words: "chemical", "mechanism", "and its". I am trying to cut the document down to a minimal example that reproduces the problem, so that I don't have to add the whole thesis to the repository for a regression test.
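For illustration, a regression check along these lines could look roughly like the sketch below. It is not part of pdf-extract: the `LineTextParser` class and the `missing_words` helper are hypothetical names, and the only assumption taken from the output above is that each extracted line is a `<span>` whose class list contains `line`. It uses only the Python standard library.

```python
from html.parser import HTMLParser


class LineTextParser(HTMLParser):
    """Collect the text of every <span class="line ..."> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._in_line = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "line" in classes:
            self._in_line = True
            self.lines.append("")

    def handle_endtag(self, tag):
        # Assumes line spans are not nested, as in the output shown above.
        if tag == "span":
            self._in_line = False

    def handle_data(self, data):
        if self._in_line:
            self.lines[-1] += data


def missing_words(html, expected):
    """Return the expected words that do not occur in the extracted line text."""
    parser = LineTextParser()
    parser.feed(html)
    parser.close()
    # Join the lines and normalise whitespace before searching.
    text = " ".join(" ".join(parser.lines).split())
    return [word for word in expected if word not in text]


# Two lines copied from the mangled output (style attributes omitted).
SAMPLE = """
<p id="page14c1p15"><span id="page14c1p15l1" class="line h15">First synthesised 140 years ago, paracetamol has a relatively simple</span></p>
<p id="page14c1p16"><span id="page14c1p16l1" class="line h15">structure and has been the subject of innumerable scientific studies, yet its</span></p>
"""

if __name__ == "__main__":
    print(missing_words(SAMPLE, ["paracetamol", "chemical", "mechanism"]))
```

A test built on this would fail while the returned list is non-empty, and would only need the expected words and a small PDF, not the whole thesis.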

dionwiggins commented 5 years ago

Can you please send the document you ran, so we can check how the zones are laid out and whether they overlap? The missing words may have been cropped out by column detection.

dionwiggins commented 5 years ago

Following up on this: can you please provide the PDF so that we can trace the issue?

wwaites commented 5 years ago

Acknowledged. I'm trying to cut it down to a minimal example, so far without any luck. You can get the doc at

https://tardis.ed.ac.uk/~wwaites/2019/06/thesis-final.pdf

but please don't check it into the repository.

dionwiggins commented 5 years ago

Noted. I will have the engineer review it and try to reproduce the problem.

dionwiggins commented 5 years ago

Resolved.