As prep for a possible future table extraction feature, I was trying to see if I could make this code better at reproducing spacing, both horizontal and vertical.
The current algorithm (before these changes) doesn't really consider the spacing between texts before concatenating them. So if texts in the same row don't have a literal space glyph between them, but are instead spaced by placement that puts them at a distance from each other, the algorithm will probably treat them as the same word and won't put a space between them.
There's also no vertical spacing. Lines are just placed one after the other without considering the vertical distance between them. A better algorithm would make it easy to tell one paragraph from another.
So, this PR. It checks the spacing between text placements and between lines and, based on the font the line/text is in, determines how many spaces match the distance between those text placements. I reckon the results are not too bad, judging by what I'm seeing in testing.
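To make the idea concrete, here is a minimal sketch of the kind of calculation involved. It is not the code in this PR: the function names, the `(text, left, right)` placement tuples, and the round-to-nearest rule are all assumptions made for illustration. It assumes the width of a space in the current font and the line height are already known in the same units as the placement coordinates.

```python
import math


def spaces_for_gap(prev_right, next_left, space_width):
    """How many space characters best fill the horizontal gap (hypothetical helper)."""
    gap = next_left - prev_right
    if space_width <= 0 or gap <= 0:
        return 0
    # Round to the nearest whole space; gaps under half a space yield none,
    # so small kerning adjustments do not split a word.
    return math.floor(gap / space_width + 0.5)


def blank_lines_for_gap(prev_baseline, next_baseline, line_height):
    """How many empty lines to emit between two consecutive lines (hypothetical helper)."""
    if line_height <= 0:
        return 0
    distance = abs(prev_baseline - next_baseline)
    # One line height is the normal advance to the next line;
    # only the distance beyond that counts as extra vertical space.
    return max(0, math.floor(distance / line_height + 0.5) - 1)


def join_row(placements, space_width):
    """Concatenate (text, left, right) placements, sorted by left, padding gaps with spaces."""
    parts = []
    prev_right = None
    for text, left, right in placements:
        if prev_right is not None:
            parts.append(" " * spaces_for_gap(prev_right, left, space_width))
        parts.append(text)
        prev_right = right
    return "".join(parts)
```

For example, with a space width of 5 units, `join_row([("Name", 0, 20), ("Value", 40, 65)], 5)` pads the 20-unit gap with four spaces. Rounding to the nearest space is only one possible choice; a threshold-based rule would behave differently on narrow gaps.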
While implementing this I discovered a severe bug in the existing code: CID width reading was totally wrong, and had been all this time. As a result, chances are good that any text in a CID font got a width of 0 or some other irrelevant value. Anyway, that's sorted out now.
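For context, this is roughly how CID glyph widths are laid out in a PDF, shown as a sketch rather than as the actual fix. In a CID font dictionary the `W` array mixes two entry forms, `c [w1 w2 ...]` (consecutive CIDs starting at `c`) and `cfirst clast w` (one width for a whole range), and any CID not listed falls back to `DW`, which defaults to 1000. The function names below are hypothetical, and the `W` array is assumed to be already parsed into a plain Python list.

```python
def parse_cid_widths(w_array):
    """Expand a parsed /W array into a {cid: width} dict (hypothetical helper)."""
    widths = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        second = w_array[i + 1]
        if isinstance(second, list):
            # Form: c [w1 w2 ... wn] -> widths for CIDs c, c+1, ..., c+n-1
            for offset, width in enumerate(second):
                widths[first + offset] = width
            i += 2
        else:
            # Form: cfirst clast w -> the same width for every CID in the range
            width = w_array[i + 2]
            for cid in range(first, second + 1):
                widths[cid] = width
            i += 3
    return widths


def cid_width(widths, cid, default_width=1000):
    """Width of a CID in glyph-space units (1/1000 of text space), falling back to /DW."""
    return widths.get(cid, default_width)
```

With the array `[120, [400, 325, 500], 7080, 8032, 1000]`, CIDs 120 to 122 get widths 400, 325 and 500, and CIDs 7080 to 8032 all get 1000. The sketch does no validation, so a truncated array raises `IndexError`.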
I didn't do initial indentation. Text can be indented within a line, and one could add spaces at the beginning of the line (respecting BIDI text direction), but I left this for another time and stuck to spacing between lines and between words in a line.
Adding spaces is on by default, but it can be turned off (or limited to vertical or horizontal only) by providing the relevant CLI argument (see the readme/CLI help).