wrharris opened 3 years ago
Another thing to do would be to do some code review on the source DDL to see if there are options there. In particular, things like creating a separate little structure for each individual character sound extremely expensive.
Agreed; to clarify, items (2) and (3) are both issues with inefficient DaeDaLus code.
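To make the cost concrete, here is a small C++ sketch (names hypothetical, not from the actual source) contrasting a per-character heap allocation, as the DDL pattern above would compile down to, with accumulation into one contiguous buffer:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical shape of the expensive pattern: one heap-allocated
// structure per decoded character.
struct CharNode {
    char value;
    // position/width metadata would live here in a real extractor
};

// One allocation per character: the pattern suspected to be costly.
std::string extract_per_char(const std::string& raw) {
    std::vector<std::unique_ptr<CharNode>> nodes;
    nodes.reserve(raw.size());
    for (char c : raw)
        nodes.push_back(std::make_unique<CharNode>(CharNode{c}));
    std::string out;
    for (const auto& n : nodes) out.push_back(n->value);
    return out;
}

// Cheaper alternative: a single contiguous buffer, reserved up front.
std::string extract_buffered(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char c : raw) out.push_back(c);
    return out;
}
```

Both produce the same text; the difference is purely allocation traffic, which is the kind of thing a DDL-level code review could eliminate.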
Another small inefficiency: the page tree parser currently tries to parse each kind of child (page or page tree node) and, on failure, backtracks completely to try the other type. That said, this probably doesn't matter much, since the wrong alternative is rejected quickly, right after parsing the Type field.
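A toy C++ sketch of the two strategies (types and names are illustrative, not from the actual parser): the backtracking style attempts one alternative and falls back on failure, while the committed style peeks at the Type field once and dispatches:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Toy model of a page-tree child: a dictionary whose /Type is either
// "Page" (leaf) or "Pages" (interior node).
struct Child { std::string type; };

std::optional<std::string> try_parse_page(const Child& c) {
    // In the real parser the /Type check happens early, which is why
    // the backtrack is cheap in practice.
    if (c.type != "Page") return std::nullopt;  // fail -> caller backtracks
    return "leaf";
}

std::optional<std::string> try_parse_pages(const Child& c) {
    if (c.type != "Pages") return std::nullopt;
    return "interior";
}

// Current style: try one alternative, backtrack and try the other.
std::string parse_backtracking(const Child& c) {
    if (auto p = try_parse_page(c)) return *p;
    return *try_parse_pages(c);  // assumes well-formed input
}

// Alternative: read /Type once and commit, with no backtracking.
std::string parse_committed(const Child& c) {
    return c.type == "Page" ? "leaf" : "interior";
}
```

Since both reject the wrong alternative at the Type field, the win from committing is likely small, which matches the "low priority" assessment below.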
Update: this would still be nice to investigate, but is relatively low priority given that performance of text extraction is not currently of outside interest.
Update: text extraction has now been streamlined to be implemented in C++.
In terms of issues to address, there are three main challenge areas in the project evaluation: ReverseChars, ActualText, and the Direction field (of 147). But it's unclear to me that this was true, and a small real-world example of Arabic (add to the repo: arabic-deflate.pdf) doesn't indicate that these features are used.
Currently, text extraction adds roughly 10x overhead to parsing a PDF. To optimize it, we can: