optimize text extraction

wrharris commented 3 years ago

Currently, text extraction adds roughly 10x overhead to parsing a PDF. To optimize it, we can:

(1) generate a C++ parser, which may require adding support for any primitives that are not supported already;
(2) optimize the content stream parser to avoid duplicating fonts;
(3) optimize the text extractor to avoid recomputing Unicode lookup tables for each text-showing operation.
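
As a rough illustration of the last item, the lookup table could be built once per font and reused across text-showing operations. This is a minimal sketch, not code from the repository: `FontId`, `UnicodeTable`, and `UnicodeTableCache` are hypothetical names, and the table is assumed to map character codes to Unicode strings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical types for illustration; none of these names come from the DaeDaLus sources.
using FontId = std::string;  // e.g. the font's resource name
using UnicodeTable = std::unordered_map<std::uint32_t, std::u32string>;  // char code -> Unicode text

class UnicodeTableCache {
public:
    // Returns the table for `font`, calling `build` only the first time the font is seen.
    template <typename Builder>
    const UnicodeTable& get(const FontId& font, Builder&& build) {
        auto it = tables_.find(font);
        if (it == tables_.end()) {
            ++builds_;
            it = tables_.emplace(font, build(font)).first;
        }
        // References into std::unordered_map stay valid when later insertions rehash.
        return it->second;
    }

    // Number of times a table was actually constructed (useful for measuring the saving).
    std::size_t builds() const { return builds_; }

private:
    std::unordered_map<FontId, UnicodeTable> tables_;
    std::size_t builds_ = 0;
};
```

With a cache like this, each text-showing operation would call `get` with the current font, so the cost of building the table is paid once per font rather than once per operation.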

yav commented 3 years ago

Another thing to do would be some code review of the source DDL to see if there are options there. In particular, things like creating a separate little structure for each individual character sound extremely expensive.

wrharris commented 3 years ago

Agreed; to clarify, items (2) and (3) are both issues with inefficient DaeDaLus code.

Another small inefficiency: the page tree parser currently tries to parse each child as one kind (page or page tree node) and, on failure, backtracks completely to try the other kind. This probably doesn't matter too much, though, given that it will always reject the wrong kind of child quickly, right after parsing the Type field.
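
For reference, the alternative to backtracking would be to read the Type field once and dispatch on it. The sketch below is only an illustration under simplifying assumptions: `Dict`, `NodeKind`, and `classifyChild` are hypothetical, and a child dictionary is modeled as a flat map from names to strings. The values `Page` and `Pages` are the Type names the PDF standard uses for page objects and page tree nodes.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for a parsed PDF dictionary.
using Dict = std::map<std::string, std::string>;

enum class NodeKind { Page, PageTreeNode };

// Look at /Type once and pick the parser to run, instead of running one
// parser to failure and then backtracking to try the other.
std::optional<NodeKind> classifyChild(const Dict& d) {
    auto it = d.find("Type");
    if (it == d.end()) return std::nullopt;          // no Type entry: neither kind
    if (it->second == "Page") return NodeKind::Page;
    if (it->second == "Pages") return NodeKind::PageTreeNode;
    return std::nullopt;                             // some other object type
}
```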

wrharris commented 3 years ago

Update: this would still be nice to investigate, but is relatively low priority given that performance of text extraction is not currently of outside interest.

wrharris commented 2 years ago

Update: text extraction has now been streamlined and is implemented in C++.

In terms of issues to address, there are three main challenge areas in the project evaluation:

extract text in left-to-right languages: this basically evaluates the baseline algorithm
extract text in right-to-left languages: the standard lists some constructs that could feasibly be used for extracting such text (namely, the Content Stream operations ReverseChars and ActualText and the Direction field of 147), but it's unclear to me that they are used in practice, and a small real-world example of Arabic (added to the repo, arabic-deflate.pdf) doesn't indicate that they are.
include spaces: this can be endlessly fine-tuned, but we can perhaps get satisfying results by injecting whitespace based on a quick scan of the text-positioning operators.
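
To make the whitespace idea concrete, here is a minimal sketch of gap-based injection. It assumes the text-positioning operators have already been resolved into per-run coordinates; `TextRun`, `joinWithSpaces`, and both threshold factors are hypothetical choices for illustration, not values taken from the extractor.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one shown run of text, with positions already
// computed from the text-positioning operators.
struct TextRun {
    std::string text;
    double x_start;    // left edge of the run
    double x_end;      // right edge of the run
    double y;          // baseline
    double font_size;
};

// Join runs in content-stream order, inserting a newline when the baseline
// moves and a space when the horizontal gap is large relative to the font size.
// Both factors are guesses that would need tuning on real documents.
std::string joinWithSpaces(const std::vector<TextRun>& runs,
                           double gap_factor = 0.25,
                           double line_factor = 0.5) {
    std::string out;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        if (i > 0) {
            const TextRun& prev = runs[i - 1];
            const TextRun& cur = runs[i];
            if (std::abs(cur.y - prev.y) > line_factor * cur.font_size) {
                out += '\n';
            } else if (cur.x_start - prev.x_end > gap_factor * cur.font_size) {
                out += ' ';
            }
        }
        out += runs[i].text;
    }
    return out;
}
```

This only handles left-to-right runs on horizontal baselines; it is meant as a starting point for the "quick scan" approach, not as a tuned heuristic.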

GaloisInc / daedalus

optimize text extraction #165