GaloisInc / daedalus

The Daedalus data description language
BSD 3-Clause "New" or "Revised" License

optimize text extraction #165

Open wrharris opened 3 years ago

wrharris commented 3 years ago

Currently, text extraction adds roughly 10x overhead to parsing a PDF. To optimize it, we can:

  1. generate a C++ parser, which may require adding support for primitives that are not supported yet;
  2. optimize the content stream parser to avoid duplicating fonts;
  3. optimize the text extractor to avoid recomputing Unicode lookup tables for each text-showing operation.

yav commented 3 years ago

Another thing to do would be to review the source DDL for optimization opportunities. In particular, things like creating a separate little structure for each individual character sound extremely expensive.

wrharris commented 3 years ago

Agreed; to clarify, items (2) and (3) are both issues with inefficient DaeDaLus code.
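
To illustrate item (3), here is a minimal C++ sketch of building a font's Unicode lookup table once and reusing it across text-showing operations. It is not the project's code: `UnicodeTable`, `buildUnicodeTable`, `TableCache`, and `showText` are hypothetical names, and the table construction is a placeholder for whatever expensive work the real extractor performs.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a font's character-code-to-Unicode table.
using UnicodeTable = std::unordered_map<std::uint32_t, char32_t>;

// Counts table builds so the effect of caching can be observed.
static int g_tableBuilds = 0;

// Placeholder for the expensive step (assumed here to be something like
// decoding a font's ToUnicode data). It simply maps codes 0..255 to themselves.
UnicodeTable buildUnicodeTable(const std::string& fontName) {
    (void)fontName;  // unused in this placeholder
    ++g_tableBuilds;
    UnicodeTable table;
    for (std::uint32_t code = 0; code < 256; ++code) {
        table[code] = static_cast<char32_t>(code);
    }
    return table;
}

// Builds each font's table on first use and returns the stored copy afterwards.
class TableCache {
public:
    const UnicodeTable& get(const std::string& fontName) {
        auto it = tables_.find(fontName);
        if (it == tables_.end()) {
            it = tables_.emplace(fontName, buildUnicodeTable(fontName)).first;
        }
        return it->second;
    }

private:
    std::map<std::string, UnicodeTable> tables_;
};

// One text-showing operation: decode each byte through the cached table.
// Codes missing from the table become U+FFFD (replacement character).
std::u32string showText(TableCache& cache, const std::string& fontName,
                        const std::string& bytes) {
    const UnicodeTable& table = cache.get(fontName);
    std::u32string out;
    for (unsigned char b : bytes) {
        auto it = table.find(b);
        out.push_back(it != table.end() ? it->second : U'\uFFFD');
    }
    return out;
}
```

With this shape, the cost of building a table is paid once per font rather than once per text-showing operation; the same idea would apply to the DDL source if the table were computed when the font is resolved and passed along.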

Another small inefficiency: the page tree parser currently tries to parse each child as one kind (page or page tree node) and, on failure, backtracks completely to try the other kind. This probably doesn't matter much, though, since the wrong kind is always rejected quickly, right after parsing the Type field.
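
A rough C++ sketch of the alternative, which inspects the Type field first and then commits to a single branch, so no backtracking is needed. The `Dict`, `Page`, `PageTree`, and `parseKid` names are hypothetical, and the dictionary is assumed to be already decoded into string keys and values; the actual parser is written in DDL and will look different.

```cpp
#include <map>
#include <optional>
#include <string>
#include <variant>

// Hypothetical, already-decoded PDF dictionary: key -> value as text.
using Dict = std::map<std::string, std::string>;

struct Page     { std::string contents; };  // leaf node (Type is Page)
struct PageTree { int count; };             // interior node (Type is Pages)
using Kid = std::variant<Page, PageTree>;

// Read Type once, then parse only the matching kind of child.
// Returns nullopt for a missing/unknown Type or a missing required field.
std::optional<Kid> parseKid(const Dict& d) {
    auto type = d.find("Type");
    if (type == d.end()) return std::nullopt;

    if (type->second == "Page") {
        auto c = d.find("Contents");  // optional for a page
        return Kid{Page{c != d.end() ? c->second : std::string{}}};
    }
    if (type->second == "Pages") {
        auto n = d.find("Count");     // required for a page tree node
        if (n == d.end()) return std::nullopt;
        // std::stoi throws on non-numeric text; a real parser would handle that.
        return Kid{PageTree{std::stoi(n->second)}};
    }
    return std::nullopt;
}
```

As noted above, the gain is likely small, because the backtracking version also fails right after reading Type; the dispatch form mainly avoids re-reading the start of each child.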

wrharris commented 3 years ago

Update: this would still be nice to investigate, but is relatively low priority given that performance of text extraction is not currently of outside interest.

wrharris commented 2 years ago

Update: text extraction has now been streamlined and is implemented in C++.

In terms of issues to address, there are three main challenge areas in the project evaluation: