DS4SD / docling-parse

Simple package to extract text with coordinates from programmatic PDFs
MIT License

Adding a new optimised v2 parser #30

staar closed this issue 1 month ago

staar commented 1 month ago

The current v1 parser is slow at parsing documents, primarily because it relies heavily on inheritance. In the new v2 parser, we will remove the inheritance to improve parsing speed.

In addition, we would like to:

  1. dynamically load the fonts (only load if necessary)
  2. remove any inheritance
  3. have the capability to switch between snippet and word level
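
To make items 1 and 3 concrete, here is a minimal Python sketch of what on-demand font loading and a snippet/word switch could look like. It is not the docling-parse implementation or API: `Cell`, `LazyFontCache`, `split_into_words`, and `extract` are hypothetical names, and the word boxes assume every character in a snippet has the same width, which a real parser would replace with per-glyph font metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Cell:
    """A piece of text with its bounding box (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class LazyFontCache:
    """Loads a font only the first time it is requested (hypothetical helper)."""

    def __init__(self, loader: Callable[[str], dict]) -> None:
        self._loader = loader
        self._fonts: Dict[str, dict] = {}
        self.load_count = 0  # number of times the loader actually ran

    def get(self, name: str) -> dict:
        if name not in self._fonts:
            # First request for this font: load it and remember the result.
            self._fonts[name] = self._loader(name)
            self.load_count += 1
        return self._fonts[name]


def split_into_words(snippet: Cell) -> List[Cell]:
    """Split a snippet into word cells.

    Assumption: all characters share one width, so x-coordinates are
    interpolated linearly over the snippet's bounding box.
    """
    if not snippet.text:
        return []
    char_width = (snippet.x1 - snippet.x0) / len(snippet.text)
    words: List[Cell] = []
    pos = 0  # character offset of the current word within the snippet
    for word in snippet.text.split(" "):
        if word:
            x0 = snippet.x0 + pos * char_width
            x1 = x0 + len(word) * char_width
            words.append(Cell(word, x0, snippet.y0, x1, snippet.y1))
        pos += len(word) + 1  # +1 accounts for the separating space
    return words


def extract(snippets: List[Cell], level: str = "snippet") -> List[Cell]:
    """Return cells at the requested granularity: 'snippet' or 'word'."""
    if level == "snippet":
        return list(snippets)
    if level == "word":
        return [w for s in snippets for w in split_into_words(s)]
    raise ValueError(f"unknown level: {level!r}")
```

With this shape, a document that never references a font never pays for loading it, and the caller picks the granularity per call instead of the parser fixing it up front.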

PeterStaar-IBM commented 1 month ago

Integrated into main with the latest PR.