docwire / docwire

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
https://docwire.io

Text within text boxes in RTF and HTML documents is not converted #151

Open efieleke-tausight opened 1 day ago

efieleke-tausight commented 1 day ago

This is inconsistent with other formats, such as docx, where docwire includes such text. I have attached example .docx and .rtf files that contain the same content, yet docwire returns empty text for the RTF file. I've also attached an HTML file (with its supporting subfolder) that has the same content.

05co1120sfige.zip

as-ascii commented 20 hours ago

This HTML file does not look like a standard HTML file: a lot of probably meaningful data is embedded inside HTML comments. It needs further analysis.
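
To make the "data inside HTML comments" observation easier to check, here is a minimal diagnostic sketch in C++. It is not DocWire code: `html_comment_bodies` is a hypothetical helper that only collects the raw text between `<!--` and `-->`, so the bodies can be inspected for markup. It does not parse HTML and does not handle comment-like sequences inside scripts or CDATA.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of DocWire): collect the bodies of all
// <!-- ... --> comments found in an HTML string, in document order.
// An unterminated comment at the end of the input is ignored.
std::vector<std::string> html_comment_bodies(std::string_view html)
{
    std::vector<std::string> bodies;
    std::size_t pos = 0;
    while ((pos = html.find("<!--", pos)) != std::string_view::npos)
    {
        const std::size_t start = pos + 4;               // first char after "<!--"
        const std::size_t end = html.find("-->", start); // start of the closing "-->"
        if (end == std::string_view::npos)
            break;
        bodies.emplace_back(html.substr(start, end - start));
        pos = end + 3;                                   // continue after "-->"
    }
    return bodies;
}
```

If the attached file was saved from Word (an assumption, not confirmed in this thread), some of the returned bodies may begin with a conditional marker such as `[if gte vml 1]` and contain the shape markup that a regular HTML parser skips.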

The RTF file needs further analysis as well, because the text of the graph labels cannot be found directly in the RTF file. The labels are probably embedded in an additional subformat, possibly the one called "Word 97 RTF for Drawing Objects (Shapes)".
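
One way to test the drawing-objects hypothesis is to count the shape-related control words in the raw RTF. The sketch below is a hedged diagnostic, not DocWire code: `count_control_word` is a hypothetical helper, and the control words of interest (`shp`, `shpinst`, `shptxt`) are the ones the RTF specification defines for drawing objects. Whether the attached file actually stores its labels in `\shptxt` groups is not confirmed here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of DocWire): count occurrences of one RTF
// control word, given without the backslash (e.g. "shptxt"), in raw RTF text.
// A control word is a backslash followed by ASCII letters, so "shp" must not
// match inside "\shpinst" or "\shptxt". Payloads of "\bin" are not skipped.
std::size_t count_control_word(std::string_view rtf, std::string_view word)
{
    assert(!word.empty());
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < rtf.size())
    {
        if (rtf[i] != '\\') { ++i; continue; }
        ++i; // skip the backslash
        const std::size_t start = i;
        while (i < rtf.size() && std::isalpha(static_cast<unsigned char>(rtf[i])))
            ++i;
        if (i == start)
        {
            // Control symbol such as "\\", "\{" or "\*": skip the escaped character.
            if (i < rtf.size()) ++i;
            continue;
        }
        if (rtf.substr(start, i - start) == word)
            ++count;
    }
    return count;
}
```

A non-zero count for `shptxt` would suggest that the text lives inside shape groups; a zero count would point toward another container, such as an embedded object or picture.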