docwire / docwire

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
https://docwire.io

DOCX conversion strips out spaces between words #139

Closed efieleke-tausight closed 5 days ago

efieleke-tausight commented 1 month ago

When I convert the attached document using docwire, it removes some spaces between words. For example, instead of "Motion Picture Association" it has "MotionPictureAssociation"

A little further down, instead of "Dear Mr. Carson:" it has "DearMr.Carson:"

This file is rife with examples of this problem. Perhaps the space character in the Word document isn't a standard space. But if I copy the text from Word and paste it into Notepad, Notepad has the proper spaces between words.

123mpaa.docx

efieleke-tausight commented 1 month ago

I was using version 2024.06.24, BTW

efieleke-tausight commented 1 month ago

We are seeing that the old doc2Text is not stripping away these spaces between words, given the same file. So this appears to be a regression.

For reference, here is how we call into docwire. In this case, GetMappedExtension is returning "docx" (and the file really is a docx file).

    auto iStream = std::make_shared<boost::iostreams::stream<boost::iostreams::array_source>>(buffer, bufferSize);

    docwire::data_source dataSource(
        docwire::seekable_stream_ptr { iStream },
        docwire::file_extension("." + GetMappedExtension(extension)));

    std::basic_stringstream<char> outStream;

    std::move(dataSource) |
        docwire::ParseDetectedFormat<docwire::OfficeFormatsParserProvider, docwire::MailParserProvider,
            docwire::OcrParserProvider>(_parameters) |
        docwire::PlainTextExporter() |
        outStream;

as-ascii commented 1 month ago

It seems that the problematic spaces are encoded as a text tag containing only a single space:

<w:t> </w:t>

There is no xml:space="preserve" attribute, so I would say this is on the edge of the official OOXML standard. It is possible that older versions treat it differently because the latest releases included some optimizations in the code, including in this parser.

What needs to be done:

as-ascii commented 1 month ago

All versions down to 5.0.8 do not send text tags to the parsing chain if the text contains only whitespace, unless the XML tag has the xml:space="preserve" attribute:

    if (space_preserve || !std::all_of(s.begin(), s.end(), [](auto c){return isspace(static_cast<unsigned char>(c));}))
        send_tag(tag::Text{.text = s});
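
To illustrate how the condition quoted above leads to fused words, here is a minimal standalone sketch. It is not DocWire code: `should_emit_text` and `join_text_nodes` are hypothetical names, and the list of (text, preserve flag) pairs is only an assumed stand-in for the sequence of `<w:t>` nodes in the document.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper mirroring the quoted condition: a text node is
// forwarded if the preserve flag is set, or if the text contains at least
// one non-whitespace character.
bool should_emit_text(const std::string& s, bool space_preserve)
{
    const bool only_whitespace = std::all_of(s.begin(), s.end(),
        [](unsigned char c) { return std::isspace(c) != 0; });
    return space_preserve || !only_whitespace;
}

// Hypothetical stand-in for the parsing chain: concatenates the text of
// every node that passes the filter. Each pair is (text, space_preserve).
std::string join_text_nodes(const std::vector<std::pair<std::string, bool>>& nodes)
{
    std::string out;
    for (const auto& [text, preserve] : nodes)
        if (should_emit_text(text, preserve))
            out += text;
    return out;
}
```

With the nodes `"Motion"`, `" "`, `"Picture"` and no preserve flag on any of them, the middle node is dropped and the result is `MotionPicture`, which matches the symptom reported in this issue. Setting the flag on the middle node keeps the space.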

as-ascii commented 1 month ago

The stripping was introduced in version 5.0.2:

Version 5.0.2:
   ...
   * Ignore all XML text nodes that contains only whitespaces to fix crashes with some ODF and OOXML documents
   * Add support for T tag in OOXML formats with whitespace preserve attribute

This is the commit (in the previous, private repository) that introduced it: https://github.com/docwire/doctotext_old_private/commit/ad676e298e5de7b9a3d22e26609332ef45f3a8e4

And the pull request: https://github.com/docwire/doctotext_old_private/pull/948

It would be good to find the issue with the problematic documents and recheck them after the new modifications.

as-ascii commented 1 month ago

http://officeopenxml.com/WPtextSpacing.php : As with any XML, to preserve spaces in OOXML, the space attribute must be set to preserve: <w:t xml:space="preserve">. So it seems that the document is not correct according to the specification. Of course, if this is a common situation, a workaround would be ideal, but we need to make sure that it does not introduce errors for other documents.

as-ascii commented 1 month ago

It seems that xml:space="preserve" is set in the document header instead of on the w:t tag. Supporting this may be a workaround for the issue.
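
One way such a workaround could look is sketched below. It relies on the XML rule that `xml:space` applies to an element and its descendants until another `xml:space` attribute overrides it. This is only an illustration and not the change that was made in DocWire: the `space_scope` class and its methods are hypothetical, and it assumes the attribute sits on an ancestor element of the `w:t` tags.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical tracker for the effective xml:space value while walking an
// XML tree with open/close element events.
class space_scope
{
public:
    // Call when an element opens. `attr` is the value of its xml:space
    // attribute, or std::nullopt if the element does not carry one.
    void open_element(const std::optional<std::string>& attr)
    {
        const bool inherited = !stack_.empty() && stack_.back();
        // An explicit attribute overrides the inherited value.
        stack_.push_back(attr ? (*attr == "preserve") : inherited);
    }

    // Call when the matching element closes.
    void close_element()
    {
        if (!stack_.empty())
            stack_.pop_back();
    }

    // True if whitespace-only text at the current depth should be kept.
    bool preserve() const
    {
        return !stack_.empty() && stack_.back();
    }

private:
    std::vector<bool> stack_; // one entry per currently open element
};
```

A parser could consult `preserve()` when it reaches a text node, instead of looking only at the attributes of the enclosing `w:t` tag.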

pcoramasionwu commented 3 weeks ago

Expected Outcome: All space characters in all Office document types are preserved exactly as they would appear in standard software applications, such as when opening a DOCX file in Microsoft Word.

as-ascii commented 3 weeks ago

Agreed for documents that follow the official standard. When implementing workarounds for documents that do not follow it, we can only rely on the available test documents, including the one attached to this issue. If we can collect more documents with this problem, that would be perfect.

as-ascii commented 2 weeks ago

The implementation is complete but it will be merged later, together with other improvements.