deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.72k stars, 1.92k forks

fix: extract page breaks from .docx files #8232

Closed: jonstrutz11 closed this 2 months ago

jonstrutz11 commented 3 months ago

Context

Currently, the DOCXToDocument converter does not extract page breaks from Word documents. As a result, downstream components such as DocumentSplitter cannot split by page or assign correct page number metadata. For example, if you split a .docx file by word, the 'page_number' metadata field is 1 for every resulting document.

Proposed Changes:

Added a method to DOCXToDocument that extracts page breaks from Word documents and emits them as '\f' (form feed) characters, so that DocumentSplitter recognizes them.
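
To illustrate why the '\f' characters matter, here is a minimal standalone sketch. It is not the DocumentSplitter implementation; the helper name `chunk_words_with_pages` is hypothetical, and it assumes a chunk's page number is 1 plus the number of form feeds that precede it in the text.

```python
def chunk_words_with_pages(text: str, words_per_chunk: int) -> list[tuple[str, int]]:
    """Split text into word chunks and tag each chunk with a 1-based page number.

    Hypothetical helper for illustration only: the page number of a chunk is
    assumed to be 1 plus the number of '\f' characters seen before it.
    """
    # Split on single spaces only, so a '\f' stays attached to its word.
    words = text.split(" ")
    chunks = []
    page = 1
    for start in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[start:start + words_per_chunk])
        chunks.append((chunk, page))
        # Form feeds inside this chunk move every later chunk to a later page.
        page += chunk.count("\f")
    return chunks


# Without page breaks in the text, every chunk lands on page 1.
without_breaks = chunk_words_with_pages("one two three four five six", 2)

# With '\f' markers, later chunks get increasing page numbers.
with_breaks = chunk_words_with_pages("one two\f three four\f five six", 2)
```

In this sketch, the text without form feeds yields page 1 for every chunk, which mirrors the behavior described under Context, while the text with form feeds yields pages 1, 2 and 3.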

How did you test it?

Notes for the reviewer

Due to the way the python-docx library is set up, we can accurately determine only the location of the first page break in a given paragraph. In the rare case that a paragraph contains more than one page break (meaning it is an extremely long paragraph spanning multiple pages), the locations of the second and subsequent page breaks are not known. To compensate, I append the extra page break characters to the end of the paragraph text, which keeps the overall page count for the document consistent.
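
The fallback described above can be sketched roughly as follows. This is not the code from this PR: `add_page_breaks` is a hypothetical helper, and it assumes the character offset of the first page break and the total number of breaks in the paragraph are already known.

```python
def add_page_breaks(paragraph_text: str, first_break_offset: int, total_breaks: int) -> str:
    """Insert '\f' markers into a paragraph's text.

    Hypothetical illustration: only the first break has a known position
    (first_break_offset). Any further breaks are appended to the end of the
    paragraph so the document's total page count stays consistent.
    """
    if total_breaks <= 0:
        # No page breaks in this paragraph: return the text unchanged.
        return paragraph_text
    head = paragraph_text[:first_break_offset]
    tail = paragraph_text[first_break_offset:]
    # The first break goes at its known position; the rest go at the end.
    return head + "\f" + tail + "\f" * (total_breaks - 1)


single = add_page_breaks("abcdef", 3, 1)    # one break, placed at its known offset
multiple = add_page_breaks("abcdef", 3, 3)  # two extra breaks appended at the end
```

The trade-off is that text after the first break in such a paragraph is attributed to an earlier page than it actually sits on, but the page numbers of everything following the paragraph stay correct.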

Also, for more complex documents that might not have a Paragraph element on every page, the page numbers will not be fully accurate (e.g. off by 1 or 2). However, page numbers should be accurate for simple documents, and even when they are inaccurate, I'd argue that being off by a few pages is still better than every Haystack document extracted from a .docx file having a page_number of 1. Getting accurate page numbers for .docx files appears to be quite challenging because of the way .docx files are rendered.

Checklist

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.

coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 10487208610

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Totals Coverage Status
Change from base Build 10419214121: 0.02%
Covered Lines: 6953
Relevant Lines: 7713
