fix: extract page breaks from .docx files

jonstrutz11 commented 3 months ago

Context

Currently, the DOCXToDocument converter does not extract page breaks from Word documents. As a result, you cannot split by page or get correct page number metadata after running a component such as DocumentSplitter. For example, if you split a .docx file by word, the 'page_number' metadata field is 1 for every resulting document.

Proposed Changes:

Added a method to DOCXToDocument that extracts page breaks from Word documents and emits them as '\f' (form feed) characters so that DocumentSplitter recognizes them.
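To illustrate why the '\f' characters matter, here is a minimal sketch of page counting during a word split. It is not Haystack's DocumentSplitter; the helper name split_by_word_with_pages and the rule "page number = 1 + form feeds seen before the chunk starts" are assumptions made for this example.

```python
def split_by_word_with_pages(text, words_per_chunk):
    """Split text into chunks of N words and tag each with a 1-based page number.

    Hypothetical helper for illustration only. A chunk's page number is
    1 plus the number of form feed characters that appear before it.
    """
    # Split on single spaces only: a bare str.split() would treat the
    # form feed as whitespace and silently drop it.
    words = text.split(" ")
    chunks = []
    page = 1
    for start in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[start:start + words_per_chunk])
        chunks.append({"content": chunk, "page_number": page})
        # Every form feed inside this chunk pushes later chunks one page on.
        page += chunk.count("\f")
    return chunks


# Without page breaks in the text, every chunk stays on page 1.
no_breaks = split_by_word_with_pages("one two three four five six", 2)

# With '\f' markers in the text, the page number advances.
with_breaks = split_by_word_with_pages("one two\f three four\f five six", 2)
```

In this sketch, no_breaks reports page 1 for all three chunks, which mirrors the behavior described under Context, while with_breaks reports pages 1, 2, and 3.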

How did you test it?

Wrote a unit test (and added a new test Word document for it)
Ran unit tests and all passed
Tested manually with a custom RAG pipeline I'm building, where I had to index 1,000+ Word docs.

Notes for the reviewer

Due to the way the python-docx library is set up, we can only accurately determine the location of the first page break in a given paragraph. In the rare case that a paragraph contains more than one page break (meaning it is an extremely long paragraph spanning multiple pages), the locations of the second, third, and later page breaks are not known. To work around this, I append the extra page break characters to the end of the paragraph text so that the overall page count for the document stays consistent.
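The placement strategy described above can be sketched in plain Python. This is only an illustration and not the code in this PR: the function insert_page_breaks and its parameters are hypothetical, and it assumes the paragraph text has already been divided at the first page break.

```python
def insert_page_breaks(before_first_break, after_first_break, total_breaks):
    """Return paragraph text containing one form feed per page break.

    Hypothetical helper for illustration only. Only the position of the
    first break is assumed to be known; any further breaks are appended
    to the end of the paragraph so the total count stays correct.
    """
    if total_breaks <= 0:
        # No page break in this paragraph: return the text unchanged.
        return before_first_break + after_first_break
    # Breaks beyond the first have unknown positions, so they go last.
    extra_breaks = "\f" * (total_breaks - 1)
    return before_first_break + "\f" + after_first_break + extra_breaks


# A long paragraph that spans three page breaks.
text = insert_page_breaks("Start of a long paragraph ", "that runs on.", 3)
```

Here the first '\f' lands where the first break occurs and the other two follow the final sentence, so text inside the paragraph may be attributed to an earlier page than it really is, while every later paragraph still gets the right page number.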

Also, for more complex documents that might not have a Paragraph element on every page, the page numbers will not be 100% accurate (e.g. off by 1 or 2). However, page numbers should be accurate for simple documents, and even when they are inaccurate, I'd argue that being off by a few pages is still better than every Haystack document extracted from a .docx file having a page_number of 1. Getting accurate page numbers from .docx files seems to be quite challenging because of the way .docx files are rendered.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes - please let me know if I need to create a related issue for this
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

CLAassistant commented 3 months ago

All committers have signed the CLA.

coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 10487208610

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.02%) to 90.147%

Totals
Change from base Build 10419214121:	0.02%
Covered Lines:	6953
Relevant Lines:	7713

deepset-ai / haystack