Closed jonstrutz11 closed 2 months ago
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Totals | |
---|---|
Change from base Build 10419214121: | 0.02% |
Covered Lines: | 6953 |
Relevant Lines: | 7713 |
Context
Currently, the DOCXToDocument converter does not extract page breaks from Word documents. This makes it impossible to split by page or to get correct page number metadata from components such as DocumentSplitter. For example, if you split a .docx file by word, the 'page_number' metadata field will be 1 for every resulting document.
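To illustrate why missing page breaks lead to a constant page number, here is a minimal sketch of a word-based splitter that derives each chunk's page number from the form feed characters seen so far. It is a simplified stand-in, not the actual DocumentSplitter code; the function name `split_by_word_with_pages` and the dictionary layout are assumptions made for this example.

```python
def split_by_word_with_pages(text: str, words_per_chunk: int) -> list:
    """Split text into chunks of N words and tag each chunk with a page number.

    A chunk's page number is 1 plus the number of form feed characters
    that occur in the text before the chunk starts.
    """
    words = text.split(" ")
    chunks = []
    page = 1
    for start in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[start:start + words_per_chunk])
        chunks.append({"content": chunk, "page_number": page})
        # Form feeds inside this chunk push later chunks onto later pages.
        page += chunk.count("\f")
    return chunks


# Text as the converter produces it today: no page break markers at all.
without_breaks = "one two three four five six"
# The same text with a form feed wherever the .docx had a page break.
with_breaks = "one two\f three four\f five six"

print([c["page_number"] for c in split_by_word_with_pages(without_breaks, 2)])
print([c["page_number"] for c in split_by_word_with_pages(with_breaks, 2)])
```

Without any '\f' characters in the text the counter never advances, so every chunk is reported as page 1.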
Proposed Changes:
Added a method to DOCXToDocument that extracts page breaks from Word documents as '\f' (form feed) characters so that DocumentSplitter recognizes them.
How did you test it?
Notes for the reviewer
Due to the way the python-docx library is set up, we can only accurately determine the location of the first page break in a given paragraph. In the rare case that a paragraph contains more than one page break (which means it is an extremely long paragraph spanning multiple pages), the locations of the 2nd, 3rd, etc. page breaks are not known. To compensate, I append the extra page break characters to the end of the paragraph text, which keeps the overall page count for the document consistent.
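The placement rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in this PR: `ParagraphInfo` is a hypothetical stand-in for the information read from a paragraph, and no python-docx calls are made.

```python
from dataclasses import dataclass
from typing import Optional

FORM_FEED = "\f"


@dataclass
class ParagraphInfo:
    """Hypothetical container for what is known about one paragraph."""
    text: str                                      # full text, no break markers
    text_before_first_break: Optional[str] = None  # None if there is no break
    page_break_count: int = 0                      # breaks inside the paragraph


def paragraph_text_with_page_breaks(para: ParagraphInfo) -> str:
    """Return the paragraph text with one form feed per page break."""
    if para.page_break_count == 0 or para.text_before_first_break is None:
        return para.text
    before = para.text_before_first_break
    after = para.text[len(before):]
    # Only the first break has a known position; the rest go at the end.
    extra = FORM_FEED * (para.page_break_count - 1)
    return before + FORM_FEED + after + extra


single = ParagraphInfo("Hello world", "Hello ", 1)
multi = ParagraphInfo("abcdef", "ab", 3)
print(repr(paragraph_text_with_page_breaks(single)))
print(repr(paragraph_text_with_page_breaks(multi)))
```

The number of form feeds in the result always equals the number of page breaks in the paragraph, so page counts downstream stay consistent even though the later breaks are not at their true positions.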
Also, for more complex documents that might not have a Paragraph element on every page, the page numbers will not be 100% accurate (e.g. off by 1 or 2). However, page numbers should be accurate for simple documents, and even if inaccurate I'd argue that being off by a few pages is still better than having the page_number metadata of every Haystack document (extracted from .docx files) be 1. It seems like getting accurate page numbers in .docx files is quite challenging due to the way .docx files are rendered.
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.