Closed emdeh closed 3 months ago
To maintain modularity and minimise complexity in the existing function, the plan is to create a new function specifically for handling the new document type and patterns. Depending on the detected document type, this function can then be called by a super function or directly.
The outline is like this:
Super Function: This function determines the document type or splitting strategy needed based on initial analysis (e.g., scanning the first few pages for specific patterns). It then delegates to the appropriate processing function. The document type could also be determined by the user selection.
New Function for New Document Type: Tailored to handle the "Statement Number" and "Continued overleaf..." patterns, providing specific logic for splitting based on these cues.
This approach separates the logic for different document types, improving code clarity and making it easier to adapt to future requirements or document types.
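A minimal sketch of the delegation idea, assuming the text of each page has already been extracted into a list of strings (the PDF library is not named here, so extraction is left out). The names find_starts, one_of_x_starts, FINDERS and the type label "one_of_x" are hypothetical and are not taken from the actual code.

```python
import re
from typing import Callable, Dict, List, Optional

# A start finder takes the text of every page and returns the 0-based
# indices of pages on which a new document begins.
StartFinder = Callable[[List[str]], List[int]]


def one_of_x_starts(page_texts: List[str]) -> List[int]:
    """Hypothetical stand-in for the existing finder: pages showing '1 of x'."""
    return [
        index
        for index, text in enumerate(page_texts)
        if re.search(r"\b1 of \d+\b", text)
    ]


def find_starts(
    doc_type: str,
    page_texts: List[str],
    finders: Dict[str, StartFinder],
) -> Optional[List[int]]:
    """Delegate to the finder registered for doc_type.

    Returns None when no finder is registered, so the caller can route
    the file to manual processing instead.
    """
    finder = finders.get(doc_type)
    if finder is None:
        return None
    return finder(page_texts)


# Supporting a new document type means adding one entry here
# (the key is an assumed label, not one used by the real code).
FINDERS: Dict[str, StartFinder] = {"one_of_x": one_of_x_starts}
```

With this shape, the super function stays small: it only looks up a finder and calls it, while each document type keeps its own logic.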
Created a new function called find_statement_number_starts(), which handles the use case where there is no "1 of x" pattern on the first page by searching for a regular expression that matches the new document type. The search is re.search(r'Statement number\s+(\d+)', page_text).
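As an illustration only, the statement-number search could be applied page by page as below. This is a sketch under assumptions, not the actual find_statement_number_starts(): it works on a list of already-extracted page texts, the function name statement_starts_from_texts is hypothetical, and the pattern is loosened with re.IGNORECASE and \s* so that it also matches the "Statement Number 1" and "Statement Number1" spellings described in the issue.

```python
import re
from typing import List, Optional

# Assumed variant of the pattern quoted above: case-insensitive, and the
# whitespace before the digits is optional.
STATEMENT_RE = re.compile(r"Statement number\s*(\d+)", re.IGNORECASE)


def statement_starts_from_texts(page_texts: List[str]) -> List[int]:
    """Return 0-based indices of pages where the statement number changes."""
    starts: List[int] = []
    previous: Optional[str] = None
    for index, text in enumerate(page_texts):
        match = STATEMENT_RE.search(text)
        if match is None:
            # No statement number on this page; treat it as a continuation.
            continue
        number = match.group(1)
        if number != previous:
            starts.append(index)
            previous = number
    return starts
```

Comparing each page's number with the previous one means continuation pages, which repeat the same statement number, do not register as new starts.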
A new function named detect_document_type() invokes find_statement_number_starts() from the core process_all_pdfs() function if it finds a keyword on the first three pages.
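The keyword check might look roughly like the sketch below. The keyword itself is not given above, so "Statement number" is an assumption, as are the function name detect_type_from_texts, the returned labels, and the use of pre-extracted page texts rather than a PDF reader.

```python
from typing import List


def detect_type_from_texts(
    page_texts: List[str],
    keyword: str = "Statement number",  # assumed keyword, not confirmed
    pages_to_scan: int = 3,
) -> str:
    """Classify a PDF by looking for a keyword on its first few pages.

    Returns the hypothetical label "statement" when the keyword appears on
    any of the first pages_to_scan pages, otherwise "one_of_x".
    """
    needle = keyword.lower()
    for text in page_texts[:pages_to_scan]:
        if needle in text.lower():
            return "statement"
    return "one_of_x"
```

Limiting the scan to the first few pages keeps detection cheap on long PDFs, at the cost of missing a keyword that first appears later in the file.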
split_pdf() has been updated to accommodate the new find function, and the existing find function has been renamed.
Context
The function process_all_pdfs() automates the splitting of PDF files that contain more than one discrete document. It separates the documents based on a set pattern ("1 of x") and saves the split files to an output folder. If a PDF doesn't contain the specified pattern and thus can't be split automatically, it is moved to a manual processing folder for later attention. The operation proceeds as follows:
1. Setup and Iteration: Iterates through each PDF file found in the given input_folder. This is achieved by using the Path object from the pathlib module to glob (pattern match) for .pdf files.
2. Document Start Identification: For each PDF file, the function calls find_document_starts() with the file's path as an argument. This function is expected to return a list of page numbers where new documents start within the PDF. If this list contains only one element, the PDF file consists of a single document. If the list contains multiple elements, these are the starting pages of the documents contained within the PDF.
3. Splitting: When document starts are identified, the split_pdf() function is called. This function splits the original PDF into separate documents based on the page numbers identified earlier and saves those splits to the output_folder.
4. Manual Processing: If no document starts are identified (meaning the function couldn't determine how to split the PDF automatically), the file is moved to a manual_processing folder. This scenario suggests that the PDF either doesn't contain the pattern "1 of x" or contains it in a form that wasn't detected. In addition to moving the file, the function updates or creates a manifest file within the manual processing folder. This manifest lists the filenames of all PDFs requiring manual intervention.
5. Completion and Cleanup: After processing all PDFs in the input folder, the function prints a message indicating that the splitting process is complete.
Issue
The function does not cover all use cases. Some documents that need to be processed follow a different pattern. For example, a new use case is a document type that does not have "1 of x" on the first page. Instead, the first page of each statement says "Statement Number 1" (or the relevant statement number).
Every subsequent page of the statement has "page x of x" in brackets next to the statement number.
So, for example, if there were two statements in a PDF, the first page would say "Statement Number 1", the second page "Statement Number 1(page 2 of 3)", the third page "Statement Number 1(page 3 of 3)", the fourth page "Statement Number 2", the fifth page "Statement Number 2(page 2 of x)", and so on.
Each page of a statement except the last also carries the string "Continued overleaf...". So, if there were two statements in a PDF, those words would appear on every page except the last page of each statement.
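The "Continued overleaf..." cue on its own is enough to locate statement boundaries, which can be sketched as follows. This is an illustrative alternative rather than the approach actually taken: it assumes page texts are already extracted, that the marker appears exactly as "Continued overleaf", and the function name starts_from_continuation is hypothetical.

```python
from typing import List


def starts_from_continuation(
    page_texts: List[str],
    marker: str = "Continued overleaf",  # assumed exact spelling of the cue
) -> List[int]:
    """Return 0-based start pages using the continuation marker.

    A page without the marker is the last page of its statement, so the
    page that follows it begins a new statement.
    """
    starts: List[int] = []
    next_page_starts_document = True  # the first page always starts one
    for index, text in enumerate(page_texts):
        if next_page_starts_document:
            starts.append(index)
        next_page_starts_document = marker not in text
    return starts
```

For the two-statement example above (three pages, then two), this yields start pages 0 and 3. It could also serve as a cross-check against the statement-number pattern, since both cues should agree on where each statement begins.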
Relevant code