The function splitting pdfs does not cover all use-cases

emdeh commented 3 months ago

Context

The function, process_all_pdfs(), automates the process of splitting PDF files that contain more than one discrete document. It separates the documents based on a set pattern ("1 of x") and saves the split files to an output folder. If a PDF doesn't contain the specified pattern and thus can't be split automatically, it's copied to a manual processing folder for later attention. The operation proceeds as follows:

1. Setup and Iteration: Iterates through each PDF file found in the given input_folder, using the Path object from the pathlib module to glob (pattern match) for .pdf files.
2. Document Start Identification: For each PDF file, the function calls find_document_starts() with the file's path as an argument. This function is expected to return a list of page numbers where new documents start within the PDF. If the list contains only one element, the PDF consists of a single document. If it contains multiple elements, these are the starting pages of the documents contained within the PDF.
3. Single Document Handling: If only one document start is found (meaning the PDF does not need splitting), the original file is copied to the output_folder, with a message indicating the action.
4. Multiple Document Handling: If multiple starts are found, indicating the presence of multiple documents within a single PDF, the split_pdf() function is called. It splits the original PDF into separate documents based on the page numbers identified earlier and saves them to the output_folder.
5. Manual Processing: If no document starts are identified (meaning the function couldn't determine how to split the PDF automatically), the file is copied to the manual_processing_folder. This scenario suggests that the PDF either doesn't contain the pattern "1 of x" or contains it in a form that wasn't detected. In addition to copying the file, the function creates or rewrites a manifest file within the manual processing folder. This manifest lists the filenames of all PDFs requiring manual intervention.
6. Completion and Cleanup: After processing all PDFs in the input folder, the function prints a message indicating that splitting is complete.

Issue

The function does not cover all use cases. Some documents that need to be processed have a different pattern. For example, a new use case is where the document type does not have a 1 of x on the first page. Instead, it says "Statement Number 1" on each page.

On every subsequent page of a statement, it has "page x of y" in brackets next to the statement number.

So, for example, if there were two statements in a PDF, the first page would say "Statement Number 1", the second page "Statement Number 1 (page 2 of 3)", the third page "Statement Number 1 (page 3 of 3)", the fourth page "Statement Number 2", the fifth page "Statement Number 2 (page 2 of x)", and so on.

There is also the string "Continued overleaf..." on each page of a statement except the last. So, for example, if there were two statements in a PDF, those words would appear on every page except the last page of each statement.
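The "Continued overleaf..." cue on its own is enough to locate statement boundaries: a page starts a new statement whenever the page before it does not carry the marker. A minimal sketch of that idea, assuming the page text has already been extracted into a list of strings (the function name and sample texts are hypothetical and not part of the repository):

```python
# Hypothetical marker text, taken from the wording described in this issue.
CONTINUED_MARKER = "Continued overleaf"


def starts_from_continuation_marker(page_texts):
    """Return 0-based indexes of pages that begin a new statement.

    A page begins a statement if it is the first page, or if the previous
    page did not contain the continuation marker (i.e. it was a last page).
    """
    starts = []
    previous_page_continues = False
    for index, text in enumerate(page_texts):
        if not previous_page_continues:
            starts.append(index)
        previous_page_continues = CONTINUED_MARKER in text
    return starts


# Two statements: pages 0-2 and pages 3-4 (sample text is made up).
pages = [
    "Statement Number 1 ... Continued overleaf...",
    "Statement Number 1 (page 2 of 3) ... Continued overleaf...",
    "Statement Number 1 (page 3 of 3)",
    "Statement Number 2 ... Continued overleaf...",
    "Statement Number 2 (page 2 of 2)",
]
print(starts_from_continuation_marker(pages))  # [0, 3]
```

This relies on the marker being extracted reliably from every non-final page, so it may work best as a cross-check against the statement number rather than as the only signal.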

Relevant code

import os
import re
import shutil
from pathlib import Path

import fitz  # PyMuPDF

def process_all_pdfs(input_folder, output_folder, manual_processing_folder):
    """
    Processes all PDF files in a given folder, splitting them into separate documents.

    This function iterates over all PDF files in the input folder, identifies the
    starting pages of documents within each PDF, and splits them into separate PDF files.
    The new PDF files are saved to the specified output folder.
    If no '1 of x' pattern is found, the PDF file is copied to the manual processing folder.

    Args:
        input_folder (str): The folder containing the PDF files to process.
        output_folder (str): The folder where the split PDFs will be saved.
        manual_processing_folder (str): The folder where the PDF files without '1 of x' pattern will be copied.
    """
    print(f"Splitting files in {input_folder} and saving individual statements to {output_folder}...\n")
    # Iterate through each PDF file in the input folder
    for pdf_file in Path(input_folder).glob('*.pdf'):
        pdf_path = str(pdf_file)
        # Find the starting pages of documents within the PDF.
        doc_starts = find_document_starts(pdf_path)
        if len(doc_starts) == 1:
            # If only one document start is found, copy the original file to the output folder
            shutil.copy(pdf_path, output_folder)
            print(f"{os.path.basename(pdf_file)} appears to be a single statement. Copying to folder: {os.path.basename(output_folder)}.\n")
        elif doc_starts:
            # Split the PDF into separate documents.
            split_pdf(pdf_path, output_folder, doc_starts)
        else:
            # Copy the PDF file to the manual processing folder if it can't be split.
            print(f"Could not split {os.path.basename(pdf_file)}, copying to folder: {os.path.basename(manual_processing_folder)}.\n")
            shutil.copy(pdf_path, manual_processing_folder)
            # Create a manifest file for the unsplit files.
            manifest_path = os.path.join(manual_processing_folder, "manifest-of-unsplit-files.txt")
            with open(manifest_path, 'w') as manifest_file:
                manifest_file.write("Manifest of unsplit files:\n")
                # Use a separate loop variable so the outer pdf_file is not overwritten.
                for unsplit_file in Path(manual_processing_folder).glob('*.pdf'):
                    manifest_file.write(f"{unsplit_file.stem}\n")
            # Advise the user to manually split the file and add it to the split_files_folder.
            print(f" {os.path.basename(pdf_file)} will need to be manually split and placed in the {os.path.basename(output_folder)} on another extraction run. A manifest of unsplit files is in {os.path.basename(manual_processing_folder)}.")
    print("Splitting complete.\n\n")

def find_document_starts(pdf_path):
    """
    Identifies the starting pages of documents within a PDF file.

    This function scans through each page of a PDF file looking for a specific
    pattern ('1 of x') which denotes the beginning of a new document. It collects
    and returns the page numbers where new documents start.

    Args:
        pdf_path (str): The file path of the PDF to be processed.

    Returns:
        list: A list of page numbers where new documents start.
    """
    # Initialize a list to hold the starting pages of documents.
    doc_starts = []

    # Open the PDF file for processing.
    doc = fitz.open(pdf_path)

    # Iterate through each page in the PDF.
    for page_num in range(len(doc)):
        # Extract text from the current page.
        page_text = doc.load_page(page_num).get_text()

        # If the '1 of x' pattern is found, append the page number to the list.
        if re.search(r'\b1 of \d+', page_text):
            doc_starts.append(page_num)

    # Close the PDF after processing.
    doc.close()

    # Return the collected start pages to the caller.
    return doc_starts

def split_pdf(pdf_path, output_folder, doc_starts):
    """
    Splits a PDF into multiple documents based on the starting pages of each document.

    For each segment identified by the starting pages, a new PDF file is created
    in the specified output folder. The new files are named using the original
    PDF's name with a suffix indicating the document segment.

    Args:
        pdf_path (str): The file path of the PDF to be split.
        output_folder (str): The folder where the split PDFs will be saved.
        doc_starts (list): A list of page numbers where new documents start.
    """
    # Open the original PDF file.
    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    pdf_name = Path(pdf_path).stem

    # Iterate through each document start page to split the PDF.
    for i, start_page in enumerate(doc_starts):
        # Determine the end page for the current document segment.
        end_page = doc_starts[i + 1] if i + 1 < len(doc_starts) else total_pages
        # Construct the output file path for the current segment.
        output_path = os.path.join(output_folder, f"{pdf_name}_statement_{i + 1}.pdf")

        # Create a new PDF for the current segment.
        new_doc = fitz.open()
        for page_num in range(start_page, end_page):
            new_doc.insert_pdf(doc, from_page=page_num, to_page=page_num)
        # Save the new PDF segment.
        new_doc.save(output_path)
        new_doc.close()
    # Close the original PDF.
    doc.close()

emdeh commented 3 months ago

To maintain modularity and minimise complexity in the existing function, the plan is to create a new function specifically for handling the new document type and patterns. Depending on the detected document type, this function can then be called by a super function or directly.

The outline is like this:

Super Function: This function determines the document type or splitting strategy needed based on initial analysis (e.g., scanning the first few pages for specific patterns). It then delegates to the appropriate processing function. The document type could also be determined by the user selection.
New Function for New Document Type: Tailored to handle the "Statement Number" and "Continued overleaf..." patterns, providing specific logic for splitting based on these cues.

This approach separates the logic for different document types, improving code clarity and making it easier to adapt to future requirements or document types.
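One way the outline above could be sketched is a small lookup table that maps a document type to the function that finds its start pages. Everything here is an assumption for illustration: the type labels, the function names, and the choice to work on a list of already-extracted page texts rather than on a PDF path.

```python
import re


def find_page_x_of_y_starts(page_texts):
    """Existing strategy: a page containing '1 of x' starts a new document."""
    return [i for i, text in enumerate(page_texts) if re.search(r"\b1 of \d+", text)]


def find_statement_starts(page_texts):
    """New strategy: a page starts a statement when its statement number changes."""
    starts, previous = [], None
    for i, text in enumerate(page_texts):
        match = re.search(r"Statement Number\s+(\d+)", text, re.IGNORECASE)
        if match and match.group(1) != previous:
            starts.append(i)
            previous = match.group(1)
    return starts


# Hypothetical labels; the type could come from detection or from user selection.
STRATEGIES = {
    "page_x_of_y": find_page_x_of_y_starts,
    "statement_number": find_statement_starts,
}


def find_starts(page_texts, document_type):
    """Delegate to the start-finding strategy registered for document_type."""
    try:
        strategy = STRATEGIES[document_type]
    except KeyError:
        raise ValueError(f"Unknown document type: {document_type!r}") from None
    return strategy(page_texts)


pages = [
    "Statement Number 1",
    "Statement Number 1 (page 2 of 3)",
    "Statement Number 1 (page 3 of 3)",
    "Statement Number 2",
    "Statement Number 2 (page 2 of 2)",
]
print(find_starts(pages, "statement_number"))  # [0, 3]
```

Supporting a further document type would then mean adding one function and one dictionary entry, leaving the existing strategies untouched.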

emdeh commented 3 months ago

Created a new function called find_statement_number_starts(), which handles the use case where there is no "1 of x" pattern on the first page by searching for a regular expression that matches the new document type. The search is performed with re.search(r'Statement number\s+(\d+)', page_text).
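For reference, a short sketch of what this pattern does and does not match. The sample strings are made up, and the helper name is hypothetical. Whether the real PDFs capitalise "Number" as written earlier in this issue is not known here, so the case-insensitive variant is shown only as an option.

```python
import re

# The pattern quoted in the comment above.
PATTERN = r"Statement number\s+(\d+)"


def statement_number(page_text, flags=0):
    """Return the statement number found in page_text, or None if absent."""
    match = re.search(PATTERN, page_text, flags)
    return int(match.group(1)) if match else None


print(statement_number("Statement number 2 (page 2 of 3)"))    # 2
print(statement_number("Statement Number 2"))                  # None (case-sensitive)
print(statement_number("Statement Number 2", re.IGNORECASE))   # 2
print(statement_number("Statement number2 (page 2 of 3)"))     # None (\s+ needs whitespace)
```

If the extracted text sometimes drops the space before the digit, `\s*` in place of `\s+` would tolerate that.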

A new function, detect_document_type(), is called within the core process_all_pdfs() function and invokes find_statement_number_starts() if it finds a keyword on the first three pages.
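A rough sketch of the first-three-pages check, under assumptions: the keyword is taken to be "Statement number" (the comment does not say which keyword is used), the function returns a label instead of invoking the find function directly, and the name and labels are hypothetical.

```python
def detect_type_from_first_pages(page_texts, keyword="Statement number", pages_to_scan=3):
    """Return a document-type label based on a keyword in the leading pages.

    Only the first pages_to_scan pages are inspected; the comparison is
    case-insensitive so 'Statement Number' and 'Statement number' both count.
    """
    needle = keyword.lower()
    for text in page_texts[:pages_to_scan]:
        if needle in text.lower():
            return "statement_number"
    return "page_x_of_y"


print(detect_type_from_first_pages(["Cover sheet", "Statement Number 1"]))  # statement_number
print(detect_type_from_first_pages(["Invoice", "Page 1 of 2"]))             # page_x_of_y
```

Limiting the scan to a few pages keeps detection cheap on large files, at the cost of missing a keyword that first appears later in the document.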

split_pdf has been updated to accommodate the new find function.

The existing find function has had its name updated.

emdeh / pdf-document-processor