Post-processing task - Find and move duplicate files

emdeh commented 1 day ago

Description

Create a postprocessing task that identifies duplicate statements by checking if files have the same statement number, statement start date, and account number. If duplicates are found, move the duplicate files to a duplicates folder within the account-specific subfolders.

To-Do List Overview:

Implement Statement Number Extraction:

[ ] In PDFPostprocessor, add a method extract_statement_number(self, pdf_path).

Add Task to Task Registry:

[ ] Register a new task named identify_and_move_duplicates in the task_registry.

Create Unique Identifiers:

[ ] Combine the extracted account number, statement start date, and statement number to form unique identifiers.

Detect and Organize Duplicates:

[ ] In the task function, identify duplicates and move them to a duplicates folder.
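A minimal sketch of the identifier and detection steps above, assuming the three fields have already been extracted into plain dicts. The function name `find_duplicates` and the dict keys are hypothetical, not part of the existing code, and treating the first file seen as the original is an assumption.

```python
from collections import defaultdict


def find_duplicates(statements):
    """Split statements into originals and duplicates by composite key.

    `statements` is an iterable of dicts with the hypothetical keys
    'path', 'account_number', 'start_date' and 'statement_number'.
    The first file seen for each key is kept as the original (assumption).
    """
    originals = {}
    duplicates = defaultdict(list)
    for stmt in statements:
        # Unique identifier: account number + statement start date + statement number.
        key = (stmt["account_number"], stmt["start_date"], stmt["statement_number"])
        if key in originals:
            duplicates[key].append(stmt["path"])
        else:
            originals[key] = stmt["path"]
    return originals, dict(duplicates)
```

The returned `duplicates` mapping lists every file that should be moved, grouped by the identifier it collided on.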

Outline of New Code and Placement:

Add Statement Number Extraction Method to `PDFPostprocessor`:

[ ] Implement extract_statement_number.

Implement the Task Function:

[ ] Add a static method identify_and_move_duplicates in PDFPostprocessor.

Update Task Registry:

[ ] Add identify_and_move_duplicates to the task_registry.
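As a rough illustration of the registry step, the sketch below assumes `task_registry` is a dict mapping task names to callables. That shape, the `run_task` helper, and the placeholder method body are all assumptions; the real `PDFPostprocessor` may organise this differently.

```python
class PDFPostprocessor:
    """Trimmed-down stand-in for the real class, for illustration only."""

    def __init__(self):
        # Hypothetical registry shape: task name -> callable.
        self.task_registry = {
            "identify_and_move_duplicates": self.identify_and_move_duplicates,
        }

    def identify_and_move_duplicates(self):
        # Placeholder body; the real task would scan and move files.
        return "duplicates checked"

    def run_task(self, name):
        # Hypothetical helper: look the task up by name and run it.
        try:
            task = self.task_registry[name]
        except KeyError:
            raise ValueError(f"Unknown post-processing task: {name}") from None
        return task()
```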

Other considerations

Because we need to handle various statement formats, the code will need to prompt the user for examples from which a regex pattern can be generated.
Eventually, we could move this into the YAML config file.
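One possible way to turn a user-provided example into a pattern is sketched below. It assumes fixed-length fields: each run of digits or letters is matched with the same length as in the example, and every other character is matched literally. Formats with variable-length numbers would need a looser rule, so treat this as a starting point rather than the intended implementation.

```python
import re

# Split the example into runs of digits, runs of letters, or single other characters.
_TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def generate_regex_from_example(example):
    """Build a regex that matches strings shaped like `example`."""
    parts = []
    for match in _TOKEN.finditer(example):
        token = match.group()
        if match.lastgroup == "digits":
            parts.append("[0-9]{%d}" % len(token))
        elif match.lastgroup == "letters":
            parts.append("[A-Za-z]{%d}" % len(token))
        else:
            # Separators such as '-' or '/' are matched literally.
            parts.append(re.escape(token))
    return "".join(parts)
```

For example, a pattern generated from `AB-1234/56` would match `XY-9876/01` but not `XY-98765/01`.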

emdeh commented 1 day ago

I have outlined the following methods for finding and moving duplicate files among the split files (all in the PDFPostProcessor class in the postprocess_utils.py file).

This was done under Post-processing task - Group split files by account number, as a general outline, because some of the methods will be used across tasks.

def get_pattern_from_user(self, field_name): Prompts the user for an example and generates a regex pattern.
def generate_regex_from_example(self, example): Generates a regex pattern from a user-provided example.
def extract_field(self, pdf_path, pattern): Extracts a field from a PDF using a regex pattern.
def identify_and_move_duplicates(self): Identify duplicate statements and move them to a duplicates folder.
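The "move" half of `identify_and_move_duplicates` could look roughly like the sketch below. It assumes each file already sits in its account-specific subfolder, so the duplicates folder is created next to the file. The helper name `move_to_duplicates_folder` and the numeric-suffix rule for name clashes are hypothetical.

```python
import shutil
from pathlib import Path


def move_to_duplicates_folder(pdf_path, folder_name="duplicates"):
    """Move `pdf_path` into a duplicates folder beside it; return the new path."""
    pdf_path = Path(pdf_path)
    target_dir = pdf_path.parent / folder_name
    target_dir.mkdir(exist_ok=True)

    target = target_dir / pdf_path.name
    counter = 1
    # Avoid overwriting a file that was moved here earlier under the same name.
    while target.exists():
        target = target_dir / f"{pdf_path.stem}_{counter}{pdf_path.suffix}"
        counter += 1

    shutil.move(str(pdf_path), str(target))
    return target
```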

Other notes: We may need to call existing methods in pdf_processor.py to read the PDF (whether it is machine-readable or needs OCR).
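A small sketch of how that fallback might be wired, without assuming anything about what pdf_processor.py actually exposes: the two extractors are passed in as callables. The function name, both parameters, and the `min_chars` threshold are hypothetical.

```python
def read_pdf_text(pdf_path, extract_native, extract_ocr, min_chars=20):
    """Return text for `pdf_path`, falling back to OCR when little text is found.

    `extract_native` and `extract_ocr` are hypothetical callables standing in
    for whatever pdf_processor.py provides; each takes a path and returns a str.
    """
    text = extract_native(pdf_path) or ""
    # Heuristic (assumption): very little embedded text means a scanned PDF.
    if len(text.strip()) >= min_chars:
        return text
    return extract_ocr(pdf_path)
```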

emdeh commented 1 day ago

Once development on Post-processing task - Group split files by account number has created the shared methods, create a new development branch for this issue, making sure to create it from that issue's branch (not main).

emdeh commented 11 hours ago

Edit: I have created a base-feature-iteration2 branch and merged changes from this feature branch into it. This means the "feature" branches for this task and https://github.com/emdeh/pdf-document-processor/issues/55 can be created from the base-feature-iteration2 branch.

Then we PR the features into that base branch, and then the base branch into main.
