Open emdeh opened 1 day ago
I have outlined the following methods for finding and moving duplicate files among the split files (all in the PDFPostProcessor class in the postprocess_utils.py file).
This was done under Post-processing task - Group split files by account number as the general outline, since some of the methods will be used across tasks.
def get_pattern_from_user(self, field_name):
Prompts the user for an example and generates a regex pattern.
def generate_regex_from_example(self, example):
Generates a regex pattern from a user-provided example.
def extract_field(self, pdf_path, pattern):
Extracts a field from a PDF using a regex pattern.
def identify_and_move_duplicates(self):
Identifies duplicate statements and moves them to a duplicates folder.
Other notes: We may need to call existing methods in pdf_processor.py to read the PDF (whether it is machine-readable or needs OCR).
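As a rough illustration of how generate_regex_from_example and extract_field might fit together, here is a minimal sketch. It is not the code in postprocess_utils.py. The character-class mapping is just one assumed approach, and extract_field_from_text is a hypothetical stand-in that works on already-extracted text, since reading the PDF (machine-readable or OCR) is left to the existing methods in pdf_processor.py.

```python
import re
from itertools import groupby


def _char_class(ch: str) -> str:
    """Map one example character to a regex fragment (assumed mapping)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    if ch.isspace():
        return r"\s"
    # Anything else (punctuation, non-ASCII) is matched literally.
    return re.escape(ch)


def generate_regex_from_example(example: str) -> str:
    """Build a pattern matching strings with the same shape as the example."""
    parts = []
    # Collapse runs of the same fragment into a counted repetition.
    for fragment, run in groupby(_char_class(ch) for ch in example):
        count = len(list(run))
        parts.append(fragment if count == 1 else f"{fragment}{{{count}}}")
    return "".join(parts)


def extract_field_from_text(text: str, pattern: str):
    """Hypothetical helper: return the first match in the text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Usage: an example account number "12-3456" yields a pattern that
# matches other values with two digits, a hyphen, and four digits.
pattern = generate_regex_from_example("12-3456")
account = extract_field_from_text("Account number: 98-7654", pattern)
```

A pattern generated this way is strict about length, so an example with four digits will not match a five-digit value. Whether that is acceptable depends on how uniform the statements are.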
Once development on Post-processing task - Group split files by account number has created the shared methods, create a new development branch for this issue, making sure to create it from that Issue's branch (not main).
Edit: I have created a base-feature-iteration2 branch and merged changes from this feature branch into it. This means the "feature" branches for this task and https://github.com/emdeh/pdf-document-processor/issues/55 can be created from the base-feature-iteration2 branch.
Then we PR the features into that base branch, and then the base branch into main.
Description
Create a postprocessing task that identifies duplicate statements by checking if files have the same statement number, statement start date, and account number. If duplicates are found, move the duplicate files to a duplicates folder within the account-specific subfolders.
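The duplicate check described above could be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: move_duplicates and the get_key callable are hypothetical names, the key is assumed to be a (statement number, statement start date, account number) tuple, and the first file seen for a given key is treated as the original.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

# (statement number, statement start date, account number)
StatementKey = Tuple[Optional[str], Optional[str], Optional[str]]


def move_duplicates(
    account_folder: Path,
    get_key: Callable[[Path], StatementKey],
) -> List[Path]:
    """Move files whose key was already seen into account_folder/duplicates."""
    seen: Dict[StatementKey, Path] = {}
    moved: List[Path] = []
    duplicates_dir = account_folder / "duplicates"

    # glob("*.pdf") is not recursive, so the duplicates folder is not rescanned.
    for pdf_path in sorted(account_folder.glob("*.pdf")):
        key = get_key(pdf_path)
        if any(part is None for part in key):
            # A field could not be extracted; leave the file where it is.
            continue
        if key in seen:
            duplicates_dir.mkdir(exist_ok=True)
            target = duplicates_dir / pdf_path.name
            shutil.move(str(pdf_path), str(target))
            moved.append(target)
        else:
            seen[key] = pdf_path
    return moved
```

Files are processed in sorted name order so the choice of which copy stays in place is deterministic. Files with an incomplete key are skipped rather than guessed at, which errs on the side of not moving anything by mistake.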
To-Do List Overview:
Implement Statement Number Extraction: In PDFPostprocessor, add a method extract_statement_number(self, pdf_path).
Add Task to Task Registry: Register identify_and_move_duplicates in the task_registry.
Create Unique Identifiers:
Detect and Organize Duplicates:
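For the registry step, one possible shape is a plain name-to-callable mapping. The sketch below is an assumption about how task_registry might look, not a description of the existing code: run_task is a hypothetical helper and the task body is a placeholder.

```python
from typing import Callable, Dict


def identify_and_move_duplicates() -> str:
    """Placeholder task body; the real method would live on the post-processor class."""
    return "identify_and_move_duplicates ran"


# Assumed registry shape: task name -> callable taking no arguments.
task_registry: Dict[str, Callable[[], str]] = {
    "identify_and_move_duplicates": identify_and_move_duplicates,
}


def run_task(name: str) -> str:
    """Look up a task by name and run it, failing loudly on unknown names."""
    if name not in task_registry:
        raise KeyError(f"Unknown post-processing task: {name}")
    return task_registry[name]()
```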
Outline of New Code and Placement:
Add Statement Number Extraction Method to PDFPostprocessor: extract_statement_number.
Implement the Task Function: identify_and_move_duplicates in PDFPostprocessor.
Update Task Registry: Add identify_and_move_duplicates to the task_registry.
Other considerations