emdeh / pdf-document-processor


Post-processing task - Find and move duplicate files #56

Open emdeh opened 1 day ago

emdeh commented 1 day ago

Description

Create a post-processing task that identifies duplicate statements by checking whether files share the same statement number, statement start date, and account number. If duplicates are found, move them to a duplicates folder within the corresponding account-specific subfolder.
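A minimal sketch of the detection-and-move step, assuming the three fields have already been extracted for each file and that each file already sits in its account-specific subfolder. `StatementKey` and `move_duplicates` are hypothetical names for illustration, not existing code in the repo.

```python
import shutil
from collections import defaultdict
from pathlib import Path
from typing import NamedTuple


class StatementKey(NamedTuple):
    """Hypothetical composite identifier: two files with equal keys are duplicates."""
    account_number: str
    statement_number: str
    start_date: str


def move_duplicates(statements: dict[Path, StatementKey],
                    duplicates_dir_name: str = "duplicates") -> list[Path]:
    """Keep one file per key and move the rest into <account folder>/duplicates/.

    Assumes each path's parent directory is the account-specific subfolder.
    Returns the new locations of the moved files.
    """
    # Group file paths by their composite key.
    groups: dict[StatementKey, list[Path]] = defaultdict(list)
    for path, key in statements.items():
        groups[key].append(path)

    moved: list[Path] = []
    for paths in groups.values():
        # Sort so the choice of which file to keep is deterministic;
        # everything after the first is treated as a duplicate.
        for dup in sorted(paths)[1:]:
            target_dir = dup.parent / duplicates_dir_name
            target_dir.mkdir(exist_ok=True)
            target = target_dir / dup.name
            shutil.move(str(dup), str(target))
            moved.append(target)
    return moved
```

Which copy to keep (here, the first by sorted path) is an arbitrary choice in this sketch; the real task may prefer a different rule.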

To-Do List Overview:

Implement Statement Number Extraction:

Add Task to Task Registry:

Create Unique Identifiers:

Detect and Organize Duplicates:

Outline of New Code and Placement:

Add Statement Number Extraction Method to PDFPostprocessor:

Implement the Task Function:

Update Task Registry:
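As a rough illustration of the three outline items above, the sketch below shows one possible shape for them. The label pattern, the `TASK_REGISTRY` dictionary, the `register_task` decorator, and the task name are all assumptions; the project's actual registry and statement layout may look quite different.

```python
import re
from typing import Callable, Optional

# Assumed label format such as "Statement No: 123" or "Statement Number 0042".
# Real statements may use a different label, so treat this pattern as a placeholder.
STATEMENT_NUMBER_RE = re.compile(
    r"Statement\s+(?:No\.?|Number)\s*:?\s*(\d+)", re.IGNORECASE
)


def extract_statement_number(text: str) -> Optional[str]:
    """Return the first statement number found in the text, or None."""
    match = STATEMENT_NUMBER_RE.search(text)
    return match.group(1) if match else None


# Hypothetical registry mapping a task name to the function that runs it.
TASK_REGISTRY: dict[str, Callable[[str], None]] = {}


def register_task(name: str):
    """Decorator that adds a task function to the hypothetical registry."""
    def decorator(func: Callable[[str], None]) -> Callable[[str], None]:
        TASK_REGISTRY[name] = func
        return func
    return decorator


@register_task("find_duplicates")
def find_duplicates_task(input_dir: str) -> None:
    """Placeholder task body; the real task would scan input_dir for duplicates."""
    return None
```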


Other considerations

emdeh commented 1 day ago

I have outlined the following methods for finding and moving duplicate files among the split files (all in the PDFPostProcessor class in the postprocess_utils.py file).

This was done under "Post-processing task - Group split files by account number" as a general outline, since some of the methods will be used across tasks.

Other notes: We may need to call existing methods in pdf_processor.py to read the PDF (whether it is machine-readable or needs OCR).
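One way to picture that fallback, as a sketch only: the two reader functions are passed in as parameters because the actual method names in pdf_processor.py are not shown here. `read_pdf_text`, `extract_text`, `ocr_text`, and the `min_chars` threshold are all hypothetical.

```python
from typing import Callable


def read_pdf_text(path: str,
                  extract_text: Callable[[str], str],
                  ocr_text: Callable[[str], str],
                  min_chars: int = 20) -> str:
    """Try direct text extraction first and fall back to OCR.

    extract_text and ocr_text stand in for whatever pdf_processor.py provides.
    A file is treated as machine-readable if direct extraction yields at
    least min_chars non-whitespace-trimmed characters (an assumed heuristic).
    """
    text = extract_text(path)
    if len(text.strip()) >= min_chars:
        return text
    # Little or no embedded text: assume a scanned document and use OCR.
    return ocr_text(path)
```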

emdeh commented 1 day ago

Once development on "Post-processing task - Group split files by account number" has created the shared methods, create a new development branch for this issue, making sure to create it from that issue's branch (not main).

emdeh commented 11 hours ago

Edit: I have created a base-feature-iteration2 branch and merged changes from this feature branch into it. This means the "feature" branches for this task and https://github.com/emdeh/pdf-document-processor/issues/55 can be created from the base-feature-iteration2 branch.

Then we PR the features into that base branch, and then the base branch into main.