Open emdeh opened 4 days ago
I have outlined the following methods relating to appending the date range to filenames of split files (all in the PDFPostProcessor class in the postprocess_utils.py file).
This was done under "Post-processing task - Group split files by account number" as a general outline, since some of the methods will be used across tasks.
def get_pattern_from_user(self, field_name):
Prompts the user for an example and generates a regex pattern.
def generate_regex_from_example(self, example):
Generates a regex pattern from a user-provided example.
def extract_field(self, pdf_path, pattern):
Extracts a field from a PDF using a regex pattern.
def format_date(self, date_str):
Formats a date string extracted from a PDF.
def add_date_prefix_to_filenames(self):
Adds the statement start date as a prefix to PDF filenames for chronological ordering.
Other notes: We will potentially need to call existing methods in pdf_processor.py to read the PDF (whether it is machine readable or needs OCR).
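To make the example-to-regex idea concrete, here is a minimal sketch of how generate_regex_from_example and the matching half of extract_field might behave. It is written as standalone functions rather than PDFPostProcessor methods, the generalisation rule (digits become \d, ASCII letters become [A-Za-z], everything else is matched literally) is an assumption rather than an agreed design, and extract_field_from_text is a hypothetical helper that works on already-extracted text instead of a PDF path.

```python
import re
from itertools import groupby


def _classify(ch):
    """Map one character of the example to a regex token (assumed rule)."""
    if "0" <= ch <= "9":
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    if ch.isspace():
        return r"\s"
    # Punctuation and anything else is matched literally.
    return re.escape(ch)


def generate_regex_from_example(example):
    """Generalise a user-provided example into a regex pattern string."""
    tokens = [_classify(ch) for ch in example]
    parts = []
    # Collapse runs of the same token, e.g. four digits -> \d{4}.
    for token, group in groupby(tokens):
        count = len(list(group))
        parts.append(token if count == 1 else f"{token}{{{count}}}")
    return "".join(parts)


def extract_field_from_text(text, pattern):
    """Return the first match of pattern in text, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


if __name__ == "__main__":
    pattern = generate_regex_from_example("01/02/2023")
    print(extract_field_from_text("Statement period 15/11/2024 to 14/12/2024", pattern))
```

In the real class, extract_field would first obtain the page text through the existing readers in pdf_processor.py and then apply the pattern in the same way.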
Once development on Post-processing task - Group split files by account number has created the shared methods, create a new development branch for this issue, making sure to create it from that Issue's branch (not main).
Edit: I have created a base-feature-iteration2 branch and merged changes from this feature branch into it. This means the "feature" branches for this task and #56 can be created from the base-feature-iteration2 branch.
Then we PR the features into that base branch, and then the base branch into main.
Description
Create a postprocessing task that adds the statement start date to the filenames of the split PDF files. The task should prefix the filenames within the account-specific subfolders with the date in YYYYMMDD format so that files can be ordered chronologically.
To-Do List Overview:
Implement Statement Start Date Extraction:
In PDFPostprocessor, develop a method extract_statement_start_date(self, pdf_path).
Add Task to Task Registry:
Register add_date_prefix_to_filenames in the task_registry.
Format Dates Consistently:
Format extracted dates as YYYYMMDD.
Rename Files to Include Date Prefix:
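As a rough sketch of the formatting and renaming steps in the to-do list, something like the following could work. It assumes the extracted date string is day-first (DD/MM/YYYY), which would need confirming against real statements, and add_date_prefix is a hypothetical standalone helper, not the final add_date_prefix_to_filenames task method.

```python
from datetime import datetime
from pathlib import Path


def format_date(date_str, input_format="%d/%m/%Y"):
    """Convert an extracted date string to YYYYMMDD.

    The default input_format is an assumption (day-first dates).
    Raises ValueError if date_str does not match input_format.
    """
    return datetime.strptime(date_str, input_format).strftime("%Y%m%d")


def add_date_prefix(pdf_path, date_str):
    """Rename pdf_path to '<YYYYMMDD>_<original name>' and return the new path."""
    pdf_path = Path(pdf_path)
    prefix = format_date(date_str)
    # Skip files that already carry this prefix so the task can be re-run safely.
    if pdf_path.name.startswith(prefix + "_"):
        return pdf_path
    new_path = pdf_path.with_name(f"{prefix}_{pdf_path.name}")
    pdf_path.rename(new_path)
    return new_path


if __name__ == "__main__":
    print(format_date("01/02/2023"))  # prints 20230201
```

Because YYYYMMDD prefixes sort lexicographically in date order, a plain alphabetical listing of each account subfolder then gives the chronological order.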
Outline of New Code and Placement:
Add Date Extraction Method to PDFPostprocessor:
extract_statement_start_date.
Implement the Task Function:
add_date_prefix_to_filenames in PDFPostprocessor.
Update Task Registry:
Add add_date_prefix_to_filenames to the task_registry.
Other considerations