emdeh / pdf-document-processor


The way document types are detected to begin splitting is prone to error #16

Open emdeh opened 3 months ago

emdeh commented 3 months ago

Context

The detect_document_type() function looks for keywords on the first 3 pages of a PDF to decide which of these two functions should process it:

find_standard_statement_starts(pdf_path): # Standard function for statements that have '1 of x' pattern

or

find_statement_numbers_starts(pdf_path): # New function for Bendigo Bank statements that don't have '1 of x' pattern

If a keyword for one document type also happens to appear in a document of the other type, the wrong function could be selected and the statements could be split incorrectly.
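
To illustrate the failure mode, here is a minimal sketch of first-match keyword detection. It is not the project's code: the keyword lists, the type names, and the detect_by_keyword() helper are all made up for this example, and the real detect_document_type() may behave differently.

```python
# Hypothetical keyword lists; the project's real keywords are not shown in this issue.
KEYWORDS = {
    "bendigo": ["Bendigo"],
    "standard": ["Account Summary"],
}


def detect_by_keyword(first_pages_text: str) -> str:
    """Return the first type whose keyword appears in the text (order-dependent)."""
    lowered = first_pages_text.lower()
    for doc_type, keywords in KEYWORDS.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return doc_type
    return "unknown"


# A statement from another bank that merely mentions Bendigo in a transaction line.
text = "Some Other Bank\nAccount Summary\nTransfer to Bendigo Bank account"
detected = detect_by_keyword(text)
print(detected)  # the "bendigo" keyword matches first, so the type is misdetected
```

Because both keywords occur in the text, the result depends only on which type is checked first.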

Relevant code

https://github.com/emdeh/pdf-document-processor/blob/10208d3eb26da23b89a6d83a6e1157cb3e41c530/src/pdf_processor.py#L50-L63

Potential solution

Suggest moving the user prompt that selects the document type from the YAML file back to the start of the run, so that the selected type can be used to choose the correct PDF processing function.
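
One way this could look is an explicit lookup from the user-selected type to the start-finding function, with no keyword sniffing involved. This is a rough sketch under assumptions: the type names "standard" and "bendigo", the START_FINDERS mapping, and select_start_finder() are hypothetical, and the two functions below are stubs standing in for the real ones in src/pdf_processor.py.

```python
from typing import Callable, Dict, List


def find_standard_statement_starts(pdf_path: str) -> List[int]:
    """Stub for the project's real function (statements with a '1 of x' pattern)."""
    return []


def find_statement_numbers_starts(pdf_path: str) -> List[int]:
    """Stub for the project's real function (Bendigo Bank statements)."""
    return []


# Hypothetical type names; in practice these would come from the YAML file.
START_FINDERS: Dict[str, Callable[[str], List[int]]] = {
    "standard": find_standard_statement_starts,
    "bendigo": find_statement_numbers_starts,
}


def select_start_finder(document_type: str) -> Callable[[str], List[int]]:
    """Return the start-finding function for the type the user selected."""
    try:
        return START_FINDERS[document_type]
    except KeyError:
        known = ", ".join(sorted(START_FINDERS))
        raise ValueError(
            f"Unknown document type {document_type!r}; expected one of: {known}"
        ) from None
```

With the type chosen once at the start of the run, the caller would do something like find_starts = select_start_finder(chosen_type) and then find_starts(pdf_path), and an unrecognised type fails loudly instead of being guessed.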