Context
The detect_document_type() function looks for keywords on the first 3 pages to determine whether the PDFs should be processed by:
find_standard_statement_starts(pdf_path): # Standard function for statements that have '1 of x' pattern
or
find_statement_numbers_starts(pdf_path): # New function for Bendigo Bank statements that don't have '1 of x' pattern
If a keyword for one document type also happens to appear in a document of the other type, the wrong function could be selected, introducing errors.
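To make the risk concrete, here is a minimal sketch of keyword-based detection going wrong. It is not the project's actual detect_document_type(); the keyword lists, the type names "standard" and "bendigo", and the helper detect_by_keywords() are all hypothetical and chosen only for illustration.

```python
# Hypothetical keyword lists; the real keywords are defined by the project.
KEYWORDS = {
    "standard": ["1 of"],
    "bendigo": ["Bendigo Bank"],
}


def detect_by_keywords(first_pages_text: str) -> str:
    """Return the first document type whose keyword appears in the text."""
    lowered = first_pages_text.lower()
    for doc_type, words in KEYWORDS.items():
        if any(word.lower() in lowered for word in words):
            return doc_type
    return "unknown"


# A Bendigo Bank statement that happens to contain "1 of" in a
# transaction description is classified as a standard statement.
bendigo_text = "Bendigo Bank Statement ... Instalment 1 of 3 paid"
detected = detect_by_keywords(bendigo_text)
```

In this sketch the Bendigo Bank text is routed to the standard path simply because "1 of" is checked first and appears in a transaction line.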
Relevant code
https://github.com/emdeh/pdf-document-processor/blob/10208d3eb26da23b89a6d83a6e1157cb3e41c530/src/pdf_processor.py#L50-L63
Potential solution
Suggest moving the user prompt that determines the document type from the YAML file back to the start of the run, so that the chosen type can be used to select the correct PDF processing function.
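One possible shape for this, as a rough sketch only: once the document type is known from the prompt, an explicit lookup could pick the processing function instead of keyword detection. The two functions below are empty stand-ins for the real ones in pdf_processor.py (their return type is assumed), and the type keys "standard" and "bendigo", the STATEMENT_START_FINDERS table, and select_start_finder() are hypothetical names.

```python
from typing import Callable, Dict, List


def find_standard_statement_starts(pdf_path: str) -> List[int]:
    """Stand-in for the real function (statements with a '1 of x' pattern)."""
    return []


def find_statement_numbers_starts(pdf_path: str) -> List[int]:
    """Stand-in for the real function (Bendigo Bank statements)."""
    return []


# Hypothetical mapping from the user-selected document type to its function.
STATEMENT_START_FINDERS: Dict[str, Callable[[str], List[int]]] = {
    "standard": find_standard_statement_starts,
    "bendigo": find_statement_numbers_starts,
}


def select_start_finder(document_type: str) -> Callable[[str], List[int]]:
    """Return the processing function for the type chosen at the prompt."""
    try:
        return STATEMENT_START_FINDERS[document_type]
    except KeyError:
        known = ", ".join(sorted(STATEMENT_START_FINDERS))
        raise ValueError(
            f"Unknown document type {document_type!r}; expected one of: {known}"
        ) from None


# Usage: the type comes from the prompt at the start of the run.
finder = select_start_finder("bendigo")
starts = finder("statements.pdf")
```

With this approach an unrecognised type fails loudly at the start of the run rather than silently falling through to the wrong function.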