emdeh / pdf-document-processor

Extraction functions may be unnecessarily manipulating data types #7

Open emdeh opened 4 months ago

Both process_transactions() and extract_summary_values_and_confidence() contain logic that converts string values to numbers when the value is an amount.

https://github.com/emdeh/pdf-document-processor/blob/cb5414a78a2193739ad979a60229cb0f8fb3e90e/src/csv_utils.py#L44-L53

https://github.com/emdeh/pdf-document-processor/blob/cb5414a78a2193739ad979a60229cb0f8fb3e90e/src/csv_utils.py#L97-L111
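For readers without the linked lines open, conversion logic of this kind generally looks something like the sketch below. This is not the code in csv_utils.py; parse_amount is a hypothetical helper, and the assumed input format (optional currency symbol, thousands separators) is a guess at what the extracted strings look like.

```python
import re


def parse_amount(value):
    """Hypothetical helper: turn an amount string such as '$1,234.56' into a float.

    Values that are already numeric, or that do not look like amounts,
    are returned unchanged.
    """
    if not isinstance(value, str):
        # Already typed by the model (or not a string at all): leave it alone.
        return value

    # Assumed format: drop currency symbols, thousands separators and whitespace.
    cleaned = re.sub(r"[$,\s]", "", value)

    # Only convert strings that are plain signed decimals after cleaning.
    if re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return float(cleaned)
    return value
```

If the model already returns typed numbers, a helper like this simply passes them through, which is the case where the conversion becomes redundant.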

Statement models trained from 4/4/24 onwards set the value's data type within the Custom Extraction Models themselves, which means this logic may no longer be required.

However, this still needs to be verified: without this logic, the values must still be written with the correct data type. The way analyse_document() returns the extracted data may need to be reviewed as well.
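One way to run that check is sketched below, assuming the extracted fields can be flattened into a plain dict of field name to value. The function name, the dict shape, and the field names in the usage example are all hypothetical and are not taken from analyse_document().

```python
def find_non_numeric_amounts(fields, amount_fields):
    """Return {field_name: type_name} for amount fields that are not int or float.

    `fields` is assumed to be a dict of extracted values keyed by field name;
    `amount_fields` lists the names expected to hold numeric amounts.
    """
    problems = {}
    for name in amount_fields:
        value = fields.get(name)  # a missing field shows up as NoneType
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems[name] = type(value).__name__
    return problems


# Usage with made-up field names: "Fee" is still a string, so it is reported.
extracted = {"ClosingBalance": 1520.75, "Fee": "12.50"}
report = find_non_numeric_amounts(extracted, ["ClosingBalance", "Fee"])
```

An empty report with the conversion logic disabled would suggest the model is already returning the right types for those fields.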

If the logic is indeed redundant, the Amex model will need to be retrained with the correct data types set on the labels.