ReceiptManager / receipt-parser-legacy

A supermarket receipt parser written in Python using tesseract OCR
https://tech.trivago.com/2015/10/06/python_receipt_parser/
Apache License 2.0
806 stars, 198 forks

allow custom (per-market) regex config #159

Closed smallstepman closed 2 years ago

smallstepman commented 2 years ago

The proposed change allows the user to provide the following config.yml:

language: deu

receipts_path: "data/txt"

results_as_json: false

markets:
  Colruyt:
     - colruyt
     - Colruyt
  Migros:
     - genossenschaft migros
  Metro:
     - vetro
     - metro

sum_keys:
  - summe

sum_keys_metro:
  - something
  - sumthing

ignore_keys:
  - mwst

ignore_keys_migros:
  - something something

sum_format: '\d+(\.\s?|,\s?|[^a-zA-Z\d])\d{2}'
sum_format_colruyt: '[0-9a-f]*'

item_format: '([a-zA-Z].+)\s(-|)((\d|\d{2}),(\d{2}|\d{3}))\s'
item_format_metro: '[0-9]\s(.*?)\d.()((\d|\d{2})(\,|\.)\d{1,2})\s([A|a]|[B|b])'
item_format_migros: '[0-9a-f]*'

date_format: '((\d{2}\.\d{2}\.\d{2,4})|(\d{2,4}\/\d{2}\/\d{2})|(\d{2}\/\d{2}\/\d{4}))'

This means the script will respect custom, per-market configs, such as `sum_keys_metro` or `item_format_migros` in the example above.
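As a rough illustration of how such a per-market lookup might work, here is a minimal sketch. It is not the code from this PR: the helper names `detect_market` and `get_setting`, the trimmed-down `CONFIG` dict, and the rule "try `<key>_<market in lowercase>` first, then fall back to the plain `<key>`" are all assumptions inferred from the key names in the config above.

```python
# Hypothetical, trimmed-down copy of the config above, already loaded into a dict.
CONFIG = {
    "markets": {
        "Colruyt": ["colruyt", "Colruyt"],
        "Migros": ["genossenschaft migros"],
        "Metro": ["vetro", "metro"],
    },
    "sum_keys": ["summe"],
    "sum_keys_metro": ["something", "sumthing"],
    "ignore_keys": ["mwst"],
    "ignore_keys_migros": ["something something"],
    "sum_format": r"\d+(\.\s?|,\s?|[^a-zA-Z\d])\d{2}",
    "sum_format_colruyt": r"[0-9a-f]*",
}


def detect_market(text, config):
    """Return the first market whose spelling occurs in the text, else None."""
    lowered = text.lower()
    for market, spellings in config["markets"].items():
        if any(spelling.lower() in lowered for spelling in spellings):
            return market
    return None


def get_setting(config, key, market=None):
    """Prefer '<key>_<market>' (assumed naming rule), fall back to '<key>'."""
    if market is not None:
        specific_key = f"{key}_{market.lower()}"
        if specific_key in config:
            return config[specific_key]
    return config[key]


if __name__ == "__main__":
    market = detect_market("METRO Cash & Carry\nsumthing 12,50", CONFIG)
    print(market)
    print(get_setting(CONFIG, "sum_keys", market))    # market-specific keys
    print(get_setting(CONFIG, "sum_format", market))  # falls back to the global regex
```

With this rule, a market without its own override (for example `sum_format` for Metro) simply uses the global value, so existing configs without any suffixed keys would keep working unchanged.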

monolidth commented 2 years ago

Great suggestion and implementation, @smallstepman! I really appreciate such contributions. One minor question: in commit https://github.com/ReceiptManager/receipt-parser-legacy/pull/159/commits/7f05c66309d59f530bfc44c745c72cd20c4ca392 I saw that you removed the replace statement. What was the reason for that?

Greetings from Karlsruhe ;)

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information

smallstepman commented 2 years ago

Thanks! Oops, not sure what happened with the replace. I reverted the changes.

monolidth commented 2 years ago

Wow, that was fast. Thanks for your contributions, flash