beancount / smart_importer

Augment Beancount importers with machine learning functionality.

Smart importer gives duplicate asset postings #130

Open jbrok opened 6 months ago

jbrok commented 6 months ago

Hi, first of all, thanks for providing this great package. I've been moving all of the family's small accounts to the Nordigen API in combination with smart_importer.

There's a problem that I'm struggling to debug. Specifically, about 30% of the transactions from one API importer (GoCardless/Nordigen) end up with three postings when using smart_importer. The issue doesn't occur with file-based imports (CSV, XLS, etc.). Initially I thought this was caused by a bug in the Nordigen beancounttools importer, but that doesn't seem to be the case. Here's an example:

2023-08-19 * "amazon.co.uk"
  creditorName: "Amazon.co.uk*1f37b5qz4"
  nordref: "64e135f0-75fa-XXXX-XXXXXX-XXXXXX"
  Expenses:Shopping
  Assets:Person1:Bank:Revolut:GBP <--- Randomly incorrectly added by smart_importer
  Assets:Person2:Bank:Revolut:GBP   -5.99 GBP

2023-11-24 * "Cloudflare"
  nordref: "6560fd83-XXX-XXXXX-XXXX-XXXXX"
  creditorName: "Cloudflare"
  original: "EUR 4.32"
  Assets:Person1:Bank:Monzo:Checking <--- Randomly incorrectly added by smart_importer
  Expenses:Shopping
  Assets:Person2:Bank:Revolut:EUR    -4.32 EUR

It always seems to add an extra, random Assets: posting. While researching this a while ago I stumbled upon a smart_importer caching issue, but that one has already been fixed.

My importer looks like this:

# Nordigen API accounts example
apply_hooks(nordigen.Importer(), [categories, PredictPostings(), DuplicateDetector(comparator=ReferenceDuplicatesComparator('nordref'), window_days=10)])

Removing PredictPostings() from here gives me the right results, so I've narrowed it down to smart_importer adding the incorrect postings.
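
For anyone reproducing this, a small toggle in config.py makes the with/without comparison easy to repeat. This is only a sketch that reuses the names from the snippet above; the imports for nordigen, categories, DuplicateDetector and ReferenceDuplicatesComparator are the same as in my real config and are omitted here.

# Hypothetical debug switch: flip PREDICT to False and re-run bean-extract to
# confirm the extra Assets: posting only appears when PredictPostings() is active.
from smart_importer import apply_hooks, PredictPostings

PREDICT = True

hooks = [categories,
         DuplicateDetector(comparator=ReferenceDuplicatesComparator('nordref'),
                           window_days=10)]
if PREDICT:
    hooks.insert(1, PredictPostings())  # same position as in the original hook list

CONFIG = [apply_hooks(nordigen.Importer(), hooks)]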

I call bean-extract like this:

# filter to only the .yaml files, to debug the Nordigen issue on a single account
❯ bean-extract config.py ./import-files/*.yaml -e main.beancount > tmp.beancount && code tmp.beancount
DEBUG:smart_importer.predictor:Loaded training data with 22022 transactions for account , filtered from 22022 total transactions
DEBUG:smart_importer.predictor:Trained the machine learning model.
DEBUG:smart_importer.predictor:Apply predictions with pipeline
DEBUG:smart_importer.predictor:Added predictions to 82 transactions
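
In case it helps anyone reproduce the DEBUG output above: the logger names (smart_importer.predictor) are plain Python logging loggers, so a minimal sketch for turning them on from the top of config.py looks like this (standard library only, nothing smart_importer-specific):

import logging

# Attach a handler and raise the level only for smart_importer's loggers.
logging.basicConfig(level=logging.INFO)
logging.getLogger("smart_importer").setLevel(logging.DEBUG)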

For the last few months I've been removing the extra postings with a regex find & replace, but I recently found out that they also break deduplication, so those transactions don't get deduplicated. I'm not sure if this is caused by how the API calls are made or if it's a smart_importer issue (it seems to be the latter). I also tried forking the code and limiting the prediction to a single posting, but that didn't help; the wrong posting apparently still had the highest prediction score.
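
Since the regex clean-up is effectively what I end up doing anyway, here is a rough sketch of the same workaround applied to the extracted entries instead of the text output. The heuristic (drop unit-less Assets: postings when the transaction already has an Assets: leg with an amount) is only my assumption based on the two examples above, not anything from smart_importer, and strip_spurious_asset_postings would have to be wired into a custom hook or post-processing script.

from beancount.core import data

def strip_spurious_asset_postings(entries):
    """Drop unit-less Assets: postings from transactions that already contain
    an Assets: posting with an amount (the real bank leg)."""
    cleaned = []
    for entry in entries:
        if isinstance(entry, data.Transaction):
            has_real_asset_leg = any(
                p.account.startswith("Assets:") and p.units is not None
                for p in entry.postings
            )
            if has_real_asset_leg:
                postings = [
                    p for p in entry.postings
                    if not (p.account.startswith("Assets:") and p.units is None)
                ]
                entry = entry._replace(postings=postings)
        cleaned.append(entry)
    return cleaned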

My beancount file with the training data contains neither errors nor transactions with three postings (checked with bean-check and custom scripts).

Any ideas that can point me in the right direction to a solution? Much appreciated!

dev590t commented 1 month ago

Similar issue for me: smart_importer works when I import from a local OFX file, but gives me three postings for some transactions when I use it with the tarioch/beancounttools Nordigen importer.

And it isn't easy to debug because I only get 4 API calls per day on my Nordigen account. The tarioch/beancounttools Nordigen importer downloads and transforms the data in the same function, extract(). Maybe I should make a PR to split it into two separate functions to make debugging easier.
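
Rough sketch of that split (the function names and the client call are hypothetical, not the actual tarioch/beancounttools API): fetch once and cache the raw payload, then iterate on the transform step offline without spending the daily Nordigen API calls.

import json
from pathlib import Path

CACHE = Path("nordigen_raw.json")

def download_transactions(client, account_id):
    """Single API call; persist the raw response so it can be replayed later."""
    raw = client.get_transactions(account_id)  # placeholder for the real call
    CACHE.write_text(json.dumps(raw))
    return raw

def transform_transactions(raw):
    """Pure function: raw API payload -> beancount entries.
    Re-run this against the cached file as often as needed while debugging."""
    raise NotImplementedError("mapping from the payload to beancount entries goes here")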