beancount / smart_importer

Augment Beancount importers with machine learning functionality.

Smart importer gives duplicate asset postings #130

Open jbrok opened 6 months ago

jbrok commented 6 months ago

Hi, first of all, thanks for providing this great package. I've been moving all of the family's small accounts to the Nordigen API in combination with smart_importer.

There's a problem that I'm struggling to debug. Specifically, about 30% of the transactions from one API importer (GoCardless/Nordigen) end up with three postings when using smart_importer. The issue doesn't occur with file-based imports (CSV, XLS, etc.). Initially I thought this was caused by a bug in the Nordigen beancounttools importer, but that doesn't seem to be the case. Here's an example:

2023-08-19 * "amazon.co.uk"
  creditorName: "Amazon.co.uk*1f37b5qz4"
  nordref: "64e135f0-75fa-XXXX-XXXXXX-XXXXXX"
  Expenses:Shopping
  Assets:Person1:Bank:Revolut:GBP <--- Randomly incorrectly added by smart_importer
  Assets:Person2:Bank:Revolut:GBP   -5.99 GBP

2023-11-24 * "Cloudflare"
  nordref: "6560fd83-XXX-XXXXX-XXXX-XXXXX"
  creditorName: "Cloudflare"
  original: "EUR 4.32"
  Assets:Person1:Bank:Monzo:Checking <--- Randomly incorrectly added by smart_importer
  Expenses:Shopping
  Assets:Person2:Bank:Revolut:EUR    -4.32 EUR

It always seems to add an extra, random Assets: posting. While researching this a while ago I stumbled upon a smart_importer caching issue, but that one has already been fixed.

My importer looks like this:

# Nordigen API accounts example
apply_hooks(nordigen.Importer(), [categories, PredictPostings(), DuplicateDetector(comparator=ReferenceDuplicatesComparator('nordref'), window_days=10)])

Removing PredictPostings() from here gives me the right results, so I've narrowed it down to smart_importer adding the incorrect postings.
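
For anyone reproducing this, a small toggle in config.py makes the with/without comparison easy to repeat. This is only a sketch that reuses the names from the snippet above; the imports for nordigen, categories, DuplicateDetector and ReferenceDuplicatesComparator are the same as in my real config and are omitted here.

# Hypothetical debug switch: flip PREDICT to False and re-run bean-extract to
# confirm the extra Assets: posting only appears when PredictPostings() is active.
from smart_importer import apply_hooks, PredictPostings

PREDICT = True

hooks = [categories,
         DuplicateDetector(comparator=ReferenceDuplicatesComparator('nordref'),
                           window_days=10)]
if PREDICT:
    hooks.insert(1, PredictPostings())  # same position as in the original hook list

CONFIG = [apply_hooks(nordigen.Importer(), hooks)]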

I call bean-extract like this:

# filter to only the .yaml files, to debug the Nordigen issue on a single account
❯ bean-extract config.py ./import-files/*.yaml -e main.beancount > tmp.beancount && code tmp.beancount
DEBUG:smart_importer.predictor:Loaded training data with 22022 transactions for account , filtered from 22022 total transactions
DEBUG:smart_importer.predictor:Trained the machine learning model.
DEBUG:smart_importer.predictor:Apply predictions with pipeline
DEBUG:smart_importer.predictor:Added predictions to 82 transactions
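
In case it helps anyone reproduce the DEBUG output above: the logger names (smart_importer.predictor) are plain Python logging loggers, so a minimal sketch for turning them on from the top of config.py looks like this (standard library only, nothing smart_importer-specific):

import logging

# Attach a handler and raise the level only for smart_importer's loggers.
logging.basicConfig(level=logging.INFO)
logging.getLogger("smart_importer").setLevel(logging.DEBUG)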

For the last few months I've been removing the extra postings with a regex find & replace, but I recently found out that they also break deduplication, so those transactions don't get deduplicated. I'm not sure if this is caused by how the API calls are made or if it's a smart_importer issue (it seems to be the latter). I also tried forking the code and limiting the prediction to a single posting, but that didn't help; the wrong posting apparently still had the highest prediction score.
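
Since the regex clean-up is effectively what I end up doing anyway, here is a rough sketch of the same workaround applied to the extracted entries instead of the text output. The heuristic (drop unit-less Assets: postings when the transaction already has an Assets: leg with an amount) is only my assumption based on the two examples above, not anything from smart_importer, and strip_spurious_asset_postings would have to be wired into a custom hook or post-processing script.

from beancount.core import data

def strip_spurious_asset_postings(entries):
    """Drop unit-less Assets: postings from transactions that already contain
    an Assets: posting with an amount (the real bank leg)."""
    cleaned = []
    for entry in entries:
        if isinstance(entry, data.Transaction):
            has_real_asset_leg = any(
                p.account.startswith("Assets:") and p.units is not None
                for p in entry.postings
            )
            if has_real_asset_leg:
                postings = [
                    p for p in entry.postings
                    if not (p.account.startswith("Assets:") and p.units is None)
                ]
                entry = entry._replace(postings=postings)
        cleaned.append(entry)
    return cleaned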

My beancount file with the training data contains neither errors nor transactions with three postings (checked with bean-check and custom scripts).

Any ideas that can point me in the right direction to a solution? Much appreciated!

dev590t commented 1 month ago

Similar issue for me: smart_importer works when I import from a local OFX file, but gives me three postings for some transactions when I use it with the tarioch/beancounttools Nordigen importer.

And it isn't easy to debug because I only get 4 API calls per day on my Nordigen account. The tarioch/beancounttools Nordigen importer downloads and transforms the data in the same function, extract(). Maybe I should make a PR to split it into two separate functions to make debugging easier.
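
Rough sketch of that split (the function names and the client call are hypothetical, not the actual tarioch/beancounttools API): fetch once and cache the raw payload, then iterate on the transform step offline without spending the daily Nordigen API calls.

import json
from pathlib import Path

CACHE = Path("nordigen_raw.json")

def download_transactions(client, account_id):
    """Single API call; persist the raw response so it can be replayed later."""
    raw = client.get_transactions(account_id)  # placeholder for the real call
    CACHE.write_text(json.dumps(raw))
    return raw

def transform_transactions(raw):
    """Pure function: raw API payload -> beancount entries.
    Re-run this against the cached file as often as needed while debugging."""
    raise NotImplementedError("mapping from the payload to beancount entries goes here")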