beancount / beangulp

Importers framework for Beancount
GNU General Public License v2.0
61 stars 24 forks source link

Deduplication across several input files #93

Closed nicdumz closed 1 year ago

nicdumz commented 3 years ago

Imagine a setup with:

documents/foo.csv
documents/bar.csv

And a static importer such as:

class StaticImporter(Importer):
  """No matter the file, identify it and yield the same transaction."""
  def identify(self, filepath):
    return True
  def account(self, filepath):
    return 'Dummy'
  def extract(self, filepath):
    return [DUP_TRANSACTION]

This situation is a simplified version of CSV reports with overlapping dates, something that happen if you're not carefully doing CSV exports at specific time boundaries. (e.g. export Jan-June.csv, then March-Sept.csv).

Running beangulp.Ingest (extract -o out.beancount) will create the following out.beancount with duplicates:

;; -*- mode: beancount -*-

**** documents/foo.csv

DUP_TRANSACTION

**** documents/bar.csv

DUP_TRANSACTION

I believe this happens because the _extract code only compares dups against already existing entries, but not against new files returned from walk().

I think it could instead be accumulating new entries when calling extract.extract_from_file(importer, filename, existing_entries), and instead call extract.extract_from_file(importer, filename, accumulated_existing_entries), so that duplicates across files are detected.

Thanks!

dnicolodi commented 3 years ago

Two bugs, two tickets, please. Otherwise discussing them is a mess. The first bug is solved in #64. I don't understand the second part: given the same inputs beangulp produces the same identical output. If this property would not hold it would be impossible to test it. It it impossible to get that output with the code currently in the master branch, thus I cannot reproduce.

nicdumz commented 3 years ago

Or we could consider #94 as a relatively simpler fix than #64 maybe?

nicdumz commented 3 years ago

(Split the second issue to #95)

dnicolodi commented 1 year ago

Fixed in #64.