Pass importer results to next importer in bean-extract

blais commented 6 years ago

Original report by Christoph Sarnowski (Bitbucket: csarn, GitHub: csarn).

I'm writing a couple of importers for personal use, and I am missing this feature.

Rationale: bean-extract runs all found (and supported) documents through their importer's extract method in one call. It also has a mechanism to flag duplicate transactions, but only if an existing beancount file is given. Duplicate transactions happen close to each other in time, so it will be very common that both parts of a duplication will be imported at the same time. Imagine I transfer money from one bank account to another, and I download the CSVs from both banks. Now bean-extract will find both sides of this transaction, but it can't detect them as duplicates.

So I would like to see bean-extract run the importer for one file, append the transactions to the list of existing entries, and then pass this updated list to the next importer run.
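A minimal sketch of the proposed behavior, for illustration only. It does not use the bean-extract code: `Txn`, `is_similar`, and `extract_chained` are hypothetical names, entries are plain dataclasses rather than beancount directives, and "similar" is assumed to mean the same absolute amount within a three-day window.

```python
from dataclasses import dataclass, replace
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Txn:
    day: date
    amount: Decimal          # absolute amount, a simplification
    source: str              # file the transaction was extracted from
    duplicate: bool = False


def is_similar(a, b, window_days=3):
    """Assumed similarity rule: same amount, dates at most window_days apart."""
    return a.amount == b.amount and abs((a.day - b.day).days) <= window_days


def extract_chained(batches, existing):
    """batches: list of (file name, transactions extracted from that file).

    Each file is compared against the existing entries plus everything
    extracted from the files processed before it.
    """
    seen = list(existing)
    result = []
    for _name, txns in batches:
        for txn in txns:
            is_dup = any(is_similar(txn, other) for other in seen)
            result.append(replace(txn, duplicate=is_dup))
        # Hand the new entries on to the next importer run.
        seen.extend(txns)
    return result


batches = [
    ("bank_a.csv", [Txn(date(2018, 12, 28), Decimal("100"), "bank_a.csv")]),
    ("bank_b.csv", [Txn(date(2018, 12, 30), Decimal("100"), "bank_b.csv")]),
]
for txn in extract_chained(batches, existing=[]):
    print(txn.source, txn.day, "duplicate" if txn.duplicate else "new")
```

With an empty existing ledger, the second file's side of the transfer is flagged because the first file's entries were already added to the list.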

In case this behavior is not universally useful, I'd suggest adding a command-line switch to bean-extract to enable it.

Any comments or suggestions? Would you accept this feature?

blais commented 5 years ago

Original comment by Johannes Harms (Bitbucket: johannesjh, GitHub: johannesjh).

> bean-extract runs all found (and supported) documents through their importer's extract method in one call.

Not necessarily. It is possible to run bean-extract on just a single file, e.g., on ~/Downloads/data.csv instead of on the whole ~/Downloads directory.

> I transfer money from one bank account to another, and I download the CSVs from both banks. Now bean-extract will find both sides of this transaction, but it can't detect them as duplicates.

Two workarounds:

1. First extract data from just one account and add the extracted transactions to your existing data. Then call bean-extract again to extract data from the other account. Bean-extract is then able to find the duplicates. (Note: Fava's web GUI also guides you to import one file after the other, instead of all files at once).
2. Use transfer accounts if you transfer money between two of your own accounts. This eliminates duplicates if you happen to import transactions from both accounts. For example:

#!beancount
; While importing transactions from account A: 
; If you encounter a transaction coming from or going to another account of yours, 
; then use a transfer account in the second posting, like this:
2018-12-28 * "From account A to account B"
  Assets:Bank:A -100 EUR
  Assets:Transfer

; While importing transactions from account B: 
; If you encounter a transaction coming from or going to another account of yours, 
; then use a transfer account in the second posting, like this:
2018-12-30 * "From account A to account B"
  Assets:Bank:B 100 EUR
  Assets:Transfer

dnicolodi commented 3 years ago

@csarn There is work underway to restructure the importers interface and the internals of the importing mechanism. This is a good time to think about supporting new features or tweaking existing features as is proposed in this ticket.

Thinking about implementing what you proposed, I see one possible issue: the order in which documents are imported is not deterministic, because it depends on the order in which files are discovered (note that no sorting strategy based on file names helps here, because the file names can be anything). This makes the behavior of the deduplication mechanism very hard to explain. Currently, newly imported transactions are marked as duplicates if a similar transaction is found in the existing ledger. With the behavior you suggest, transactions would be marked as duplicates if a similar transaction is found either in the existing ledger or in a pseudo-random subset of the transactions that are part of the current ingest run. It would therefore not be possible to know whether the earlier-in-time or the later-in-time transactions are the ones marked as duplicates. As a consequence, balance directives in the imported ledger would fail randomly.
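To make the ordering concern concrete, here is a small sketch, not taken from the beangulp code. `mark_duplicates` is a hypothetical helper, entries are reduced to `(date, amount)` tuples, and the three-day matching window is an assumption. The same two files are processed in both possible orders.

```python
from datetime import date


def mark_duplicates(batches, window_days=3):
    """Return the (file name, entry) pairs flagged as duplicates.

    batches is a list of (file name, [(date, amount), ...]) in processing
    order; each file is compared against all files processed before it.
    """
    seen = []
    flagged = []
    for name, entries in batches:
        for day, amount in entries:
            if any(
                amount == other_amount
                and abs((day - other_day).days) <= window_days
                for other_day, other_amount in seen
            ):
                flagged.append((name, (day, amount)))
        seen.extend(entries)
    return flagged


bank_a = ("bank_a.csv", [(date(2018, 12, 28), 100)])
bank_b = ("bank_b.csv", [(date(2018, 12, 30), 100)])

print(mark_duplicates([bank_a, bank_b]))  # flags the 2018-12-30 entry
print(mark_duplicates([bank_b, bank_a]))  # flags the 2018-12-28 entry
```

Which date survives depends only on the discovery order, so a balance directive dated between the two entries could pass in one run and fail in another.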

The more I think about it, the more it seems preferable to move deduplication into its own bean-gadget uniq tool. I think I'll be prototyping that soon.

blais commented 3 years ago

About dedup: I think it makes sense to think of deduplication as importer specific, because some of the time you have unique ids and when you do, those are the best way to do this.
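A rough illustration of why unique ids are preferable when the source provides them. The entry layout (plain dicts with `id`, `date`, and `amount` keys) and both function names are assumptions made for this sketch, not part of any importer API.

```python
def flag_by_heuristic(new, existing):
    """Flag an entry as duplicate when date and amount both match."""
    known = {(e["date"], e["amount"]) for e in existing}
    return [(n["date"], n["amount"]) in known for n in new]


def flag_by_id(new, existing):
    """Flag an entry as duplicate when its unique id is already known."""
    known = {e["id"] for e in existing}
    return [n["id"] in known for n in new]


existing = [{"id": "A1", "date": "2021-03-01", "amount": "-4.50"}]
new = [
    # Genuinely the same transaction as the one in the ledger.
    {"id": "A1", "date": "2021-03-01", "amount": "-4.50"},
    # A second, distinct purchase on the same day for the same amount.
    {"id": "A2", "date": "2021-03-01", "amount": "-4.50"},
]

print(flag_by_heuristic(new, existing))
print(flag_by_id(new, existing))
```

The heuristic flags both new entries, while the id-based check flags only the first, so the second purchase is not lost.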


dnicolodi commented 3 years ago

I agree. I'm thinking about extending the importers interface to allow the importers to specify a deduplication function.
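One way such an extension could look, sketched under loud assumptions: `BaseImporter`, `BankWithIds`, `PlainBank`, and `run` are hypothetical names invented here, entries are plain dicts, and none of this is claimed to match the interface that was eventually implemented.

```python
class BaseImporter:
    """Hypothetical importer base class with an overridable dedup hook."""

    def extract(self, filepath):
        raise NotImplementedError

    def deduplicate(self, entries, existing):
        """Default heuristic: drop entries whose date and amount are known."""
        known = {(e["date"], e["amount"]) for e in existing}
        return [e for e in entries if (e["date"], e["amount"]) not in known]


class PlainBank(BaseImporter):
    """Importer without unique ids; relies on the default heuristic."""

    def __init__(self, rows):
        self.rows = rows

    def extract(self, filepath):
        # A real importer would parse filepath; the rows are canned here.
        return list(self.rows)


class BankWithIds(PlainBank):
    """Importer whose source provides unique ids, so it overrides the hook."""

    def deduplicate(self, entries, existing):
        known = {e.get("id") for e in existing}
        known.discard(None)  # existing entries without an id never match
        return [e for e in entries if e["id"] not in known]


def run(importer, filepath, existing):
    """Extract one file and let the importer itself decide what is new."""
    return importer.deduplicate(importer.extract(filepath), existing)


existing = [{"id": "X1", "date": "2021-03-02", "amount": "10.00"}]
rows = [
    {"id": "X1", "date": "2021-03-02", "amount": "10.00"},
    {"id": "X2", "date": "2021-03-02", "amount": "10.00"},
]
print(run(BankWithIds(rows), "statement.csv", existing))
print(run(PlainBank(rows), "statement.csv", existing))
```

The driver stays the same for every importer; only the importer that knows its data carries ids changes how duplicates are decided.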

dnicolodi commented 3 years ago

Funny, it seems that there was some intent to support the use case in this issue: https://github.com/beancount/beangulp/blob/master/beangulp/extract.py#L168. However, the code does not actually do it.

blais commented 3 years ago

Gosh, who knows what happens in my brain. I can't even remember what I ate for lunch.

dnicolodi commented 1 year ago

Implemented in #64.

beancount / beangulp

Pass importer results to next importer in bean-extract #9