How to pass in existing entries as training data other than the current bean file

mondjef commented 1 year ago

I have smart importers set up, decorated and working... however, I recently changed my beancount file structure from a single file to multiple files by year, for a number of reasons. As such, I have a 'main' beancount file that I point fava at; it includes some global files along with the current year's beancount file, where new entries are entered. At the beginning of every year I don't want my smart importer training on little or no data; instead I want to point the smart importer to an alternative beancount file that includes all historical entries for training.

I have pored over the code and tried all sorts of things to pass in this alternative training file, without luck... I can't even get it to try to load the file: every parameter I have tried so far raises an unexpected argument error, so I know I am not doing it right.

I saw issue #61 that also discusses this but still could not connect the dots.

Can someone please provide more guidance on what needs to be done, or add a bit more info to the documentation around this?

mondjef commented 1 year ago

Currently working with this config file....

import sys
from os import path

# Make the local importers and comparator modules importable.
sys.path.insert(0, path.dirname(__file__))

from importers import simplii
from FingerPrintDuplicatesComparator import DuplicatesComparator

from smart_importer import apply_hooks, PredictPayees, PredictPostings
from smart_importer.detector import DuplicateDetector

# Use a distinct name so the instance does not shadow the simplii module.
simplii_importer = simplii.SimpliiImporter()
apply_hooks(
    simplii_importer,
    [
        PredictPostings(),
        PredictPayees(),
        DuplicateDetector(comparator=DuplicatesComparator()),
    ],
)

CONFIG = [
    simplii_importer,
]

johannesjh commented 1 year ago

hi, existing entries can be specified as training data when calling bean-extract, see https://github.com/beancount/smart_importer#specifying-training-data ...does this help with what you want to achieve?

mondjef commented 1 year ago

hi, not really.... I have read that but cannot understand where and how to feed the training data into the importer decorated by the smart import hooks...i.e. what do I need to do, where, and what parameter is used to feed it in. I have tried 'training_data' and 'existing_entries' without success in many places.

bean-extract... this is a command line tool, no? How can this be integrated into my python smart importer? Is the workflow more like... use the bean-extract tool to read and process existing transactions, then take its output and somehow use it in the smart importer? As you can see I am a bit lost and confused with this part...

johannesjh commented 1 year ago

Some explanations about how the tools interact:

bean-extract is beancount's commandline tool for importing transactions from csv or other sources. when using this tool, you can specify the --existing <BEANCOUNT_FILE> argument. the existing entries from this file are then used in the entire import process. this works out-of-the-box, no need to configure/program anything regarding existing entries in your import config file. the flow is as follows:
- bean-extract reads existing entries from the beancount file specified in the --existing argument.
- bean-extract will pass these existing entries to the importer
- when bean-extract invokes your importer, the smart_importer hook (that you applied to your importer) is called. the hook receives the existing entries as argument and uses them to train the machine learning model.
- the hook modifies the output produced by the original importer, e.g., to predict postings and payees.
- bean-extract writes the output (i.e., the imported entries) to stdout.
- it's up to you to copy/write/append the output into your beancount files.

You can alternatively use fava instead of bean-extract. fava assumes that the file you are viewing/editing already contains the existing entries. the flow is otherwise very similar to bean-extract:
- fava passes entries from the currently opened file as "existing entries" to the importer.
- when fava invokes your importer, the smart_importer hook (that you applied to your importer) is called. the hook receives the existing entries as argument and uses them to train the machine learning model.
- the hook modifies the output produced by the original importer, e.g., to predict postings and payees.
- fava appends the output (i.e., the imported entries) to the currently opened main beancount file by default. alternatively, you can mark the place where you want imported entries to be written using fava's insert-entry option, see fava's import help page and some more explanation in fava #1262.
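
The flow described above can be illustrated with a small, self-contained sketch. It is a simplified imitation of the wrapping idea, not smart_importer's actual code: `ToyImporter`, `ToyPredictHook` and `toy_apply_hooks` are made-up names, entries are plain dicts rather than beancount directives, and the "training" is only a dictionary lookup standing in for the machine learning model.

```python
from functools import wraps


class ToyImporter:
    """Stand-in for a beancount importer; returns entries without predictions."""

    def extract(self, file, existing_entries=None):
        return [{"narration": "COFFEE SHOP", "account": None}]


class ToyPredictHook:
    """Stand-in for a prediction hook such as PredictPostings."""

    def __call__(self, importer, file, imported_entries, existing_entries):
        # "Train" on the existing entries: remember the account per narration.
        known = {e["narration"]: e["account"] for e in existing_entries or []}
        # "Predict": fill in the account for narrations that were seen before.
        return [
            {**entry, "account": known.get(entry["narration"], entry["account"])}
            for entry in imported_entries
        ]


def toy_apply_hooks(importer, hooks):
    """Wrap importer.extract so that every hook receives the existing entries."""
    original_extract = importer.extract

    @wraps(original_extract)
    def patched_extract(file, existing_entries=None):
        entries = original_extract(file, existing_entries=existing_entries)
        for hook in hooks:
            entries = hook(importer, file, entries, existing_entries)
        return entries

    importer.extract = patched_extract
    return importer


# The caller (bean-extract or fava) is the one that supplies existing entries.
existing = [{"narration": "COFFEE SHOP", "account": "Expenses:Coffee"}]
importer = toy_apply_hooks(ToyImporter(), [ToyPredictHook()])
result = importer.extract("statement.csv", existing_entries=existing)
print(result[0]["account"])  # prints "Expenses:Coffee"
```

The point of the sketch is that the importer config never names the training data: whatever calls `extract` decides which existing entries the hooks get to see.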

johannesjh commented 1 year ago

A suggestion regarding your file structure, my setup works like this, and it might also work for you.

I have the following file and folder structure:

main.beancount (this is the overall main beancount file, it includes each and every year)
2022/2022.beancount (this is the main file for year 2022, obviously... it may include further sub-files)
2023/2023.beancount (this is the main file for year 2023)

And I use smart_importer together with fava like this:

- I open main.beancount in fava.
- I trigger the import process using fava's gui.
- Fava automatically passes all entries (from all years) to the importer.
- The importer receives all these existing entries from fava.
- In the same way, the smart_importer hook (that I applied to my importer) receives all the existing entries and uses them to train the machine learning model.

mondjef commented 1 year ago

my file/folder setup is similar to yours, with the exception that my main.beancount file points to a single fiscal file. I also have an alternative main.beancount file that is exactly like yours: it includes all prior fiscal files (or selected ones) that I want to use for ML training, so that I can filter/restrict what is fed in as training data.

Ok, this is a bit clearer to me now... as I use fava and want everything to remain in the workflow that fava uses for importing, I would need a way to ignore what fava passes as 'existing entries' and replace it with what I want. My guess is that I would need to do this somewhere in my importer itself?
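
For illustration only, here is a rough sketch of what such an override could look like. It is not a documented smart_importer or fava feature. `use_training_entries`, `load_history`, `RecordingImporter` and the file name `all-years.beancount` are hypothetical, and the sketch assumes the caller hands existing entries to the importer through the `existing_entries` keyword of `extract`. The wrapper would have to be applied after `apply_hooks`, so that the hooks also receive the substituted entries.

```python
from functools import wraps


def use_training_entries(importer, load_training_entries):
    """Make importer.extract ignore the caller's existing entries.

    load_training_entries is a hypothetical zero-argument callable that
    returns the entries to train on instead.
    """
    wrapped_extract = importer.extract

    @wraps(wrapped_extract)
    def extract(file, existing_entries=None):
        # Discard what the caller passed in and substitute our own entries.
        return wrapped_extract(file, existing_entries=load_training_entries())

    importer.extract = extract
    return importer


def load_history():
    """Load entries from a hypothetical file that includes all past years."""
    # beancount.loader.load_file returns (entries, errors, options_map).
    from beancount import loader

    entries, _errors, _options = loader.load_file("all-years.beancount")
    return entries


# In an import config this might be used roughly as follows (assumption):
#   apply_hooks(my_importer, [PredictPostings(), PredictPayees()])
#   use_training_entries(my_importer, load_history)


class RecordingImporter:
    """Stand-in importer that records which existing entries it was given."""

    def extract(self, file, existing_entries=None):
        self.seen = existing_entries
        return []


# Demo with a stub loader, so that beancount is not needed to run the sketch.
demo = use_training_entries(RecordingImporter(), lambda: ["historical entry"])
demo.extract("statement.csv", existing_entries=["entry passed by fava"])
print(demo.seen)  # prints ['historical entry']
```

One caveat with this approach: the substitution affects every hook applied to the importer, so a duplicate detector would also compare against the historical file only, and would no longer see entries from the current year's file.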

johannesjh commented 1 year ago

> My guess I would need to do this somewhere in my importer itself?

maybe... but you are leaving the path of what seems to be the default / recommended usage of fava, so this will certainly take some tinkering.

> ...as I use fava and want to have everything remain in the work flow that fava uses for importing...

Yes, keeping everything aligned with fava's intended workflow is exactly what I would recommend to you as well. Taking this philosophy one step further: Simply open your main.beancount file (which includes all other files) in fava. Problem solved, no need to tinker. :-)

Let's close this ticket. (I don't think the smart_importer project can or should change for what we've been discussing in this ticket).

johannesjh commented 1 year ago

PS just to make sure, I guess (I hope) you are aware of the fact that fava allows you to edit any file referenced by the main file? the file chooser dropdown is in the top left corner:

[Screenshot: fava-multiple-files]

mondjef commented 1 year ago

I was not aware of the file chooser in fava, and I don't seem to have it in my fava instance even though I have include statements that reference other files in the beancount file that I am currently pointing fava at. Is there anything special that needs to be done to have this option show up in fava? In light of this option I will have to rethink my strategy a bit, so that I tinker with fava as little as possible.

edit: from what I can tell there might be a shortcoming in the docker version of fava and the environment variable that is passed in at container creation to indicate the bean file. I can get the file chooser to show up if I supply multiple files in this environment variable, separated by a comma and a space (it does not work without the space) and surrounded by quotes. However, it does not work well... it appears as though the files are just being clobbered depending on the order, so I am not sure the docker image supports this feature.

johannesjh commented 1 year ago

right, this was an older screenshot taken from the internet, sorry.

I tried again using a recent version of fava on my local machine... it looks like the file chooser has moved to the editor's file menu, as in this screenshot:

beancount/smart_importer issue #121