beancount / smart_importer

Augment Beancount importers with machine learning functionality.
MIT License
248 stars 29 forks source link

apply decorator to importer class #9

Closed tarioch closed 6 years ago

tarioch commented 6 years ago

If I understood it right, beancount comes with a csv parser which can be configured and you don't need to actually subclass it. So I think this would be simply a configuration in your foo.import file. It would be nice if we could then configure it somehow to add smart importer functionality without having to extend the existing importer to add the interceptor.

johannesjh commented 6 years ago

I agree. For me, that's the central question for this whole undertaking, namely: how to integrate with beancount and fava. What I mean by that is, your considerations, how I understand them, go very much in the same direction as #7 and beancount/fava#579, i.e., now that we have the functionality for enhancing imports through machine learning, how should this functionality be integrated with beancount and fava, and how should it be accessible to end users? Some goals for this integration:

In summary, I think we need to go talk with beancount and fava about this. It seems to me that beancount/fava#579 would be a suitable place for such discussion, or we could open a ticket in beancount/beancount.

tarioch commented 6 years ago

I'll still have to dig a bit deeper, but I think there are at least some classes around in beancount

https://bitbucket.org/blais/beancount/src/810f839692f7a7e89a488b20217f60eeeddb4474/beancount/ingest/importers/?at=default

for dealing with CSV and OFX. The way it looks is you can instantiate them in a my.import config the way you need it. The way beancount does my.beancount it's currently python as well, so shouldn't be too hard to allow to wrap the importer with the predictPostings/predictPayees. So given the following situation: I know how to program in python, I would like to instantiate beancount.ingest.importer.Csv and have it wrapped by the predictPostings. How would I do this without modifying beancount.ingest.importer.Csv?

As for general improvements to import configs. I could see a simple tool which might generate the config the way you need it. Similar to the import wizard of Gnucash, first you specify things like separator, date formats and then you pick the columns you're interested in and define which kind of data in each column is. Now add to that the option of "autocomplete opposite site of the transaction".

johannesjh commented 6 years ago

How would I do this without modifying beancount.ingest.importer.Csv?

well, something along those lines...

import os
from beancount.ingest import importer, cache
from beancount.ingest.cache import _FileMemo
from smart_importer.predict_postings import PredictPostings

class MyImporter(importer.ImporterProtocol):
    '''
    This is my importer, possibly inheriting from or using beancount.ingest.importer.Csv,
    or otherwise written by hand.

    Note: This importer purposely includes no machine learning, 
    in order to make it easier to do regression testing using `beancount.ingest.regression`.
    '''
    # whatever implementation you see fit, this is a regular beancount importer
    pass

class MySmartImporter(MyImporter):
    '''
    A smart version of the importer.
    '''

    @PredictPostings(
        training_data=cache.get_file(
            os.path.abspath(os.path.join(
                os.path.dirname(__file__), 'myfile.beancount'))
        ),
        filter_training_data_by_account='Assets:My:Account'
    )
    def extract(self, file: _FileMemo):
        return super().extract(file)

As for general improvements to import configs. I could see a simple tool...

yes, exactly!

tarioch commented 6 years ago

What I was thinking is to not having to subclass the importer at all. I'm still pretty fresh with python so maybe I'm missing the obvious thing, right now the way I see to be able to do this would be to create a generic class "SmartImporter" which takes an instance of another importer and would look something like this:

class SmartImporter(importer.ImporterProtocol):
    def __init(self, real_importer: importer.ImporterProtocol)

...
    def extract(self, *args, **kwargs):
        self.real_importer.extract(args, kwargs)
...

and then the usage in the import config would simply be

CONFIG = [
    SmartImporter(importers.csv.Importer(...))
]

What feels a bit off is that @PredictPostings/@PredictPayees is already something which wraps around a real implementation so having to do this again manually feels strange. I'm wondering if there isn't a better way to instantiate @PredictPostings/@PredictPayees from the outside.

johannesjh commented 6 years ago

I am not a Python guru either, but what you are suggesting ("a generic class "SmartImporter" which takes an instance of another importer") sounds a bit like class decorators, compare e.g., https://www.codementor.io/sheena/advanced-use-python-decorators-class-function-du107nxsv#decorating-classes

Since decorators in pythons are simple Callables, it should be possible to call them directly instead of annotating them to a class. In other words, you are free to either decorate the class, as in the following example:

@PredictPostings
class MyCustomCSVImporter(CsvImporter):
    pass

CONFIG = [
    MyCustomCSVImporter(whatever, arguments)
]

...or to directly call the same decorator with an importer instance, inline in your config file, if you wish, as follows:

CONFIG = [
    PredictPostings(CsvImporter(whatever, arguments))
]

Decorating the importer classes instead of the extract methods would also be promising for another reason: You brought up in issue #10 the fact that beancount's importers can now officially be instantiated with pre-existing entries. This information would of course be highly interesting for the machine learning functionality! - By decorating the importer class instead of just the extract method, the decorator and thus the machine learning could maybe access this information? (I have not checked the beancount code yet to see if that would be possible).

tarioch commented 6 years ago

12 Is using exactly this, now I annotate my importers simply with @PredictPostings() and it will automatically use my existing entries as training data :)

johannesjh commented 6 years ago

oh right.

but still: if we would decorate importer classes instead of extract functions, this would allow you to write...

CONFIG = [
    PredictPostings(CsvImporter(whatever, arguments))
]

...because that's what you originally wanted in this ticket, isn't it?

tarioch commented 6 years ago

Yes exactly.

johannesjh commented 6 years ago

I spent a few hours (took me much longer than I anticipated) to refactor predict_postings into a decorator for classes. New usage, with the decorator applied to the class:

@PredictPostings(training_data=self.training_data)
class DummyImporter(ImporterProtocol):
    def extract(self, file, existing_entries=None):
        return the.imported.transactions

The refactoring of the second decorator for predicting payees is still missing. But would you mind having a look at the new code in branch "decorator-for-importer-classes", I would welcome your feedback, thank you! Johannes

tarioch commented 6 years ago

I'm probably doing something wrong but I'm getting the following exception with my importer

from dateutil.parser import parse

from beancount.ingest import importer
from beancount.core import data
from beancount.core import amount
from beancount.core.number import D
from beancount.ingest.importers import regexp

from smart_importer.predict_postings import PredictPostings

@PredictPostings()
class Importer(regexp.RegexpImporterMixin, importer.ImporterProtocol):
    """An importer for account statements."""

    def __init__(self, regexps):
        if isinstance(regexps, str):
            regexps = [regexps]
        regexp.RegexpImporterMixin.__init__(self, regexps)

    def file_account(self, file):
        return 'Assets:Foo'

    def extract(self, file, existing_entries):
        entries = []
        meta = data.new_metadata(file.name, 0)
        txn = data.Transaction(
                meta,
                parse('2017-11-20').date(),
                '*',
                None,
                'Two Postings',
                data.EMPTY_SET,
                data.EMPTY_SET,
                [
                    data.Posting('Assets:Patrick:CHF', amount.Amount(D('12'), 'CHF'), None, None, None, None),
                    data.Posting('Assets:Patrick:USD', amount.Amount(D('12'), 'CHF'), None, None, None, None),
                    ]
                )

        entries.append(txn)
        txn = data.Transaction(
                meta,
                parse('2017-11-20').date(),
                '*',
                None,
                'Single Posting',
                data.EMPTY_SET,
                data.EMPTY_SET,
                [
                    data.Posting('Assets:Patrick:CHF', amount.Amount(D('12'), 'CHF'), None, None, None, None),
                    ]
                )

        entries.append(txn)

        return entries
Traceback (most recent call last):
  File "/home/patrickr/.local/share/virtualenvs/ledger-yg7yJ6bm/bin/bean-extract", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/home/patrickr/projects/beancount/bin/bean-extract", line 4, in <module>
    from beancount.ingest.extract import main; main()
  File "/home/patrickr/projects/beancount/beancount/ingest/extract.py", line 192, in main
    args, config, downloads_directories = scripts_utils.parse_arguments(parser)
  File "/home/patrickr/projects/beancount/beancount/ingest/scripts_utils.py", line 56, in parse_arguments
    mod = runpy.run_path(args.config)
  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "main.import", line 8, in <module>
    from importers.bitst import importer as bitstimp
  File "/home/patrickr/ledger/importers/bitst/importer.py", line 18, in <module>
    class Importer(importer.ImporterProtocol):
  File "/home/patrickr/ledger/importers/bitst/importer.py", line 27, in Importer
    @PredictPostings()
  File "/home/patrickr/projects/smart_importer/smart_importer/predict_postings.py", line 60, in __call__
    class PredictPostingsImporter(OriginalImporter):
TypeError: function() argument 1 must be code, not str
johannesjh commented 6 years ago

tried to but could not reproduce, see added file smart_importer/tests/predict_postings_example_test.py.

johannesjh commented 6 years ago

btw, I suggest we rename this ticket and name it "apply decorator to importer class", since that came out to be the intended solution?

tarioch commented 6 years ago

agreed to the rename

As for the issue, I'm wondering if there might be something with the interaction with bean-extract. I'm running the importer like this

bean-extract main.import downloads -f main.beancount

(using master version of beancount as the code for the existing_entries logic is not yet in the released build)

johannesjh commented 6 years ago

tried to reproduce by fleshing out the example. I am getting errors, but different than yours. I pushed broken code (only in the branch) so you can make the example more realistic, more similar to how you intend to use it. I have to interrupt now, can continue on this later. thx Johannes

tarioch commented 6 years ago

I'll do that as soon as I have some time, Thanks for preparing it that far

johannesjh commented 6 years ago

ok thanks

btw, I am considering to use a python decorator library instead of the DIY solution to simplify the code. apparently, how a decorator is called depends on how it is applied... PredictPostings is different from PredictPostings(), and these differences need to be handled in the decorator implemetation (or by using a library).

tarioch commented 6 years ago

My mistake, I had another importer which was picked up where I forgot to adjust the the annotation, will look at it some more.

tarioch commented 6 years ago

Did a small change to the example and now it's producing the same error I have locally.

Basically I made existing_entries non optional and I think the error it produced before was caused by the same thing, namely that existing_entries was None instead of the expected list.

johannesjh commented 6 years ago

I made existing_entries non optional

that breaks with the API defined in the ImporterProtocol because existing_entries should be a keyword argument, not a positional argument.

tarioch commented 6 years ago

I think I've narrowed it down some more. extract.py in beancount checks the signature and only passes the existing_entries if they exist in the signature, as the interceptor now doesn't have this in the signature it does not get passed and then things fail afterwards.

tarioch commented 6 years ago

Maybe using a library would allow to keep the signature intact?

johannesjh commented 6 years ago

not sure how a library could help in this case. - why not use a simple assertion that existing_entries have been passed? (and document the restriction that existing_entries have to passed as argument in the documentation)

def extract(self, file, existing_entries=None):
    '''
    :param existing_entries: must not be null since this importer relies on existing entries
    '''
    assert existing_entries
    # ... 
    # return the_imported_entries
tarioch commented 6 years ago

What I meant is the following: beancount checks the signature and only passes existing_entries if it's in the signature

from beancount/ingest/extract.py

    kwargs = {}
    if 'existing_entries' in inspect.signature(importer.extract).parameters:
        kwargs['existing_entries'] = existing_entries
    new_entries = importer.extract(file, **kwargs)

As the wrapper class does not have existing_entries in the signature, this will not work properly. So I think we'll have to add the existing_entries to our signature and maybe do something similar to the beancount code to pass or not pass it depending on the actual signature.

tarioch commented 6 years ago

I have it now working with the fixed (full) signature which includes existing_entries. I think to be backward compatible we probably need to do the same as beancount and check the signature and depending on that add or not add existing_entries

tarioch commented 6 years ago

Implemented :)

floco commented 6 years ago

Hello I'm busy writing an importer for my csv bank statements and smart importer looks like a very good and elegant way to predict postings and payees. My importer runs fine but as soon as I add the decorator unfortunately I get the same error than @tarioch above. TypeError: function() argument 1 must be code, not str Then I tried running the example from the branch "decorator-for-importer-classes" (launching ./run-bean-extract.sh) but then I get another error: ;; -*- mode: beancount -*- **** /Users/julien/Documents/4_Code/smart_importer/smart_importer/examples/downloads/downloaded.csv ERROR:root:Importer smart_importer.predict_postings.PredictPostingsImporter.extract() raised an unexpected error: 'NoneType' object is not iterable ERROR:root:Traceback: Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/beancount/ingest/extract.py", line 168, in extract allow_none_for_tags_and_links=allow_none_for_tags_and_links) File "/usr/local/lib/python3.6/site-packages/beancount/ingest/extract.py", line 64, in extract_from_file new_entries = importer.extract(file) File "/Users/julien/Documents/4_Code/smart_importer/smart_importer/predict_postings.py", line 73, in extract existing_entries File "/Users/julien/Documents/4_Code/smart_importer/smart_importer/predict_postings.py", line 86, in _extract self.converted_training_data = [TxnPosting(t, p) for t in self.training_data for p in t.postings TypeError: 'NoneType' object is not iterable I'm new to beancount and python so that probably doesn't help ;-) Thanks in advance for any help.

tarioch commented 6 years ago

If you use the master branch, then it should work fine with the decorator on the extract function.

floco commented 6 years ago

Thanks for your help :-) I was able to go a bit further after adding opening statements for each assets/expenses in my training_data. Otherwise the file would not get validated by Beancount loader. Now stuck as it seems to be unable to recognize the transactions from my importer extract function. Anyway I will keep looking :-)

DEBUG:smart_importer.machinelearning_helpers:Reading training data from file "/Users/julien/Documents/3_Finance/Beancount/trainingdata.beancount"... DEBUG:smart_importer.machinelearning_helpers:Finished reading training data. DEBUG:smart_importer.predict_postings:About to train the machine learning model... INFO:smart_importer.predict_postings:Finished training the machine learning model. DEBUG:smart_importer.predict_postings:About to call the importer's extract function in order to receive entries to be imported... Traceback (most recent call last): File "/Users/julien/Documents/4_Code/blais-beancount-9c9de54513ab/bin/bean-extract", line 4, in <module> from beancount.ingest.extract import main; main() File "/Users/julien/.pyenv/versions/3.6.4/lib/python3.6/site-packages/beancount/ingest/extract.py", line 195, in main args, config, downloads_directories = scripts_utils.parse_arguments(parser) File "/Users/julien/.pyenv/versions/3.6.4/lib/python3.6/site-packages/beancount/ingest/scripts_utils.py", line 56, in parse_arguments mod = runpy.run_path(args.config) File "/Users/julien/.pyenv/versions/3.6.4/lib/python3.6/runpy.py", line 263, in run_path pkg_name=pkg_name, script_name=fname) File "/Users/julien/.pyenv/versions/3.6.4/lib/python3.6/runpy.py", line 96, in _run_module_code mod_name, mod_spec, pkg_name, script_name) File "/Users/julien/.pyenv/versions/3.6.4/lib/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "config.py", line 11, in <module> ing.IngImporter('Assets:Ing:Checking'), File "/Users/julien/Documents/4_Code/smart_importer/smart_importer/predict_postings.py", line 142, in _extract logger.debug(f"Received {len(transactions)} entries by calling the importer's extract function.") TypeError: object of type 'NoneType' has no len()

johannesjh commented 6 years ago

The branch decorator-for-importer-classes had merge conflicts that were difficult to resolve (I could not get the testcases to run successfully). So I branched a fresh, new branch feature/apply-decorator-to-class from master and re-created the necessary changes so that decorators can (and must) now be applied to importer classes, as opposed to extract methods.

I will merge soon to prevent the new branch from ageing again.

johannesjh commented 6 years ago

Cleanup: I will delete the old branch decorator-for-importer-classes since it is no longer needed now.

@tarioch: I hope that is ok for you, since you also committed into the branch? I still have a local copy of the branch in case you need it.