beancount / fava

Fava - web interface for Beancount
https://beancount.github.io/fava/
MIT License

Import: Intelligent suggestions for account names #579

Closed johannesjh closed 6 years ago

johannesjh commented 7 years ago

As discussed in #436 and #503, it would be great to have intelligent suggestions for account names when doing imports. (I hope you don't mind that I am bringing this old issue up again?).

Implementation was originally planned in #503:
intelligent account suggestions for fava's import process were discussed by @aumayr in #503, but the merge request was closed and the feature thus postponed.

@aumayr commented on Feb 20, 2017 in #503: Updated TODO-List:

  • [ ] Support for bean-file
  • [ ] "Intelligent" way to suggest accounts, etc. (like discussed in #436)
  • [x] Support for more types of entries (balance and note seem important)
  • [ ] Tests

@aumayr commented on Apr 6, 2017 in #503: Only four points remain, which can be implemented after the PR is merged:

  • Figure out (and integrate) the duplicate-entry-detection-mechanism by bean-extract. For me, it never recognises an entry as duplicate. If duplicates are recognised, mark them in the Ingest-UI as "duplicate".
  • Add bean-file functionality.
  • The Ingest-UI is slow for >50 entries.
  • Add example code for how to implement an importer and configure Fava to use it. This should also contain a blueprint for a best-practice importer.

Links to related implementations:
Some more infos and links to existing implementations in other tools (that can maybe be reused?) can be found in #436:

@johannesjh commented on Jan 16, 2017 in #436: Are you aware of existing implementations? There even exists one tool written in Python.

in the meantime, I found these additional tools, also see this list: http://plaintextaccounting.org/#data-importconversion

Technical design:
Does the current implementation of the import process save enough data so that a machine learning algorithm can learn from previous imports? Conceptually, this data would consist of mappings from original source data (e.g., a line from CSV file) to output (corresponding beancount transactions).

What would be a suitable place to store this data? I found the recommendation that importers should generate a __source__ metadata field for each transaction. But the data could also be stored in a separate file.

Interaction design, to be discussed:
Given a user started an import process by clicking "extract" next to a file that has been identified (or in the future: uploaded), when the user then looks at the extracted data... Where in the UI should the intelligent account suggestions be displayed? Some ideas:

aumayr commented 7 years ago

@johannesjh Thanks for picking up where I left off, and thanks for all the infos and overview! Greatly appreciate your help!

A few answers:

Does the current implementation of the import process save enough data so that a machine learning algorithm can learn from previous imports?

The current implementation on the Fava side does not save anything. It is up to the implementation of the importers to store such data (up until now), and for example my importers do not store anything at all.

The current implementation of the Fava Import UI does, however, sort the dropdowns for the accounts based on the payee of the transaction, so some intelligence is already implemented:

https://github.com/beancount/fava/blob/ce729050593a94e4f22715b40379b0930002f0c8/fava/core/attributes.py#L55-L62
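
The payee-based sorting described above can be sketched roughly as follows. This is not the code behind the link; it is a minimal illustration that assumes past transactions are available as `(payee, account)` pairs, and `rank_accounts_for_payee` is a hypothetical name.

```python
from collections import Counter


def rank_accounts_for_payee(history, payee, all_accounts):
    """Return all accounts, those most often used with `payee` first.

    `history` is an iterable of (payee, account) pairs taken from
    existing ledger entries (an assumed, simplified input format).
    """
    counts = Counter(account for p, account in history if p == payee)
    # Sort by descending usage count, then alphabetically for stable output.
    return sorted(all_accounts, key=lambda account: (-counts[account], account))


history = [
    ("Supermarket", "Expenses:Groceries"),
    ("Supermarket", "Expenses:Groceries"),
    ("Supermarket", "Expenses:Household"),
    ("Landlord", "Expenses:Rent"),
]
accounts = ["Expenses:Groceries", "Expenses:Household", "Expenses:Rent"]
ranked = rank_accounts_for_payee(history, "Supermarket", accounts)
```

Accounts never used with the payee stay in the list, so the user can still pick any account by hand.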

What would be a suitable place to store this data?

The __source__ metadata field for each transaction sounds reasonable. Currently this field is used for the individual importer to report to the Fava Import UI what to display next to the transaction as the "Source Code/Line" of the transaction, but it is not stored in the beancount file. This could easily be changed, but I think then we would have to introduce another "hidden" metadata field to communicate the "Source Code/Line" of the transaction from the importer to the Fava Import UI to display to the user (as the data to display (eg. a line from a CSV-file, etc.) and the data to store next to the transaction in __source__ (eg. a Hash of a CSV-row, etc.) may not be the same).

For where to store it: I'm all for keeping it all in the beancount file, in __source__ (or similar).
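
To illustrate the split between the stored value and the displayed value, here is a minimal sketch using plain dicts. The key `__source_display__` is a made-up name for the "hidden" field, and hashing the CSV row with SHA-256 is only one possible choice; neither is an existing Fava or Beancount convention.

```python
import hashlib


def build_source_meta(filename, lineno, csv_line):
    """Build a metadata dict for one imported transaction.

    `__source__` holds a stable hash of the raw CSV line (suitable for
    storing in the ledger), while `__source_display__` holds the raw
    line for the import UI. The second key is a hypothetical name.
    """
    digest = hashlib.sha256(csv_line.encode("utf-8")).hexdigest()
    return {
        "filename": filename,
        "lineno": lineno,
        "__source__": digest,
        "__source_display__": csv_line,
    }


meta = build_source_meta("bank.csv", 10, "2017-09-01;Supermarket;-12.50")
```

The import UI could then show `__source_display__` and drop it before writing the entry, keeping only `__source__` in the file.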

Fava could automatically suggest postings that balance all unbalanced transactions generated by the importer.

I think Fava should only expose helpers for the importers to use, so individual importers can decide themselves what to do.

(unless importers are expected to only generate balanced transactions?)

They are not.

The dropdown list for manually selecting an account could be sorted by relevance based on the smart suggestions.

This is already done by Fava (see above) in a "light" way, but could be hinted by the importer, which suggests a list of accounts with according relevance score to the Fava Import UI.

johannesjh commented 7 years ago

great, thank you! some notes in response:

but I think then we would have to introduce another "hidden" metadata field to communicate the "Source Code/Line" of the transaction from the importer to the Fava Import UI to display to the user (as the data to display (eg. a line from a CSV-file, etc.) and the data to store next to the transaction in __source__ (eg. a Hash of a CSV-row, etc.) may not be the same).

I think @blais had a similar idea already, judging from the beancount.core.data.new_metadata function. E.g., the following code would create a metadata dict that includes filename, line number and source string:

from beancount.core import data
meta = data.new_metadata('filename', 10, {'__source__': 'this;is;the;original;csv;line'})

I think Fava should only expose helpers for the importers to use, so individual importers can decide themselves what to do.

To make sure I understand this right... Are you thinking of a control flow like this?

  1. fava calls an importer
  2. the importer reads the source file (e.g., CSV)
  3. the importer (optionally?) uses fava (or beancount?) helpers to come up with ranked suggestions for likely account names
  4. instead of simply returning transactions with hardcoded account names, would the importer return another data structure with lists of suggestions that users may then choose from in fava's UI?

That would provide a lot of flexibility to the importers. But I believe it would come at a significant cost: Importers would have to use a new data structure for communicating lists of suggestions back to fava (e.g., tuples of suggestions and probability values, or simply ranked suggestions). Importers would depend on fava's helpers (they currently only depend on beancount). Importers would have to care about smart suggestions and machine learning (which arguably should not be their goal?). Instead, I think we generally want to keep the importers as simple as possible because users are expected to quickly write their own import scripts for the various bank institutions that they use. So I think another control flow would be easier.

  1. Fava calls an importer
  2. The importer translates the source file (CSV) into beancount transactions. The importer does not implement or otherwise trigger any machine learning, but it can add metadata to assist other components where machine learning is implemented.
  3. Fava or beancount implement the machine learning. Based on this, fava can provide smart editing features in the UI, primarily for users to modify and complete imported data, but in the future possibly also for other smart editing features.

Smart editing of imported data could comprise:

johannesjh commented 7 years ago

Notes on machine learning:
EDIT: I am adding more notes to this post as I keep finding stuff.

Lessons Learned from GnuCash's Bayesian Classifier: I just found this interesting article with lessons learned from the GnuCash project. They added special scripts to keep the training data clean in case accounts are renamed or deleted. See: https://wiki.gnucash.org/wiki/Bayes

Properly framing the problem:
Smart editing of imported data involves multiple challenges that must be framed and approached in their own, different ways. Precisely framing the problems will hopefully help to select proper algorithms and tools.

Choosing the right tools

[Image: sklearn_algorithms]

Examples:

aumayr commented 7 years ago

To make sure I understand this right... Are you thinking of a control flow like this?

Yes, exactly. I think this discussion is vital (and I do not have strong opinions for either solution), because it will determine how useful this becomes.

The "Fava-does-it-all"-approach might lead to many "quick wins" for existing plugins, and hides complexity from the user/developer, but it might not lead to perfect results.

The "Importer-does-it-with-the-help-of-Fava"-approach is more work and headache for the user/developer, but it can adapt to the data structures and information at hand, leading to better results.

I think we should discuss both approaches, to the point of discussing how the "interfaces" (helpers from Fava/interface between Fava and the Importer) look, to get a better feeling which way this should go.

Choosing the right tools

As this might lead to more overhead (with Scikit-learn for example), this should be an optional feature/install IMHO, like the Excel-export feature is right now. If the user wants to use these powerful frameworks, he/she can install the required dependencies and the feature becomes available.
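
One way to treat such a dependency as optional at runtime is to check whether the package can be imported and fall back otherwise. The sketch below uses only the standard library; the backend names are placeholders and do not correspond to anything in Fava.

```python
import importlib.util


def has_optional_dependency(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def suggestion_backend():
    """Pick a (hypothetical) backend name depending on what is installed."""
    if has_optional_dependency("sklearn"):
        return "machine-learning"
    return "payee-frequency"
```

With this pattern the heavy dependency is only needed by users who opt in, and everyone else keeps the existing behaviour.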

johannesjh commented 7 years ago

agreed, we should discuss and weigh both approaches.

One more idea: The "Importer-does-it-with-the-help-of-Fava"-approach opens up another promising strategy: By implementing a smart importer that covers typical CSV import usecases, we could eliminate the need for users to write import scripts altogether. As a result, instead of implementing an importer, users would configure an importer in the GUI, which they would then train during usage. Possible flow of user interactions:

wouldn't that be nice! ;-)

blais commented 7 years ago

Improving the built-in CSV importer is part of the plan. I think it should be part of Beancount itself. It's already configurable.

See notes here: https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/TODO?at=default&fileviewer=file-view-default#TODO-1184 https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/beancount/ingest/importers/csv.py?at=default&fileviewer=file-view-default

On Thu, Sep 14, 2017 at 10:30 AM, johannesjh notifications@github.com wrote:

agreed, we should discuss and weigh both approaches.

One more idea: The "Importer-does-it-with-the-help-of-Fava"-approach opens up another promising strategy: By implementing a smart importer that covers typical CSV import usecases, we could eliminate the need for users to write import scripts altogether. As a result, instead of implementing an importer, users would configure an importer in the GUI, which they would then train during usage. Possible flow of user interactions:

  • In fava's GUI, click to create a new importer.
  • A dialogue opens with configuration options for the new importer (e.g., name of importer, account name, ...). User can save (and later edit) such configurations.
  • Users train the importer by simply starting the import process and by correcting the imported data.

wouldn't that be nice! ;-)


aumayr commented 7 years ago

One more idea

This is the way to go IMHO. An importer should be as "small" as possible, but with the possibility of many callbacks that hook into/tweak the Fava-part of the importer. That way, someone who has a really strange set of CSV-files to deal with, or who needs some logic for how to skip lines, can still handle it (eg. I saw a poorly designed CSV-file with every 25th row being some sort of "sum", suggesting it rendered a paginated view).
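
A callback of this kind could look roughly like the sketch below. The function names and the `(index, row)` callback signature are assumptions made for illustration, not an existing Fava or Beancount interface.

```python
import csv
import io


def read_rows(csv_text, skip_row=None, delimiter=";"):
    """Yield CSV rows, dropping those for which `skip_row` returns True.

    `skip_row` is an optional callback taking (index, row), where index
    is the 1-based position of the row in the file.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    for index, row in enumerate(reader, start=1):
        if skip_row is not None and skip_row(index, row):
            continue
        yield row


def every_25th(index, row):
    """Skip the per-page "sum" rows that appear on every 25th line."""
    return index % 25 == 0


lines = [f"2017-09-01;Payee {i};-1.00" for i in range(1, 51)]
rows = list(read_rows("\n".join(lines), skip_row=every_25th))
```

The generic part stays small, and only the unusual rule for one bank's file lives in the callback.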

I think it should be part of Beancount itself.

Partly: Fava differs from Beancount as it needs more information to display the correct UI, eg. a list of suggested accounts, and maybe even to react to what the user has already entered.

johannesjh commented 6 years ago

I built a prototype together with @heerpa, a friend of mine. Please have a look at the following iPython notebook: https://gist.github.com/johannesjh/956179856957348e4fad48514b9824fc https://nbviewer.jupyter.org/gist/johannesjh/956179856957348e4fad48514b9824fc

The prototype uses scikit-learn to train an SVM classifier with beancount example data. The algorithm learns from multiple properties of a transaction, including narration, payee, day of week, and day of month. Based on previously learned data, the classifier predicts the most likely account name for any new transaction to be imported. It also generates ranked suggestions suitable for populating dropdown lists in the UI.
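
The feature extraction step can be pictured as below. This is not taken from the notebook: `Txn` is a simplified stand-in for a beancount transaction, the feature names are assumptions, and the vectorising and SVM training steps are left out entirely.

```python
import datetime
from collections import namedtuple

# Simplified stand-in for a beancount transaction (assumption).
Txn = namedtuple("Txn", "date payee narration")


def extract_features(txn):
    """Turn one transaction into a flat feature dict for a classifier."""
    return {
        "payee": (txn.payee or "").lower(),
        "narration": (txn.narration or "").lower(),
        "day_of_week": txn.date.weekday(),  # Monday is 0
        "day_of_month": txn.date.day,
    }


features = extract_features(
    Txn(datetime.date(2017, 9, 14), "Supermarket", "Weekly groceries")
)
```

Date-based features like these can help with recurring transactions such as rent or salary, which tend to fall on the same day each month.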

It would be great if you could provide some feedback so we can arrange further steps towards building the smart importer. Thank you!

PS: To run an executable and editable version of the notebook:

virtualenv -p python3 ~/somefolder/virtualenv
source ~/somefolder/virtualenv/bin/activate
pip3 install beancount jupyter scikit-learn scipy numpy ipython
jupyter notebook beancount-machine-learning3.ipynb

johannesjh commented 6 years ago

Hi!

I integrated the prototype from my previous post into beancount, see this repository: https://github.com/johannesjh/smart_importer

Using the machine learning functionality from the prototype (now in machinelearning.py), I built two possible integrations:

  1. An extension of beancount.ingest.importers.csv.Importer, see smart_csv_importer.py
  2. A decorator for an importer's extract function, see predict_postings.py
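
The decorator variant can be sketched as follows. This is not the code from predict_postings.py: transactions are plain dicts rather than beancount directives, and `toy_predict` is a hypothetical rule standing in for a trained model.

```python
import functools


def predict_missing_accounts(predict):
    """Decorator factory for an importer's `extract` function.

    `predict` is any callable mapping a transaction dict to an account
    name; here it stands in for a trained model.
    """
    def decorator(extract):
        @functools.wraps(extract)
        def wrapper(*args, **kwargs):
            transactions = extract(*args, **kwargs)
            for txn in transactions:
                # Only complete transactions that have a single posting.
                if len(txn["postings"]) == 1:
                    txn["postings"].append(predict(txn))
            return transactions
        return wrapper
    return decorator


def toy_predict(txn):
    """Hypothetical rule-based stand-in for a machine learning model."""
    if "market" in txn["payee"].lower():
        return "Expenses:Groceries"
    return "Expenses:Unknown"


@predict_missing_accounts(toy_predict)
def extract(file_name):
    """Toy importer: returns transactions with one posting each."""
    return [
        {"payee": "Supermarket", "postings": ["Assets:Checking"]},
        {"payee": "Landlord", "postings": ["Assets:Checking"]},
    ]


result = extract("bank.csv")
```

Because the wrapper only post-processes the return value, the importer itself needs no knowledge of the prediction step.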

I tried it out with real-world data and it worked just fine for my purposes. But I would love to see the feature upstream in beancount or fava, and I would really need your feedback regarding if and how we should do the integration. Thanks! Johannes

yagebu commented 6 years ago

Hey, sorry for not giving any feedback on this so far, I've been quite busy the last few months. In about ten days I'll have more free time and will give you some feedback. Thanks already for the work on this!

yagebu commented 6 years ago

I think the decorator for the extract function is the way to go forward for this. This would allow all kinds of importers to benefit from this. Account suggestions could then be passed as metadata, which Fava could display in the completion dropdown. I guess something like a __completions__ list on the postings would be good. Would you want to use these completion suggestions only for the imports or for all account completions?

I haven't gotten around to testing it myself as I think it's unsuitable for my usecase: When importing I first "clean" up the payee names after which the current account completions are usually perfect. Would your implementation also work to clean up the account names (without training data, as my file only contains the "cleaned" payee names)?

johannesjh commented 6 years ago

I think the decorator for the extract function is the way to go forward for this. This would allow all kinds of importers to benefit from this.

I also like how the decorator version allows adding the machine learning predictions as a cross-cutting concern. => I removed the SmartCsvImporter class in favor of the decorator.

Account suggestions could then be passed as metadata, which Fava could display in the completion dropdown. I guess something like a __completions__ list on the postings would be good.

thank you, I like the suggestion. I can implement that, just need to take time for it.

I think the decorator should furthermore differentiate between predictions and suggestions being made:

I haven't gotten around to testing it myself as I think it's unsuitable for my usecase: When importing I first "clean" up the payee names after which the current account completions are usually perfect. Would your implementation also work to clean up the account names (without training data, as my file only contains the "cleaned" payee names)?

Can you explain a bit more about your workflow, please:

Note: It is easiest to train a separate machine learning model because only very few scikit-learn algorithms support multiple outputs. In scikit-learn terminology, the second posting's account is one multiclass output, and the payee is another multiclass output, so the algorithm would need to support multiclass-multioutput classification, compare http://scikit-learn.org/stable/modules/multiclass.html

yagebu commented 6 years ago

I think the decorator should furthermore differentiate between predictions and suggestions being made

Would this be something the user specifies when they use the decorator, or does the ML model allow you to make this distinction (I guess it would be a prediction if the model is certain that there's only one possible candidate, right?).

Can you explain a bit more about your workflow, please

My bank's csv file contains the payees as they are listed on my bank statement, which means they are sometimes written in a different way (e.g. all caps or abbreviated) than I want or contain extra useless information that I don't want in my ledger. I go through them by hand in Fava's import screen and just type in the payees that I want, which is quite fast thanks to the autocompletion. Since fava's account autocompletion uses the current payee name to suggest matching accounts, the first suggested account is then usually the correct one.

johannesjh commented 6 years ago

thank you for the explanations regarding your workflow. the implementation currently predicts the account names of missing postings (as indicated by the @PredictPostings decorator class name). I think this could be useful in your scenario as well because the missing postings can be predicted based on messy payees as well, so you would not need to edit/enter the payees by hand in order to get predictions of account names.

Besides, if you would like to automatically predict nice payee names, it would be easy to write an additional @PredictPayees decorator class. In fact, adding more decorators is now much easier because I refactored and cleaned up the code tonight: I can now feed beancount transactions directly into scikit-learn pipelines, this reduced a lot of glue code.

Regarding predictions vs suggestions: In my current implementation, the behavior depends on what the user specifies when they use the decorator.

johannesjh commented 6 years ago

Update: I added a PredictPayees decorator, see https://github.com/johannesjh/smart_importer/blob/master/predict_payees.py

tarioch commented 6 years ago

@johannesjh I really like the work you started. For me this is something that could(should) live outside of fava as it's a really useful thing to have without using fava. Right now I'm still struggling to get it working as it seems to require python 3.6 and I have an issue where fava does not run on 3.6

johannesjh commented 6 years ago

Hi Patrick! I guess your problems with python3.6 are resolved since you have been busy working with the decorator and even submitted two pull requests, right? Thx, Johannes

tarioch commented 6 years ago

Yep, got it working, started now playing around with it :)

johannesjh commented 6 years ago

@aumayr @yagebu: I'd like to bring this issue up again and talk about next steps, how to integrate the machine learning functionality with fava.

Current status: The smart_importer project has worked well in practice, both @tarioch and myself have been using it actively for quite a while now. Importers can be decorated with @PredictPostings or @PredictPayees in order to get predictions (autocompletion of the most likely value) and suggestions (a ranked list of likely values to choose from) for missing second postings and for missing payees.

Integration with fava: The most important topic for integration is suggestions, i.e., fava could populate its dropdown lists with suggested values provided by a smart_importer. To get this to work, we would have to agree how the smart importer should provide suggestions to fava. For example, as metadata fields? The current implementation writes suggestions into metadata fields called __suggested_accounts__ and __suggested_payees__.
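
The difference between a prediction and a list of suggestions, and how the latter could travel in metadata, is sketched below. This is not smart_importer's code: the transaction is a plain dict, and the function name and the `predict`, `suggest` and `top_n` parameters are assumptions made for illustration.

```python
def apply_account_scores(txn, scores, predict=True, suggest=True, top_n=3):
    """Attach a prediction and/or ranked suggestions to a transaction dict.

    `scores` maps account names to relevance scores (higher is better).
    """
    ranked = sorted(scores, key=lambda account: (-scores[account], account))
    if predict and ranked and len(txn["postings"]) == 1:
        # Prediction: fill in the single most likely account.
        txn["postings"].append(ranked[0])
    if suggest:
        # Suggestions: a ranked list for the UI to offer in a dropdown.
        txn["meta"]["__suggested_accounts__"] = ranked[:top_n]
    return txn


txn = {"payee": "Supermarket", "postings": ["Assets:Checking"], "meta": {}}
scores = {
    "Expenses:Groceries": 0.8,
    "Expenses:Household": 0.15,
    "Expenses:Rent": 0.05,
}
apply_account_scores(txn, scores)
```

A consumer such as Fava would only need to read the metadata key, without depending on the package that produced it.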

Integration with beancount: Somewhat off-topic in this post, but @blais: I'd love to hear your opinion on how to proceed further, and also where the smart_importer code should live in the long term. Can we turn it into an official beancount feature, part of beancount's code base?

What do you think?

blais commented 6 years ago

On Tue, Apr 17, 2018 at 4:54 PM, johannesjh notifications@github.com wrote:

@aumayr @yagebu: I'd like to bring this issue up again and talk about next steps, how to integrate the machine learning functionality with fava.

Current status: The smart_importer project has worked well in practice, both @tarioch and myself have been using it actively for quite a while now. Importers can be decorated with @PredictPostings or @PredictPayees in order to get predictions (autocompletion of the most likely value) and suggestions (a ranked list of likely values to choose from) for missing second postings and for missing payees.

Thanks for the pointer, I didn't know about the project. Looks really great! :-) I should start to use it.

Integration with fava: The most important topic for integration is suggestions, i.e., fava could populate its dropdown lists with suggested values provided by a smart_importer. To get this to work, we would have to agree how the smart importer should provide suggestions to fava. For example, as metadata fields? The current implementation writes suggestions into metadata fields called __suggested_accounts__ and __suggested_payees__.

Integration with beancount: Somewhat off-topic in this post, but @blais: I'd love to hear your opinion on how to proceed further, and also where the smart_importer code should live in the long term. Can we turn it into an official beancount feature, part of beancount's code base?

What do you think?

Mainly because of the scikit-learn dependency, I think it's best to continue maintaining it separately for now. (Be assured that at this point the Beancount schema/data structures are really quite unlikely to change, so it won't be difficult to maintain.)

I'd like to make it easier to register in hooks for running these within Beancount, but given the importers are almost always custom, it doesn't seem to be a huge problem to integrate them the way you did.

Do you see any particular reason it should prefer to live within the Beancount codebase? (It looks like really clean and simple integration right now.) Is there anything I could change to Beancount to make this integration easier ? If so, what would that be?

I suppose visibility and a sense that things are "well integrated" might be a reason to move this in. Perhaps a simpler thing that can be done is to move it under the Beancount organization to give it a bit more exposure (github.com/beancount) and a more "official" look.

Thoughts?

aumayr commented 6 years ago

@johannesjh Awesome work!!

Moving it to https://github.com/beancount/smart_importer is no problem. Just tell me if you want to do that, and I'll create the repo for you.

I do think it should live in its own repo & package, but should be integrated with Beancount & Fava as an optional install like the Excel-support in Fava is: https://github.com/beancount/fava/blob/master/setup.py#L44-L54

This way, users that want to benefit from these awesome features and want to install all the dependencies can do a simple pip install beancount[smart_importer] or pip install fava[smart_importer]. By integrating it into the setup.py of Beancount & Fava it becomes a "first-class citizen" of these projects, but stays in its own repo (for clear division of concerns, maintainability, etc.).

As for the integration in the Fava importer (and "Add Transaction"-form in general): @yagebu What do you think about the idea with the __suggested_accounts__ and __suggested_payees__ metadata fields? Or should we extend the data structures to hold that data? Should we provide a hook for asking an external tool (like the smart_importer) about suggestions and implement the current suggestions-mechanism with this hook too?

johannesjh commented 6 years ago

Hi, thank you for your feedback, I am glad you like it, and thank you Martin for the post on the mailinglist! I agree with both of your posts. I guess that leaves us with the following todos:

xentac commented 6 years ago

I second @blais' suggestion for making it work with emacs suggestions. That way I can do a similar thing with vim-beancount and get it through my workflow :)

johannesjh commented 6 years ago

Regarding integration with text editors: I created a ticket for it, see https://github.com/johannesjh/smart_importer/issues/32

aumayr commented 6 years ago

@johannesjh I added you to the beancount org on Github, so you can transfer the repo over.

johannesjh commented 6 years ago

thank you, I moved the repository over to beancount/smart_importer.

aumayr commented 6 years ago

@yagebu Any thoughts on the Fava-integration-part?

yagebu commented 6 years ago

What do you think about the idea with the __suggested_accounts__ and __suggested_payees__ metadata fields? Or should we extend the data structures to hold that data?

Storing it as metadata sounds good.

Should we provide a hook for asking an external tool (like the smart_importer) about suggestions and implement the current suggestions-mechanism with this hook too?

A hook would be fine - for me the current suggestions work really well though. With imports one has to deal with messier data compared to the transaction form. In any case, I think the current mechanism should still form the basis of the suggestions, e.g., in case smart_importer doesn't suggest the right accounts we should still list all accounts.
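
The fallback behaviour described here, with suggested accounts first and all other accounts still listed, could be sketched like this. The function name is hypothetical and the inputs are plain lists of account names.

```python
def merge_suggestions(suggested, all_accounts):
    """Return suggested accounts first, then every remaining account.

    Duplicates and suggestions that are not known accounts are dropped.
    """
    known = set(all_accounts)
    merged = []
    for account in suggested:
        if account in known and account not in merged:
            merged.append(account)
    for account in all_accounts:
        if account not in merged:
            merged.append(account)
    return merged
```

If the external tool returns nothing useful, the result is simply the existing list in its existing order.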

I don't think much would be gained by adding it as an optional dependency - it's just a single package to install anyway.

yagebu commented 6 years ago

I tried using smart_importer recently and it failed in various places so I gave up, might try again in the future ;)

I believe the main point of this issue has been addressed by smart_importer so I'll close the issue, feel free to open a more specific one about integration with Fava's import system

alexiri commented 6 years ago

Would it be possible to remove the __suggested_accounts__ and __suggested_payees__ fields automatically when new entries are imported? They cause invalid token errors and deleting them one by one is a bit of a pain.

How about something like this? https://github.com/alexiri/fava/commit/d0d04f3dff4bedaa0bba1258bf045554d0beb2c5
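
For readers who cannot open the link, the general idea of such a cleanup can be sketched as below. This is not the linked commit; it is a minimal illustration that assumes entry metadata is available as a plain dict, and `strip_suggestions` is a hypothetical name.

```python
SUGGESTION_KEYS = ("__suggested_accounts__", "__suggested_payees__")


def strip_suggestions(meta, keys=SUGGESTION_KEYS):
    """Return a copy of a metadata dict without the suggestion fields."""
    return {key: value for key, value in meta.items() if key not in keys}


meta = {
    "filename": "bank.csv",
    "lineno": 10,
    "__suggested_accounts__": ["Expenses:Groceries", "Expenses:Household"],
    "__suggested_payees__": ["Supermarket"],
}
cleaned = strip_suggestions(meta)
```

Returning a copy leaves the original dict untouched, so the import UI can still read the suggestions before the entry is written to the file.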

johannesjh commented 6 years ago

@alexiri As a quick fix, it is possible to turn suggestions off by setting the suggest_accounts and suggest_payees arguments to False. I am also considering turning this off by default, see https://github.com/beancount/smart_importer/issues/50

In the long run, I would like to see a feature where fava populates the list of suggested account names based on suggestions within the importer's output. As @yagebu suggested, this should happen in a new ticket.

EDIT: I opened ticket #801 to follow up on this topic.