@johannesjh Thanks for picking up where I left off, and thanks for all the info and the overview! Greatly appreciate your help!
A few answers:
Does the current implementation of the import process save enough data so that a machine learning algorithm can learn from previous imports?
The current implementation on the Fava side does not save anything. Up until now, it has been up to the individual importers to store such data; my importers, for example, do not store anything at all.
The current implementation of the Fava Import UI does, however, sort the dropdowns for the accounts based on the payee of the transaction, so some intelligence is already implemented.
What would be a suitable place to store this data?
The __source__ metadata field for each transaction sounds reasonable. Currently this field is used by the individual importer to tell the Fava Import UI what to display next to the transaction as its "Source Code/Line", but it is not stored in the beancount file. This could easily be changed, but I think we would then have to introduce another "hidden" metadata field to communicate the "Source Code/Line" from the importer to the Fava Import UI, because the data to display (e.g., a line from a CSV file) and the data to store next to the transaction in __source__ (e.g., a hash of a CSV row) may not be the same.
As for where to store it: I'm all for keeping it all in the beancount file, in __source__ (or similar).
Fava could automatically suggest postings that balance all unbalanced transactions generated by the importer.
I think Fava should only expose helpers for the importers to use, so individual importers can decide themselves what to do.
(unless importers are expected to only generate balanced transactions?)
They are not.
The dropdown list for manually selecting an account could be sorted by relevance based on the smart suggestions.
This is already done by Fava (see above) in a "light" way, but it could be hinted by the importer, which would suggest a list of accounts with corresponding relevance scores to the Fava Import UI.
Great, thank you! Some notes in response:
but I think we would then have to introduce another "hidden" metadata field to communicate the "Source Code/Line" from the importer to the Fava Import UI, because the data to display (e.g., a line from a CSV file) and the data to store next to the transaction in __source__ (e.g., a hash of a CSV row) may not be the same.
I think @blais had a similar idea already, judging from the beancount.core.data.new_metadata
function. E.g., the following code would create a metadata dict that includes filename, line number and source string:
from beancount.core import data
meta = data.new_metadata('filename', 10, {'__source__': 'this;is;the;original;csv;line'})
# meta == {'filename': 'filename', 'lineno': 10, '__source__': 'this;is;the;original;csv;line'}
I think Fava should only expose helpers for the importers to use, so individual importers can decide themselves what to do.
To make sure I understand this right... Are you thinking of a control flow like this?
That would provide a lot of flexibility to the importers. But I believe it would come at a significant cost:
- Importers would have to use a new data structure for communicating lists of suggestions back to fava (e.g., tuples of suggestions and probability values, or simply ranked suggestions).
- Importers would depend on fava's helpers (they currently only depend on beancount).
- Importers would have to care about smart suggestions and machine learning (which arguably should not be their goal?).
Instead, I think we generally want to keep the importers as simple as possible because users are expected to quickly write their own import scripts for the various bank institutions that they use. So I think another control flow would be easier.
Smart editing of imported data could comprise:
Notes on machine learning:
EDIT: I am adding more notes to this post as I keep finding stuff.
Lessons Learned from GnuCash's Bayesian Classifier: I just found this interesting article with lessons learned from the GnuCash project. They added special scripts to keep the training data clean in case accounts are renamed or deleted. See: https://wiki.gnucash.org/wiki/Bayes
Properly framing the problem:
Smart editing of imported data involves multiple challenges, each of which must be framed and approached in its own way. Precisely framing each problem will hopefully help in selecting suitable algorithms and tools.
Replacing import scripts altogether?: Having to write an import script in Python is arguably a big entry hurdle for new users. So maybe we can replace the import scripts altogether by implementing a smart importer? It would suffice to cover typical CSV imports because other, more exotic use cases can always be implemented by writing Python code. The smart importer would of course have to implement beancount.ingest.ImporterProtocol. Some of the parameters would be configured by the user, ideally in fava's GUI, while other parameters could be learned automatically, e.g., the column layout as modeled by beancount.ingest.importers.csv.Col (https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/beancount/ingest/importers/csv.py?at=default), including whether there are separate columns for credit and debit, or just one column with positive or negative numbers.
Suggesting account names for imported transactions can be framed as a text classification problem: We are dealing with supervised learning because training data exists from previous transactions. The content that we are trying to learn from is textual. We probably have to preprocess the training data in a similar way to the Lessons Learned from GnuCash's Bayesian Classifier, e.g., to exclude closed accounts. The output is a classification into categorical data (i.e., into available account names). A high-level description of the classification approach is given, for example, in this stackoverflow answer:
To solve your problem, here are the steps you should do:
- Create a feature extractor - that given a description of a restaurant, returns the "features" (under the Bag Of Words model explained above) of this restaurant (denoted as example in the literature).
- Manually label a set of examples, each will be labeled with the desired class (Chinese, Belgian, Junk food, ...)
- Feed your labeled examples into a learning algorithm. It will generate a classifier. From personal experience, SVM usually gives the best results, but there are other choices such as Naive Bayes, Neural Networks and Decision Trees (usually C4.5 is used), each has its own advantage.
- When a new (unlabeled) example (restaurant) comes - extract the features and feed it to your classifier - it will tell you what it thinks it is (and usually - what is the probability the classifier is correct).
Evaluation:
Evaluation of your algorithm can be done with cross-validation, or separating a test set out of your labeled examples that will be used only for evaluating how accurate the algorithm is.
Optimizations:
From personal experience - here are some optimizations I found helpful for the feature extraction:
- Stemming and eliminating stop words usually helps a lot.
- Using Bi-Grams tends to improve accuracy (though increases the feature space significantly).
- Some classifiers are prone to large feature space (SVM not included), there are some ways to overcome it, such as decreasing the dimensionality of your features. PCA is one thing that can help you with it.
- Genetic Algorithms are also (empirically) pretty good for subset selection.
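Translated to our case, these steps amount to only a few lines of scikit-learn. A minimal sketch, assuming scikit-learn is installed; the sample data is invented for illustration, and any of the classifiers mentioned above could be substituted for the SVM:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Training data: textual descriptions of previously imported
# transactions, labeled with the accounts that were chosen for them.
descriptions = ["REWE Supermarkt Berlin", "Shell Tankstelle", "REWE Markt GmbH"]
accounts = ["Expenses:Food:Groceries", "Expenses:Car:Fuel", "Expenses:Food:Groceries"]

# Bag-of-words feature extraction feeding into an SVM, as described above.
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
pipeline.fit(descriptions, accounts)

# Classify a newly imported, unlabeled transaction.
print(pipeline.predict(["REWE Supermarkt Muenchen"]))  # likely Expenses:Food:Groceries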
Suggesting payees should be framed as text classification, similar to the suggestion of account names. I.e., existing transactions with payees would be interpreted as labeled training data. Text classification can then suggest a likely payee for each newly imported transaction.
Detecting duplicate transactions: The input data for duplicate detection is not purely textual but also numerical. The problem involves some domain-specific rules, such as: the transaction dates must be close to each other, typically within a few days. Also, duplicates must involve the same amount and currency. I am not sure whether we should frame the problem as supervised or unsupervised learning.
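The rule-based part could look roughly like this (a sketch only; the transactions are simplified dicts standing in for beancount's data structures, and the five-day window is an arbitrary assumption):

def is_possible_duplicate(txn_a, txn_b, max_days=5):
    # The transaction dates must be close to each other,
    # typically within a few days.
    if abs((txn_a["date"] - txn_b["date"]).days) > max_days:
        return False
    # Duplicates must involve the same amount and currency.
    return (txn_a["amount"] == txn_b["amount"]
            and txn_a["currency"] == txn_b["currency"])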
Suggesting linked transactions: A similar problem to duplicate detection. When using transfer accounts, it would be nice to have suggestions for linking corresponding transactions, e.g., as implemented in pull request #522.
Choosing the right tools
TextBlob is focused on text analysis, as its name implies, and is much smaller: 634kB, plus 1.2MB for the underlying NLTK package.
TextBlob is a new python natural language processing toolkit, which stands on the shoulders of giants like NLTK and Pattern, provides text mining, text analysis and text processing modules for python developers.
Dedupe and CsvDedupe seem to be popular and convenient tools for duplicate detection.
A DIY approach (as currently taken in beancount and fava) would also be a viable option, and would avoid depending on large, general-purpose machine learning libs. Examples:
- ExponentialDecayRanker in fava/util/ranking.py
- find_similar_entries in beancount/ingest/similar.py
To make sure I understand this right... Are you thinking of a control flow like this?
Yes, exactly. I think this discussion is vital (and I do not have strong opinions for either solution), because it will determine how useful this becomes.
The "Fava-does-it-all"-approach might lead to many "quick wins" for existing plugins, and hides complexity from the user/developer, but it might not lead to perfect results.
The "Importer-does-it-with-the-help-of-Fava"-approach is more work and headache for the user/developer, but it can adapt to the data structures and information at hand, leading to better results.
I think we should discuss both approaches, to the point of discussing what the "interfaces" (helpers from Fava / the interface between Fava and the Importer) would look like, to get a better feeling for which way this should go.
Choosing the right tools
As this might add more overhead (with scikit-learn, for example), this should be an optional feature/install IMHO, like the Excel-export feature is right now. If the user wants to use these powerful frameworks, he/she can install the required dependencies and the feature becomes available.
agreed, we should discuss and weigh both approaches.
One more idea: The "Importer-does-it-with-the-help-of-Fava"-approach opens up another promising strategy: By implementing a smart importer that covers typical CSV import use cases, we could eliminate the need for users to write import scripts altogether. As a result, instead of implementing an importer, users would configure an importer in the GUI, which they would then train during usage. Possible flow of user interactions:
- In fava's GUI, click to create a new importer.
- A dialogue opens with configuration options for the new importer (e.g., name of importer, account name, ...). Users can save (and later edit) such configurations.
- Users train the importer by simply starting the import process and by correcting the imported data.
wouldn't that be nice! ;-)
Improving the built-in CSV importer is part of the plan. I think it should be part of Beancount itself. It's already configurable.
See notes here: https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/TODO?at=default&fileviewer=file-view-default#TODO-1184 https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/beancount/ingest/importers/csv.py?at=default&fileviewer=file-view-default
One more idea
This is the way to go IMHO. An importer should be as "small" as possible, but with the possibility of many callbacks that hook into/tweak the Fava-part of the importer. That way, if someone has a really strange set of CSV files to deal with, or needs some logic for how to skip lines (e.g., I saw a poorly designed CSV file with every 25th row being some sort of "sum", suggesting it rendered a paginated view), they can still handle it.
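For example, such a hook for the skip-lines case could be as small as the following (a hypothetical callback signature, just to illustrate the idea):

def skip_row(index, row):
    # Skip the "sum" rows that appear as every 25th line
    # of the poorly designed, paginated CSV export.
    return index % 25 == 24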
I think it should be part of Beancount itself.
Partly: Fava differs from Beancount in that it needs more information to display the correct UI, e.g., a list of suggested accounts, and it may even need to react to what the user has already entered.
I built a prototype together with @heerpa, a friend of mine. Please have a look at the following iPython notebook: https://gist.github.com/johannesjh/956179856957348e4fad48514b9824fc https://nbviewer.jupyter.org/gist/johannesjh/956179856957348e4fad48514b9824fc
The prototype uses scikit-learn to train an SVM classifier with beancount example data. The algorithm learns from multiple properties of a transaction, including narration, payee, day of week, and day of month. Based on previously learned data, the classifier predicts the most likely account name for any new transaction to be imported. It also generates ranked suggestions suitable for populating dropdown lists in the UI.
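To give a rough idea of how learning from multiple transaction properties can be wired up: a sketch using scikit-learn's FeatureUnion (the attribute names follow beancount's Transaction tuple; everything else is made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

def narrations(txns):
    return [txn.narration or "" for txn in txns]

def payees(txns):
    return [txn.payee or "" for txn in txns]

def day_features(txns):
    # Day of week and day of month as numeric features.
    return np.array([[txn.date.weekday(), txn.date.day] for txn in txns])

features = FeatureUnion([
    ("narration", make_pipeline(FunctionTransformer(narrations, validate=False), CountVectorizer())),
    ("payee", make_pipeline(FunctionTransformer(payees, validate=False), CountVectorizer())),
    ("days", FunctionTransformer(day_features, validate=False)),
])
classifier = make_pipeline(features, LinearSVC())
# classifier.fit(training_transactions, account_names)
# classifier.decision_function(...) yields per-class scores for ranked suggestions.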
It would be great if you could provide some feedback so we can arrange further steps towards building the smart importer. Thank you!
PS: To run an executable and editable version of the notebook:
virtualenv -p python3 ~/somefolder/virtualenv
source ~/somefolder/virtualenv/bin/activate
pip3 install beancount jupyter scikit-learn scipy numpy ipython
jupyter notebook beancount-machine-learning3.ipynb
Hi!
I integrated the prototype from my previous post into beancount, see this repository: https://github.com/johannesjh/smart_importer
Using the machine learning functionality from the prototype (now in machinelearning.py), I built two possible integrations:
- a smart CSV importer based on beancount.ingest.importers.csv.Importer, see smart_csv_importer.py
- a decorator for the extract function, see predict_postings.py
I tried it out with real-world data and it worked just fine for my purposes. But I would love to see the feature upstream in beancount or fava, and I would really need your feedback on whether and how to do the integration. Thanks! Johannes
Hey, sorry for not giving any feedback on this so far, I've been quite busy the last few months. In about ten days I'll have more free time and will give you some feedback. Thanks already for the work on this!
I think the decorator for the extract function is the way to go forward for this. This would allow all kinds of importers to benefit from this. Account suggestions could then be passed as metadata, which Fava could display in the completion dropdown. I guess something like a __completions__ list on the postings would be good. Would you want to use these completion suggestions only for the imports or for all account completions?
I haven't gotten around to testing it myself as I think it's unsuitable for my use case: When importing, I first "clean up" the payee names, after which the current account completions are usually perfect. Would your implementation also work to clean up the account names (without training data, as my file only contains the "cleaned" payee names)?
I think the decorator for the extract function is the way to go forward for this. This would allow all kinds of importers to benefit from this.
I also like how the decorator version allows to add the machine learning predictions as a cross-cutting concern. => I removed the SmartCsvImporter class in favor of the decorator.
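For reference, the decorator idea boils down to something like this (a bare-bones sketch with hypothetical names; the actual smart_importer code is more elaborate):

class PredictPostings:
    """Class decorator that wraps an importer's extract method."""

    def __call__(self, importer_class):
        original_extract = importer_class.extract

        def extract(importer_self, file):
            entries = original_extract(importer_self, file)
            # Cross-cutting concern: enrich the extracted entries
            # with machine-learned predictions before returning them.
            return self.add_predictions(entries)

        importer_class.extract = extract
        return importer_class

    def add_predictions(self, entries):
        # Placeholder for training a model and appending
        # the predicted second posting to each transaction.
        return entries

# Usage:
# @PredictPostings()
# class MyBankImporter(ImporterProtocol):
#     ...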
Account suggestions could then be passed as metadata, which Fava could display in the completion dropdown. I guess something like a __completions__ list on the postings would be good.
thank you, I like the suggestion. I can implement that, just need to take time for it.
I think the decorator should furthermore differentiate between predictions and suggestions being made:
- Predictions, i.e., the single most likely value, could be applied automatically, for example by adding the predicted second posting to a transaction.
- Suggestions, i.e., a ranked list of likely values, could be written to __completions__ metadata, leaving it up to the user to add second postings.
I haven't gotten around to testing it myself as I think it's unsuitable for my use case: When importing, I first "clean up" the payee names, after which the current account completions are usually perfect. Would your implementation also work to clean up the account names (without training data, as my file only contains the "cleaned" payee names)?
Can you explain a bit more about your workflow, please:
Note: It is easiest to train a separate machine learning model for each output because only very few scikit-learn algorithms support multiple outputs. In scikit-learn terminology, the second posting's account is one multiclass output, and the payee is another multiclass output, so a single algorithm would need to support multiclass-multioutput classification; compare http://scikit-learn.org/stable/modules/multiclass.html
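In other words, something like the following sketch (sample data invented; scikit-learn's sklearn.multioutput.MultiOutputClassifier would be the alternative for true multioutput learning):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["REWE SAGT DANKE 123", "SHELL 5678 BERLIN"]  # raw payee strings from the bank
accounts = ["Expenses:Food:Groceries", "Expenses:Car:Fuel"]
payees = ["REWE", "Shell"]

# One independent model per output, trained on the same input features.
account_model = make_pipeline(CountVectorizer(), LinearSVC()).fit(texts, accounts)
payee_model = make_pipeline(CountVectorizer(), LinearSVC()).fit(texts, payees)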
I think the decorator should furthermore differentiate between predictions and suggestions being made
Would this be something the user specifies when they use the decorator, or does the ML model allow you to make this distinction (I guess it would be a prediction if the model is certain that there's only one possible candidate, right?).
Can you explain a bit more about your workflow, please
My bank's csv file contains the payees as they are listed on my bank statement, which means they are sometimes written in a different way (e.g. all caps or abbreviated) than I want or contain extra useless information that I don't want in my ledger. I go through them by hand in Fava's import screen and just type in the payees that I want, which is quite fast thanks to the autocompletion. Since fava's account autocompletion uses the current payee name to suggest matching accounts, the first suggested account is then usually the correct one.
thank you for the explanations regarding your workflow. The implementation currently predicts the account names of missing postings (as indicated by the @PredictPostings decorator class name). I think this could be useful in your scenario as well, because the missing postings can be predicted based on messy payees too, so you would not need to edit/enter the payees by hand in order to get predictions of account names.
Besides, if you would like to automatically predict nice payee names, it would be easy to write an additional @PredictPayees decorator class. In fact, adding more decorators is now much easier because I refactored and cleaned up the code tonight: I can now feed beancount transactions directly into scikit-learn pipelines, which eliminated a lot of glue code.
Regarding predictions vs suggestions: In my current implementation, the behavior depends on what the user specifies when they use the decorator.
Update: I added a PredictPayees decorator, see https://github.com/johannesjh/smart_importer/blob/master/predict_payees.py
@johannesjh I really like the work you started. For me, this is something that could (should) live outside of fava, as it's a really useful thing to have without using fava. Right now I'm still struggling to get it working, as it seems to require Python 3.6 and I have an issue where fava does not run on 3.6.
Hi Patrick! I guess your problems with python3.6 are resolved since you have been busy working with the decorator and even submitted two pull requests, right? Thx, Johannes
Yep, got it working, started now playing around with it :)
@aumayr @yagebu: I'd like to bring this issue up again and talk about next steps, how to integrate the machine learning functionality with fava.
Current status: The smart_importer project has worked well in practice; both @tarioch and myself have been using it actively for quite a while now. Importers can be decorated with @PredictPostings or @PredictPayees in order to get predictions (autocompletion of the most likely value) and suggestions (a ranked list of likely values to choose from) for missing second postings and for missing payees.
Integration with fava: The most important topic for integration is suggestions, i.e., fava could populate its dropdown lists with suggested values provided by a smart_importer. To get this to work, we would have to agree on how the smart importer should provide suggestions to fava. For example, as metadata fields? The current implementation writes suggestions into metadata fields called __suggested_accounts__ and __suggested_payees__.
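On fava's side, consuming these fields could be as simple as the following sketch (the helper name and signature are hypothetical):

def completion_options(transaction, all_accounts):
    # Show the machine-learned suggestions first, then fall
    # back to the full list of accounts.
    suggested = transaction.meta.get("__suggested_accounts__", [])
    rest = [account for account in all_accounts if account not in suggested]
    return suggested + rest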
Integration with beancount: Somewhat off-topic in this post, but @blais: I'd love to hear your opinion on how to proceed further, and also where the smart_importer code should live in the long term. Can we turn it into an official beancount feature, part of beancount's code base?
What do you think?
On Tue, Apr 17, 2018 at 4:54 PM, johannesjh notifications@github.com wrote:
@aumayr https://github.com/aumayr @yagebu https://github.com/yagebu: I'd like to bring this issue up again and talk about next steps, how to integrate the machine learning functionality with fava.
Current status: The smart_importer https://github.com/johannesjh/smart_importer project has worked well in practice, both @tarioch https://github.com/tarioch and myself have been using it actively for quite a while now. Importers can be decorated with @PredictPostings or @PredictPayees in order to get predictions (autocompletion of the most likely value) and suggestions (a ranked list of likely values to choose from) for missing second postings and for missing payees.
Thanks for the pointer, I didn't know about the project. Looks really great! :-) I should start to use it.
Integration with fava: The most important topic for integration is suggestions, i.e., fava could populate its dropdown lists with suggested values provided by a smart_importer. To get this to work, we would have to agree on how the smart importer should provide suggestions to fava. For example, as metadata fields? The current implementation writes suggestions into metadata fields called __suggested_accounts__ and __suggested_payees__.
Integration with beancount: Somewhat off-topic in this post, but @blais https://github.com/blais: I'd love to hear your opinion on how to proceed further, and also where the smart_importer https://github.com/johannesjh/smart_importer code should live in the long term. Can we turn it into an official beancount feature, part of beancount's code base?
What do you think?
Mainly because of the scikit-learn dependency, I think it's best to continue maintaining it separately for now. (Be assured that at this point the Beancount schema/data structures are really quite unlikely to change, so it won't be difficult to maintain.)
I'd like to make it easier to register hooks for running these within Beancount, but given that importers are almost always custom, it doesn't seem to be a huge problem to integrate them the way you did.
Do you see any particular reason it should prefer to live within the Beancount codebase? (It looks like a really clean and simple integration right now.) Is there anything I could change in Beancount to make this integration easier? If so, what would that be?
I suppose visibility and a sense that things are "well integrated" might be a reason to move this in. Perhaps a simpler thing that can be done is to move it under the Beancount organization to give it a bit more exposure (github.com/beancount) and a more "official" look.
Thoughts?
@johannesjh Awesome work!!
Moving it to https://github.com/beancount/smart_importer is no problem. Just tell me if you want to do that, and I'll create the repo for you.
I do think it should live in its own repo & package, but it should be integrated with Beancount & Fava as an optional install, like the Excel-support in Fava is: https://github.com/beancount/fava/blob/master/setup.py#L44-L54
This way, users who want to benefit from these awesome features and want to install all the dependencies can do a simple pip install beancount[smart_importer] or pip install fava[smart_importer]. By integrating it into the setup.py of Beancount & Fava it becomes a "first-class citizen" of these projects, but stays in its own repo (for a clear division of concerns, maintainability, etc.).
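Following the linked Excel-support pattern, the setup.py side would look roughly like this (a sketch; the concrete package names and versions are assumptions):

from setuptools import setup

setup(
    # ... name, version, and the other usual arguments ...
    extras_require={
        # existing: 'excel': [...],
        'smart_importer': ['smart_importer', 'scikit-learn'],
    },
)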
As for the integration in the Fava importer (and the "Add Transaction"-form in general): @yagebu What do you think about the idea with the __suggested_accounts__ and __suggested_payees__ metadata fields? Or should we extend the data structures to hold that data? Should we provide a hook for asking an external tool (like the smart_importer) about suggestions, and implement the current suggestions-mechanism with this hook too?
Hi, thank you for your feedback, I am glad you like it, and thank you Martin for the post on the mailing list! I agree with both of your posts. I guess that leaves us with the following todos:
- Move the repository to the beancount organization on GitHub.
- Integrate with Fava, based on the __suggested_accounts__ and __suggested_payees__ metadata fields.
I second @blais' suggestion for making it work with emacs suggestions. That way I can do a similar thing with vim-beancount and get it through my workflow :)
Regarding integration with text editors: I created a ticket for it, see https://github.com/johannesjh/smart_importer/issues/32
@johannesjh I added you to the beancount org on Github, so you can transfer the repo over.
thank you, I moved the repository over to beancount/smart_importer.
@yagebu Any thoughts on the Fava-integration-part?
What do you think about the idea with the __suggested_accounts__ and __suggested_payees__ metadata fields? Or should we extend the data structures to hold that data?
Storing it as metadata sounds good.
Should we provide a hook for asking an external tool (like the smart_importer) about suggestions and implement the current suggestions-mechanism with this hook too?
A hook would be fine - for me the current suggestions work really well though. With imports one has to deal with messier data compared to the transaction form. In any case, I think the current mechanism should still form the basis of the suggestions, e.g., in case smart_importer doesn't suggest the right accounts we should still list all accounts.
I don't think much would be gained by adding it as an optional dependency - it's just a single package to install anyway.
I tried using smart_importer recently and it failed in various places, so I gave up; might try again in the future ;)
I believe the main point of this issue has been addressed by smart_importer, so I'll close the issue. Feel free to open a more specific one about integration with Fava's import system.
Would it be possible to remove the __suggested_accounts__ and __suggested_payees__ fields automatically when new entries are imported? They cause invalid token errors, and deleting them one by one is a bit of a pain.
How about something like this? https://github.com/alexiri/fava/commit/d0d04f3dff4bedaa0bba1258bf045554d0beb2c5
@alexiri As a quick fix, it is possible to turn suggestions off by setting the suggest_accounts and suggest_payees arguments to False. I am also considering turning this off by default, see https://github.com/beancount/smart_importer/issues/50
In the long run, I would like to see a feature where fava populates the list of suggested account names based on suggestions within the importer's output. As @yagebu suggested, this should happen in a new ticket.
EDIT: I opened ticket #801 to follow up on this topic.
As discussed in #436 and #503, it would be great to have intelligent suggestions for account names when doing imports. (I hope you don't mind that I am bringing this old issue up again.)
Implementation was originally planned in #503:
intelligent account suggestions for fava's import process were discussed by @aumayr in #503, but the merge request was closed and the feature thus postponed.
Links to related implementations:
Some more info and links to existing implementations in other tools (which can maybe be reused?) can be found in #436:
In the meantime, I found these additional tools; also see this list: http://plaintextaccounting.org/#data-importconversion
Technical design:
Does the current implementation of the import process save enough data so that a machine learning algorithm can learn from previous imports? Conceptually, this data would consist of mappings from original source data (e.g., a line from CSV file) to output (corresponding beancount transactions).
What would be a suitable place to store this data? I found the recommendation that importers should generate a __source__ metadata field for each transaction. But the data could also be stored in a separate file.
Interaction design, to be discussed:
Given a user started an import process by clicking "extract" next to a file that has been identified (or in the future: uploaded), when the user then looks at the extracted data... Where in the UI should the intelligent account suggestions be displayed? Some ideas: