@johannesjh Thanks for picking up where I left off, and thanks for all the info and the overview! Greatly appreciate your help!
A few answers:
Does the current implementation of the import process save enough data so that a machine learning algorithm can learn from previous imports?
The current implementation on the Fava side does not save anything. Up until now, it has been up to the individual importers to store such data; my importers, for example, do not store anything at all.
The current implementation of the Fava Import UI does, however, sort the dropdowns for the accounts based on the payee of the transaction, so some intelligence is already implemented.
What would be a suitable place to store this data?
The __source__ metadata field for each transaction sounds reasonable. Currently this field is used by the individual importer to tell the Fava Import UI what to display next to the transaction as its "Source Code/Line", but it is not stored in the beancount file. This could easily be changed, but I think we would then have to introduce another "hidden" metadata field to communicate the "Source Code/Line" from the importer to the Fava Import UI, because the data to display (e.g., a line from a CSV file) and the data to store next to the transaction in __source__ (e.g., a hash of a CSV row) may not be the same.
As for where to store it: I'm all for keeping it all in the beancount file, in __source__ (or similar).
Fava could automatically suggest postings that balance all unbalanced transactions generated by the importer.
I think Fava should only expose helpers for the importers to use, so individual importers can decide themselves what to do.
(unless importers are expected to only generate balanced transactions?)
They are not.
The dropdown list for manually selecting an account could be sorted by relevance based on the smart suggestions.
This is already done by Fava (see above) in a "light" way, but it could be hinted by the importer, which would suggest a list of accounts with corresponding relevance scores to the Fava Import UI.
Great, thank you! Some notes in response:
but I think we would then have to introduce another "hidden" metadata field to communicate the "Source Code/Line" from the importer to the Fava Import UI, because the data to display (e.g., a line from a CSV file) and the data to store next to the transaction in __source__ (e.g., a hash of a CSV row) may not be the same.
I think @blais had a similar idea already, judging from the beancount.core.data.new_metadata
function. E.g., the following code would create a metadata dict that includes filename, line number and source string:
from beancount.core import data
meta = data.new_metadata('filename', 10, {'__source__': 'this;is;the;original;csv;line'})
# meta == {'filename': 'filename', 'lineno': 10, '__source__': 'this;is;the;original;csv;line'}
I think Fava should only expose helpers for the importers to use, so individual importers can decide themselves what to do.
To make sure I understand this right... Are you thinking of a control flow like this?
That would provide a lot of flexibility to the importers. But I believe it would come at a significant cost:
- Importers would have to use a new data structure for communicating lists of suggestions back to fava (e.g., tuples of suggestions and probability values, or simply ranked suggestions).
- Importers would depend on fava's helpers (they currently only depend on beancount).
- Importers would have to care about smart suggestions and machine learning (which arguably should not be their goal?).
Instead, I think we generally want to keep the importers as simple as possible because users are expected to quickly write their own import scripts for the various bank institutions that they use. So I think another control flow would be easier.
Smart editing of imported data could comprise:
Notes on machine learning:
EDIT: I am adding more notes to this post as I keep finding stuff.
Lessons Learned from GnuCash's Bayesian Classifier: I just found this interesting article with lessons learned from the GnuCash project. They added special scripts to keep the training data clean in case accounts are renamed or deleted. See: https://wiki.gnucash.org/wiki/Bayes
Properly framing the problem:
Smart editing of imported data involves multiple challenges, each of which must be framed and approached in its own way. Precisely framing each problem will hopefully help in selecting suitable algorithms and tools.
Replacing import scripts altogether?: Having to write an import script in Python is arguably a big entry hurdle for new users. So maybe we can replace the import scripts altogether by implementing a smart importer? It would suffice to cover typical CSV imports because other, more exotic use cases can always be implemented by writing Python code. The smart importer would of course have to implement beancount.ingest.ImporterProtocol. Some of the parameters would be configured by the user, ideally in fava's GUI, while other parameters could be learned automatically, e.g., the column layout as modeled by beancount.ingest.importers.csv.Col (https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/beancount/ingest/importers/csv.py?at=default), including whether there are separate columns for credit and debit, or just one column with positive or negative numbers.
Suggesting account names for imported transactions can be framed as a text classification problem: We are dealing with supervised learning because training data exists from previous transactions. The content that we are trying to learn from is textual. We probably have to preprocess the training data in a similar way to the Lessons Learned from GnuCash's Bayesian Classifier, e.g., to exclude closed accounts. The output is a classification into categorical data (i.e., into available account names). A high-level description of the classification approach is given, for example, in this stackoverflow answer:
To solve your problem, here are the steps you should do:
- Create a feature extractor - that given a description of a restaurant, returns the "features" (under the Bag Of Words model explained above) of this restaurant (denoted as example in the literature).
- Manually label a set of examples, each will be labeled with the desired class (Chinese, Belgian, Junk food, ...)
- Feed your labeled examples into a learning algorithm. It will generate a classifier. From personal experience, SVM usually gives the best results, but there are other choices such as Naive Bayes, Neural Networks and Decision Trees (usually C4.5 is used), each has its own advantage.
- When a new (unlabeled) example (restaurant) comes - extract the features and feed it to your classifier - it will tell you what it thinks it is (and usually - what is the probability the classifier is correct).
Evaluation:
Evaluation of your algorithm can be done with cross-validation, or separating a test set out of your labeled examples that will be used only for evaluating how accurate the algorithm is.
Optimizations:
From personal experience - here are some optimizations I found helpful for the feature extraction:
- Stemming and eliminating stop words usually helps a lot.
- Using Bi-Grams tends to improve accuracy (though increases the feature space significantly).
- Some classifiers are prone to large feature space (SVM not included), there are some ways to overcome it, such as decreasing the dimensionality of your features. PCA is one thing that can help you with it.
- Genetic Algorithms are also (empirically) pretty good for subset selection.
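Translated to our case, these steps amount to only a few lines of scikit-learn. A minimal sketch, assuming scikit-learn is installed; the sample data is invented for illustration, and any of the classifiers mentioned above could be substituted for the SVM:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Training data: textual descriptions of previously imported
# transactions, labeled with the accounts that were chosen for them.
descriptions = ["REWE Supermarkt Berlin", "Shell Tankstelle", "REWE Markt GmbH"]
accounts = ["Expenses:Food:Groceries", "Expenses:Car:Fuel", "Expenses:Food:Groceries"]

# Bag-of-words feature extraction feeding into an SVM, as described above.
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
pipeline.fit(descriptions, accounts)

# Classify a newly imported, unlabeled transaction.
print(pipeline.predict(["REWE Supermarkt Muenchen"]))  # likely Expenses:Food:Groceries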
Suggesting payees should be framed as text classification, similar to the suggestion of account names. I.e., existing transactions with payees would be interpreted as labeled training data. Text classification can then suggest a likely payee for each newly imported transaction.
Detecting duplicate transactions: The input data for duplicate detection is not purely textual but also numerical. The problem involves some domain-specific rules, such as: the transaction dates must be close to each other, typically within a few days. Also, duplicates must involve the same amount and currency. I am not sure whether we should frame the problem as supervised or unsupervised learning.
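The rule-based part could look roughly like this (a sketch only; the transactions are simplified dicts standing in for beancount's data structures, and the five-day window is an arbitrary assumption):

def is_possible_duplicate(txn_a, txn_b, max_days=5):
    # The transaction dates must be close to each other,
    # typically within a few days.
    if abs((txn_a["date"] - txn_b["date"]).days) > max_days:
        return False
    # Duplicates must involve the same amount and currency.
    return (txn_a["amount"] == txn_b["amount"]
            and txn_a["currency"] == txn_b["currency"])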
Suggesting linked transactions: A similar problem to duplicate detection. When using transfer accounts, it would be nice to have suggestions for linking corresponding transactions, e.g., as implemented in pull request #522.
Choosing the right tools
TextBlob is focused on text analysis, as its name implies, and is much smaller: 634kB, plus 1.2MB for the underlying NLTK package.
TextBlob is a new python natural language processing toolkit, which stands on the shoulders of giants like NLTK and Pattern, provides text mining, text analysis and text processing modules for python developers.
Dedupe and CsvDedupe seem to be popular and convenient tools for duplicate detection.
A DIY approach (as currently taken in beancount and fava) would also be a viable option, and would avoid depending on large, general-purpose machine learning libs. Examples:
- ExponentialDecayRanker in fava/util/ranking.py
- find_similar_entries in beancount/ingest/similar.py
To make sure I understand this right... Are you thinking of a control flow like this?
Yes, exactly. I think this discussion is vital (and I do not have strong opinions for either solution), because it will determine how useful this becomes.
The "Fava-does-it-all"-approach might lead to many "quick wins" for existing plugins, and hides complexity from the user/developer, but it might not lead to perfect results.
The "Importer-does-it-with-the-help-of-Fava"-approach is more work and headache for the user/developer, but it can adapt to the data structures and information at hand, leading to better results.
I think we should discuss both approaches, to the point of discussing what the "interfaces" (helpers from Fava / the interface between Fava and the Importer) would look like, to get a better feeling for which way this should go.
Choosing the right tools
As this might add more overhead (with scikit-learn, for example), this should be an optional feature/install IMHO, like the Excel-export feature is right now. If the user wants to use these powerful frameworks, he/she can install the required dependencies and the feature becomes available.
agreed, we should discuss and weigh both approaches.
One more idea: The "Importer-does-it-with-the-help-of-Fava"-approach opens up another promising strategy: By implementing a smart importer that covers typical CSV import use cases, we could eliminate the need for users to write import scripts altogether. As a result, instead of implementing an importer, users would configure an importer in the GUI, which they would then train during usage. Possible flow of user interactions:
- In fava's GUI, click to create a new importer.
- A dialogue opens with configuration options for the new importer (e.g., name of importer, account name, ...). Users can save (and later edit) such configurations.
- Users train the importer by simply starting the import process and by correcting the imported data.
wouldn't that be nice! ;-)
Improving the built-in CSV importer is part of the plan. I think it should be part of Beancount itself. It's already configurable.
See notes here: https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/TODO?at=default&fileviewer=file-view-default#TODO-1184 https://bitbucket.org/blais/beancount/src/621cec5ed38bcd128a3502a3b5c367f283deffe2/beancount/ingest/importers/csv.py?at=default&fileviewer=file-view-default
One more idea
This is the way to go IMHO. An importer should be as "small" as possible, but with the possibility of many callbacks that hook into/tweak the Fava-part of the importer. That way, if someone has a really strange set of CSV files to deal with, or needs some logic for how to skip lines (e.g., I saw a poorly designed CSV file with every 25th row being some sort of "sum", suggesting it rendered a paginated view), they can still handle it.
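For example, such a hook for the skip-lines case could be as small as the following (a hypothetical callback signature, just to illustrate the idea):

def skip_row(index, row):
    # Skip the "sum" rows that appear as every 25th line
    # of the poorly designed, paginated CSV export.
    return index % 25 == 24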
I think it should be part of Beancount itself.
Partly: Fava differs from Beancount in that it needs more information to display the correct UI, e.g., a list of suggested accounts, and it may even need to react to what the user has already entered.
I built a prototype together with @heerpa, a friend of mine. Please have a look at the following iPython notebook: https://gist.github.com/johannesjh/956179856957348e4fad48514b9824fc https://nbviewer.jupyter.org/gist/johannesjh/956179856957348e4fad48514b9824fc
The prototype uses scikit-learn to train an SVM classifier with beancount example data. The algorithm learns from multiple properties of a transaction, including narration, payee, day of week, and day of month. Based on previously learned data, the classifier predicts the most likely account name for any new transaction to be imported. It also generates ranked suggestions suitable for populating dropdown lists in the UI.
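To give a rough idea of how learning from multiple transaction properties can be wired up: a sketch using scikit-learn's FeatureUnion (the attribute names follow beancount's Transaction tuple; everything else is made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

def narrations(txns):
    return [txn.narration or "" for txn in txns]

def payees(txns):
    return [txn.payee or "" for txn in txns]

def day_features(txns):
    # Day of week and day of month as numeric features.
    return np.array([[txn.date.weekday(), txn.date.day] for txn in txns])

features = FeatureUnion([
    ("narration", make_pipeline(FunctionTransformer(narrations, validate=False), CountVectorizer())),
    ("payee", make_pipeline(FunctionTransformer(payees, validate=False), CountVectorizer())),
    ("days", FunctionTransformer(day_features, validate=False)),
])
classifier = make_pipeline(features, LinearSVC())
# classifier.fit(training_transactions, account_names)
# classifier.decision_function(...) yields per-class scores for ranked suggestions.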
It would be great if you could provide some feedback so we can arrange further steps towards building the smart importer. Thank you!
PS: To run an executable and editable version of the notebook:
virtualenv -p python3 ~/somefolder/virtualenv
source ~/somefolder/virtualenv/bin/activate
pip3 install beancount jupyter scikit-learn scipy numpy ipython
jupyter notebook beancount-machine-learning3.ipynb
Hi!
I integrated the prototype from my previous post into beancount, see this repository: https://github.com/johannesjh/smart_importer
Using the machine learning functionality from the prototype (now in machinelearning.py), I built two possible integrations:
- a smart CSV importer based on beancount.ingest.importers.csv.Importer, see smart_csv_importer.py
- a decorator for the extract function, see predict_postings.py
I tried it out with real-world data and it worked just fine for my purposes. But I would love to see the feature upstream in beancount or fava, and I would really need your feedback on whether and how to do the integration. Thanks! Johannes
Hey, sorry for not giving any feedback on this so far, I've been quite busy the last few months. In about ten days I'll have more free time and will give you some feedback. Thanks already for the work on this!
I think the decorator for the extract function is the way to go forward for this. This would allow all kinds of importers to benefit from this. Account suggestions could then be passed as metadata, which Fava could display in the completion dropdown. I guess something like a __completions__ list on the postings would be good. Would you want to use these completion suggestions only for the imports or for all account completions?
I haven't gotten around to testing it myself as I think it's unsuitable for my use case: When importing, I first "clean up" the payee names, after which the current account completions are usually perfect. Would your implementation also work to clean up the account names (without training data, as my file only contains the "cleaned" payee names)?
I think the decorator for the extract function is the way to go forward for this. This would allow all kinds of importers to benefit from this.
I also like how the decorator version allows to add the machine learning predictions as a cross-cutting concern. => I removed the SmartCsvImporter class in favor of the decorator.
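For reference, the decorator idea boils down to something like this (a bare-bones sketch with hypothetical names; the actual smart_importer code is more elaborate):

class PredictPostings:
    """Class decorator that wraps an importer's extract method."""

    def __call__(self, importer_class):
        original_extract = importer_class.extract

        def extract(importer_self, file):
            entries = original_extract(importer_self, file)
            # Cross-cutting concern: enrich the extracted entries
            # with machine-learned predictions before returning them.
            return self.add_predictions(entries)

        importer_class.extract = extract
        return importer_class

    def add_predictions(self, entries):
        # Placeholder for training a model and appending
        # the predicted second posting to each transaction.
        return entries

# Usage:
# @PredictPostings()
# class MyBankImporter(ImporterProtocol):
#     ...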
Account suggestions could then be passed as metadata, which Fava could display in the completion dropdown. I guess something like a __completions__ list on the postings would be good.
thank you, I like the suggestion. I can implement that, just need to take time for it.
I think the decorator should furthermore differentiate between predictions and suggestions being made:
- Predictions, i.e., the single most likely value, could be applied automatically, for example by adding the predicted second posting to a transaction.
- Suggestions, i.e., a ranked list of likely values, could be written to __completions__ metadata, leaving it up to the user to add second postings.
I haven't gotten around to testing it myself as I think it's unsuitable for my use case: When importing, I first "clean up" the payee names, after which the current account completions are usually perfect. Would your implementation also work to clean up the account names (without training data, as my file only contains the "cleaned" payee names)?
Can you explain a bit more about your workflow, please:
Note: It is easiest to train a separate machine learning model for each output because only very few scikit-learn algorithms support multiple outputs. In scikit-learn terminology, the second posting's account is one multiclass output, and the payee is another multiclass output, so a single algorithm would need to support multiclass-multioutput classification; compare http://scikit-learn.org/stable/modules/multiclass.html
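In other words, something like the following sketch (sample data invented; scikit-learn's sklearn.multioutput.MultiOutputClassifier would be the alternative for true multioutput learning):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["REWE SAGT DANKE 123", "SHELL 5678 BERLIN"]  # raw payee strings from the bank
accounts = ["Expenses:Food:Groceries", "Expenses:Car:Fuel"]
payees = ["REWE", "Shell"]

# One independent model per output, trained on the same input features.
account_model = make_pipeline(CountVectorizer(), LinearSVC()).fit(texts, accounts)
payee_model = make_pipeline(CountVectorizer(), LinearSVC()).fit(texts, payees)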
I think the decorator should furthermore differentiate between predictions and suggestions being made
Would this be something the user specifies when they use the decorator, or does the ML model allow you to make this distinction (I guess it would be a prediction if the model is certain that there's only one possible candidate, right?).
Can you explain a bit more about your workflow, please
My bank's csv file contains the payees as they are listed on my bank statement, which means they are sometimes written in a different way (e.g. all caps or abbreviated) than I want or contain extra useless information that I don't want in my ledger. I go through them by hand in Fava's import screen and just type in the payees that I want, which is quite fast thanks to the autocompletion. Since fava's account autocompletion uses the current payee name to suggest matching accounts, the first suggested account is then usually the correct one.
thank you for the explanations regarding your workflow. The implementation currently predicts the account names of missing postings (as indicated by the @PredictPostings decorator class name). I think this could be useful in your scenario as well, because the missing postings can be predicted based on messy payees too, so you would not need to edit/enter the payees by hand in order to get predictions of account names.
Besides, if you would like to automatically predict nice payee names, it would be easy to write an additional @PredictPayees decorator class. In fact, adding more decorators is now much easier because I refactored and cleaned up the code tonight: I can now feed beancount transactions directly into scikit-learn pipelines, which eliminated a lot of glue code.
Regarding predictions vs suggestions: In my current implementation, the behavior depends on what the user specifies when they use the decorator.
Update: I added a PredictPayees decorator, see https://github.com/johannesjh/smart_importer/blob/master/predict_payees.py
@johannesjh I really like the work you started. For me, this is something that could (should) live outside of fava, as it's a really useful thing to have without using fava. Right now I'm still struggling to get it working, as it seems to require Python 3.6 and I have an issue where fava does not run on 3.6.
Hi Patrick! I guess your problems with python3.6 are resolved since you have been busy working with the decorator and even submitted two pull requests, right? Thx, Johannes
Yep, got it working, started now playing around with it :)
@aumayr @yagebu: I'd like to bring this issue up again and talk about next steps, how to integrate the machine learning functionality with fava.
Current status: The smart_importer project has worked well in practice; both @tarioch and myself have been using it actively for quite a while now. Importers can be decorated with @PredictPostings or @PredictPayees in order to get predictions (autocompletion of the most likely value) and suggestions (a ranked list of likely values to choose from) for missing second postings and for missing payees.
Integration with fava: The most important topic for integration is suggestions, i.e., fava could populate its dropdown lists with suggested values provided by a smart_importer. To get this to work, we would have to agree on how the smart importer should provide suggestions to fava. For example, as metadata fields? The current implementation writes suggestions into metadata fields called __suggested_accounts__ and __suggested_payees__.
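On fava's side, consuming these fields could be as simple as the following sketch (the helper name and signature are hypothetical):

def completion_options(transaction, all_accounts):
    # Show the machine-learned suggestions first, then fall
    # back to the full list of accounts.
    suggested = transaction.meta.get("__suggested_accounts__", [])
    rest = [account for account in all_accounts if account not in suggested]
    return suggested + rest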
Integration with beancount: Somewhat off-topic in this post, but @blais: I'd love to hear your opinion on how to proceed further, and also where the smart_importer code should live in the long term. Can we turn it into an official beancount feature, part of beancount's code base?
What do you think?
On Tue, Apr 17, 2018 at 4:54 PM, johannesjh notifications@github.com wrote:
@aumayr https://github.com/aumayr @yagebu https://github.com/yagebu: I'd like to bring this issue up again and talk about next steps, how to integrate the machine learning functionality with fava.
Current status: The smart_importer https://github.com/johannesjh/smart_importer project has worked well in practice, both @tarioch https://github.com/tarioch and myself have been using it actively for quite a while now. Importers can be decorated with @PredictPostings or @PredictPayees in order to get predictions (autocompletion of the most likely value) and suggestions (a ranked list of likely values to choose from) for missing second postings and for missing payees.
Thanks for the pointer, I didn't know about the project. Looks really great! :-) I should start to use it.
Integration with fava: The most important topic for integration is suggestions, i.e., fava could populate its dropdown lists with suggested values provided by a smart_importer. To get this to work, we would have to agree on how the smart importer should provide suggestions to fava. For example, as metadata fields? The current implementation writes suggestions into metadata fields called __suggested_accounts__ and __suggested_payees__.
Integration with beancount: Somewhat off-topic in this post, but @blais https://github.com/blais: I'd love to hear your opinion on how to proceed further, and also where the smart_importer https://github.com/johannesjh/smart_importer code should live in the long term. Can we turn it into an official beancount feature, part of beancount's code base?
What do you think?
Mainly because of the scikit-learn dependency, I think it's best to continue maintaining it separately for now. (Be assured that at this point the Beancount schema/data structures are really quite unlikely to change, so it won't be difficult to maintain.)
I'd like to make it easier to register hooks for running these within Beancount, but given that importers are almost always custom, it doesn't seem to be a huge problem to integrate them the way you did.
Do you see any particular reason it should prefer to live within the Beancount codebase? (It looks like a really clean and simple integration right now.) Is there anything I could change in Beancount to make this integration easier? If so, what would that be?
I suppose visibility and a sense that things are "well integrated" might be a reason to move this in. Perhaps a simpler thing that can be done is to move it under the Beancount organization to give it a bit more exposure (github.com/beancount) and a more "official" look.
Thoughts?
@johannesjh Awesome work!!
Moving it to https://github.com/beancount/smart_importer is no problem. Just tell me if you want to do that, and I'll create the repo for you.
I do think it should live in its own repo & package, but it should be integrated with Beancount & Fava as an optional install, like the Excel-support in Fava is: https://github.com/beancount/fava/blob/master/setup.py#L44-L54
This way, users who want to benefit from these awesome features and want to install all the dependencies can do a simple pip install beancount[smart_importer] or pip install fava[smart_importer]. By integrating it into the setup.py of Beancount & Fava it becomes a "first-class citizen" of these projects, but stays in its own repo (for a clear division of concerns, maintainability, etc.).
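Following the linked Excel-support pattern, the setup.py side would look roughly like this (a sketch; the concrete package names and versions are assumptions):

from setuptools import setup

setup(
    # ... name, version, and the other usual arguments ...
    extras_require={
        # existing: 'excel': [...],
        'smart_importer': ['smart_importer', 'scikit-learn'],
    },
)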
As for the integration in the Fava importer (and the "Add Transaction"-form in general): @yagebu What do you think about the idea with the __suggested_accounts__ and __suggested_payees__ metadata fields? Or should we extend the data structures to hold that data? Should we provide a hook for asking an external tool (like the smart_importer) about suggestions, and implement the current suggestions-mechanism with this hook too?
Hi, thank you for your feedback, I am glad you like it, and thank you Martin for the post on the mailing list! I agree with both of your posts. I guess that leaves us with the following todos:
- Move the repository to the beancount organization on GitHub.
- Integrate with Fava, based on the __suggested_accounts__ and __suggested_payees__ metadata fields.
I second @blais' suggestion for making it work with emacs suggestions. That way I can do a similar thing with vim-beancount and get it through my workflow :)
Regarding integration with text editors: I created a ticket for it, see https://github.com/johannesjh/smart_importer/issues/32
@johannesjh I added you to the beancount org on Github, so you can transfer the repo over.
thank you, I moved the repository over to beancount/smart_importer.
@yagebu Any thoughts on the Fava-integration-part?
What do you think about the idea with the __suggested_accounts__ and __suggested_payees__ metadata fields? Or should we extend the data structures to hold that data?
Storing it as metadata sounds good.
Should we provide a hook for asking an external tool (like the smart_importer) about suggestions and implement the current suggestions-mechanism with this hook too?
A hook would be fine - for me the current suggestions work really well though. With imports one has to deal with messier data compared to the transaction form. In any case, I think the current mechanism should still form the basis of the suggestions, e.g., in case smart_importer doesn't suggest the right accounts we should still list all accounts.
I don't think much would be gained by adding it as an optional dependency - it's just a single package to install anyway.
I tried using smart_importer recently and it failed in various places, so I gave up; might try again in the future ;)
I believe the main point of this issue has been addressed by smart_importer, so I'll close the issue. Feel free to open a more specific one about integration with Fava's import system.
Would it be possible to remove the __suggested_accounts__ and __suggested_payees__ fields automatically when new entries are imported? They cause invalid token errors, and deleting them one by one is a bit of a pain.
How about something like this? https://github.com/alexiri/fava/commit/d0d04f3dff4bedaa0bba1258bf045554d0beb2c5
@alexiri As a quick fix, it is possible to turn suggestions off by setting the suggest_accounts and suggest_payees arguments to False. I am also considering turning this off by default, see https://github.com/beancount/smart_importer/issues/50
In the long run, I would like to see a feature where fava populates the list of suggested account names based on suggestions within the importer's output. As @yagebu suggested, this should happen in a new ticket.
EDIT: I opened ticket #801 to follow up on this topic.
As discussed in #436 and #503, it would be great to have intelligent suggestions for account names when doing imports. (I hope you don't mind that I am bringing this old issue up again.)
Implementation was originally planned in #503:
intelligent account suggestions for fava's import process were discussed by @aumayr in #503, but the merge request was closed and the feature thus postponed.
Links to related implementations:
Some more info and links to existing implementations in other tools (which can maybe be reused?) can be found in #436:
In the meantime, I found these additional tools; also see this list: http://plaintextaccounting.org/#data-importconversion
Technical design:
Does the current implementation of the import process save enough data so that a machine learning algorithm can learn from previous imports? Conceptually, this data would consist of mappings from original source data (e.g., a line from CSV file) to output (corresponding beancount transactions).
What would be a suitable place to store this data? I found the recommendation that importers should generate a __source__ metadata field for each transaction. But the data could also be stored in a separate file.
Interaction design, to be discussed:
Given a user started an import process by clicking "extract" next to a file that has been identified (or in the future: uploaded), when the user then looks at the extracted data... Where in the UI should the intelligent account suggestions be displayed? Some ideas: