Closed: tbm closed this issue 5 years ago
I've never used machine learning but this post suggests that you should not assign weights manually: https://stackoverflow.com/questions/38034702/how-to-put-more-weight-on-certain-features-in-machine-learning
Removing the weight assignment makes my test case work; that is, `predict_postings` works. `predict_payees` still fails after removing the weights.
With this change, removing all weights also works for payee:

```diff
-        ('payee', Pipeline([  # any existing payee, if one exists
-            ('getPayee', ml.GetPayee()),
-            ('vect', CountVectorizer(ngram_range=(1, 3))),
-        ])),
```
I'm not sure if this has any negative effects, though. It seems a bit odd that removing this makes it work. Then again, how can the payee be used to match the payee? If the payee is already set in `test_data`, it's taken as is anyway (which seems wrong to me).
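For context, here is a minimal sketch of what I understand these weights to be: the `transformer_weights` of a scikit-learn `FeatureUnion`. The data and the field-getter lambdas below are made-up stand-ins for the importer's real `GetNarration`/`GetPayee` transformers:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy (narration, payee) rows; the lambdas stand in for the real
# GetNarration/GetPayee transformers in the importer's pipeline.
rows = [("Weekly groceries", "SuperMart"), ("Dinner out", "Luigi's")]

union = FeatureUnion(
    [
        ("narration", Pipeline([
            ("get", FunctionTransformer(lambda X: [x[0] for x in X])),
            ("vect", CountVectorizer(ngram_range=(1, 3))),
        ])),
        ("payee", Pipeline([
            ("get", FunctionTransformer(lambda X: [x[1] for x in X])),
            ("vect", CountVectorizer(ngram_range=(1, 3))),
        ])),
    ],
    # Hand-picked weights scale each feature block before concatenation;
    # leaving transformer_weights out gives every block an implicit 1.0.
    transformer_weights={"narration": 0.8, "payee": 0.5},
)
features = union.fit_transform(rows)
print(features.shape)  # one row per transaction, columns from both blocks
```

So removing the weights doesn't remove the features themselves; it just stops scaling them relative to each other.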
Anyway, I've no idea what I'm doing. I'm just tapping around in the dark ;)
Removing weights: I think we set the weights based on a subjective evaluation of the relative importance of the input data; we certainly did no formal evaluation. I remember seeing examples online of how to tune the weights. I googled and found these:
...that is probably an approach we could take without too much effort. If the grid search takes a long time, we could persistently cache the optimized weights for a given importer configuration. I think this would certainly be an improvement.
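As a rough sketch of that approach (all names and data here are invented, not taken from the actual importer): `GridSearchCV` can treat the union's `transformer_weights` as just another hyperparameter and pick the best candidate by cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

# Toy training data: (narration, payee) rows and their target accounts.
train = [("Weekly groceries", "SuperMart"),
         ("Dinner with friends", "Luigi's"),
         ("Groceries and snacks", "SuperMart"),
         ("Business lunch", "Luigi's")]
targets = ["Expenses:Food:Groceries", "Expenses:Food:Restaurant",
           "Expenses:Food:Groceries", "Expenses:Food:Restaurant"]

union = FeatureUnion([
    ("narration", Pipeline([
        ("get", FunctionTransformer(lambda X: [x[0] for x in X])),
        ("vect", CountVectorizer(ngram_range=(1, 3))),
    ])),
    ("payee", Pipeline([
        ("get", FunctionTransformer(lambda X: [x[1] for x in X])),
        ("vect", CountVectorizer(ngram_range=(1, 3))),
    ])),
])
clf = Pipeline([("union", union), ("svc", SVC(kernel="linear"))])

# Instead of hand-picking weights, let the grid search evaluate candidates.
grid = GridSearchCV(clf, param_grid={
    "union__transformer_weights": [
        {"narration": 1.0, "payee": 1.0},
        {"narration": 0.8, "payee": 0.5},
        {"narration": 0.5, "payee": 0.8},
    ],
}, cv=2)
grid.fit(train, targets)
print(grid.best_params_["union__transformer_weights"])
```

The winning weights (and with them the fitted `grid.best_estimator_`) are what we could cache per importer configuration.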
Note: I have not considered optimized weights particularly important, though. I previously tried reckon and found that it worked well despite the much simpler Bayesian classifier used in its implementation.
Payee as input for classifying payees: I think this does make sense in situations where the intended output is similar, but not identical, to the training data. For example, an importer's output may include payee fields that carry proper payee information in a messy format. A classifier can be trained on this information to suggest and/or predict cleanly formatted, consistent payees.
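A toy sketch of that idea (all payee strings here are invented): train on messy payee strings as an importer might emit them, and predict the clean payee we would want in the ledger.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Messy payee strings paired with the clean, consistent payee we'd
# want to see in the ledger (all data invented for illustration).
messy = ["AMZN Mktp US*2K4", "AMAZON.COM*HF2",
         "STARBUCKS #1234", "STARBUCKS STORE 98"]
clean = ["Amazon", "Amazon", "Starbucks", "Starbucks"]

model = Pipeline([
    # Character n-grams cope well with the noisy suffixes and IDs.
    ("vect", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", MultinomialNB()),
])
model.fit(messy, clean)
print(model.predict(["AMZN Mktp US*9XY"])[0])
```

So the payee field is useful as *input* precisely when the raw payee and the desired payee differ.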
Failing test case: Can this be due to the very small training data set?
The recent refactoring by @yagebu has made it very easy to tune the weights as you wish. I guess this mostly solves this issue.
@tbm: Could you please try the current version and see if it works for you?
Yeah, adding a test case with an empty narration works now.
I edited `tests/predict_postings_test.py` to add some more test cases. I added the following to test an empty narration:

It predicted `Expenses:Food:Groceries`, even though, looking at the training data, it seems clear to me that the result should be `Expenses:Food:Restaurant`.