beancount / smart_importer

Augment Beancount importers with machine learning functionality.
MIT License

Allow custom getters for attributes #106

Closed by sullivan-sean 1 year ago

sullivan-sean commented 3 years ago

This is a follow-up to https://github.com/beancount/smart_importer/issues/45

To avoid a proliferation of metadata fields, I would like to keep my training metadata in a single field. Instead of having original_narration, original_payee, and category, I would like to have something like:

__train__: "Amazon,Point Of Sale Withdrawal Amazon web service aws.amazon.coWAUS TLR:M03 / DRAWER:803,Shops;Digital Purchase"
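
One caveat with a comma-delimited field is that a narration may itself contain commas. Here is a minimal sketch of one way around that, assuming the standard library csv module is acceptable for encoding and decoding the field. The helper names pack_train and unpack_train are hypothetical and not part of smart_importer:

```python
import csv
import io


def pack_train(payee, narration, category, delim=","):
    """Join the three training values into one delimiter-separated string."""
    buffer = io.StringIO()
    # csv.writer quotes any value that contains the delimiter or a quote.
    csv.writer(buffer, delimiter=delim).writerow([payee, narration, category])
    # Drop the line terminator that writerow appends.
    return buffer.getvalue().rstrip("\r\n")


def unpack_train(value, delim=","):
    """Split a packed string back into its list of values."""
    return next(csv.reader([value], delimiter=delim))


packed = pack_train(
    "Amazon", "Withdrawal, Amazon web service", "Shops;Digital Purchase"
)
fields = unpack_train(packed)
```

A plain str.split on the delimiter would break the narration above into two pieces, while the csv round trip keeps it intact.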

I would then like to define a pipeline attribute getter that parses this combined metadata field, for example:

from sklearn.pipeline import make_pipeline

from smart_importer.pipelines import Getter, StringVectorizer
from smart_importer.predictor import EntryPredictor

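class MyGetter(Getter):
    """Return one delimited value from the combined __train__ metadata field."""

    def __init__(self, idx, delim=','):
        self.delim = delim
        self.idx = idx

    def _getter(self, txn):
        # Raises KeyError for entries that have no __train__ metadata.
        return txn.meta["__train__"].split(self.delim)[self.idx]


def MyPipeline(idx, delim=','):
    """Build a pipeline that vectorizes the idx-th value of __train__."""
    return make_pipeline(MyGetter(idx, delim), StringVectorizer())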

class PredictPayees(EntryPredictor):
    """Predicts payees."""

    attribute = "payee"
    pipeline_getters = {"payee": MyPipeline(0), "narration": MyPipeline(1), "category": MyPipeline(2)}
    weights = {"narration": 0.8, "payee": 0.5, "category": 0.5, "date.day": 0.1}
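
To see what such a getter hands to the vectorizer, here is a minimal standalone sketch. It does not import smart_importer: FakeTxn is a hypothetical stand-in for a Beancount transaction, and TrainFieldGetter only mimics the transform/_getter contract assumed of Getter above. Returning a default for entries without a __train__ key is an added assumption, since newly imported transactions may not carry that field yet:

```python
from collections import namedtuple

# Hypothetical stand-in for a Beancount transaction; only `meta` is used.
FakeTxn = namedtuple("FakeTxn", ["meta"])


class TrainFieldGetter:
    """Extract one delimited value from the combined __train__ metadata."""

    def __init__(self, idx, delim=",", default=""):
        self.idx = idx
        self.delim = delim
        self.default = default

    def _getter(self, txn):
        raw = txn.meta.get("__train__")
        if not raw:
            # Entry has no training metadata at all.
            return self.default
        parts = raw.split(self.delim)
        # Fall back to the default when the field has too few values.
        return parts[self.idx] if self.idx < len(parts) else self.default

    def transform(self, data):
        """Return one extracted value per transaction."""
        return [self._getter(txn) for txn in data]


txns = [
    FakeTxn({"__train__": "Amazon,Point Of Sale Withdrawal,Shops"}),
    FakeTxn({}),  # no training metadata
]
payees = TrainFieldGetter(0).transform(txns)
categories = TrainFieldGetter(2).transform(txns)
```

Each getter instance yields one list of strings, which is the shape a text vectorizer expects as input.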

I don't think this results in much code duplication. The only internal changes needed are to add a pipeline_getters attribute to EntryPredictor and to change its define_pipeline method, i.e. this line:

transformers.append((attribute, get_pipeline(attribute)))

Becomes:

pipeline = self.pipeline_getters.get(attribute, get_pipeline(attribute))
transformers.append((attribute, pipeline))
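
A standalone sketch of that lookup, with stand-ins for the library pieces: get_pipeline here is a dummy that returns a label, and PredictorSketch is hypothetical, not the actual EntryPredictor. It also builds the default pipeline only when no custom one is registered, a small variation on the one-liner above, which calls get_pipeline(attribute) even when its result is discarded:

```python
def get_pipeline(attribute):
    """Dummy stand-in for the library's default pipeline factory."""
    return f"default pipeline for {attribute}"


class PredictorSketch:
    """Hypothetical predictor showing the proposed lookup order."""

    weights = {"narration": 0.8, "payee": 0.5, "date.day": 0.1}
    pipeline_getters = {}  # attribute name -> custom pipeline

    def define_transformers(self):
        transformers = []
        for attribute in self.weights:
            pipeline = self.pipeline_getters.get(attribute)
            if pipeline is None:
                # No custom pipeline registered: build the default one.
                pipeline = get_pipeline(attribute)
            transformers.append((attribute, pipeline))
        return transformers


class CustomPredictor(PredictorSketch):
    # Strings stand in for real pipeline objects in this sketch.
    pipeline_getters = {"payee": "custom payee pipeline"}


transformers = dict(CustomPredictor().define_transformers())
```

Attributes without an entry in pipeline_getters, such as date.day, keep their default pipeline, so existing predictors would behave as before.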

This makes feature extraction much more flexible, since a getter can apply custom logic across multiple fields, while the only internal addition is the pipeline_getters attribute.

johannesjh commented 2 years ago

@tarioch and @yagebu, what do you think? (you've been involved in the past pipeline refactorings)

my opinion: thumbs up, custom getters sound reasonable, why not. Pull Request welcome, thx!

johannesjh commented 1 year ago

long time no hear... shall we close this issue?