clips / wordkit

Featurize words into orthographic and phonological vectors.
GNU General Public License v3.0

Rewrite feature extraction from functions to objects #6

Closed stephantul closed 6 years ago

stephantul commented 6 years ago

The feature extractors are currently all plain functions. Turning them into objects would cut down on the boilerplate users have to write, and would leave more room for extension in the future.

Currently, users have to do something like:

all_phonemes = get_characters(data, field='phonology')
features = extract_one_hot_phonemes(all_phonemes)
o = ONCTransformer(features)
X = o.fit_transform(data)

This isn't too bad, but it could be simplified by folding the first two steps into a single call:

features = extract_one_hot_phonemes(data, field='phonology')

But this would mean duplicating the same couple of lines of code in every extraction function.

So, what I propose is:

o = ONCTransformer(OneHotPhonemeExtractor(), field='phonology')
X = o.fit_transform(data)

This folds the step of extracting the relevant phonemes from the data into the transformer itself, and lets us use inheritance for things like type checking and chaining. I think it also puts less of a burden on the user, who no longer has to keep track of the features separately.
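To make the idea concrete, here is a rough, self-contained sketch of how the object-based design could fit together. Everything in it is hypothetical: `BaseExtractor`, `OneHotExtractor` and `ConcatTransformer` are illustrative stand-ins, not the actual classes in wordkit or in the branch, and the data format (a list of dicts keyed by field name) is assumed for the example.

```python
class BaseExtractor:
    """Hypothetical base class: collects the symbols found in one field."""

    def __init__(self, field=None):
        self.field = field

    def extract(self, data):
        # Gather every distinct symbol in the chosen field, in sorted order.
        symbols = sorted({s for item in data for s in item[self.field]})
        return self._featurize(symbols)

    def _featurize(self, symbols):
        raise NotImplementedError


class OneHotExtractor(BaseExtractor):
    """Hypothetical extractor: maps each symbol to a one-hot vector."""

    def _featurize(self, symbols):
        n = len(symbols)
        return {s: [1.0 if i == j else 0.0 for j in range(n)]
                for i, s in enumerate(symbols)}


class ConcatTransformer:
    """Toy stand-in for a transformer: concatenates symbol vectors,
    zero-padding shorter words on the right."""

    def __init__(self, extractor, field):
        # Inheritance gives us a cheap type check on the extractor.
        if not isinstance(extractor, BaseExtractor):
            raise TypeError("extractor must be a BaseExtractor instance")
        extractor.field = field
        self.extractor = extractor
        self.field = field

    def fit(self, data):
        # The transformer asks the extractor for the features itself,
        # so the user never handles them directly.
        self.features_ = self.extractor.extract(data)
        self.dim_ = len(self.features_)
        self.max_len_ = max(len(item[self.field]) for item in data)
        return self

    def transform(self, data):
        X = []
        for item in data:
            row = [0.0] * (self.max_len_ * self.dim_)
            for pos, symbol in enumerate(item[self.field]):
                start = pos * self.dim_
                row[start:start + self.dim_] = self.features_[symbol]
            X.append(row)
        return X

    def fit_transform(self, data):
        return self.fit(data).transform(data)


# Assumed data format: one dict per word, keyed by field name.
data = [{"phonology": ("k", "a", "t")}, {"phonology": ("d", "o", "g")}]
t = ConcatTransformer(OneHotExtractor(), field="phonology")
X = t.fit_transform(data)
```

The point of the sketch is only that the user makes two calls and never touches the feature dictionary; how the real transformers would lay out their vectors is a separate question.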

stephantul commented 6 years ago

A tentative proposal is located in the feature extraction branch.