clips / wordkit

Featurize words into orthographic and phonological vectors.
GNU General Public License v3.0

Add custom FeatureUnion #2

Closed by stephantul 6 years ago

stephantul commented 6 years ago

We currently use sklearn.pipeline.FeatureUnion to combine different featurizers and corpora. This works great, but we want to replace it with a custom FeatureUnion that adds:

  1. Merging different sources (e.g. merging the frequency table for a word from one data source with perceptual characteristics for the same word from another corpus).
  2. Adding weights to transformers (e.g. assigning a weight of .5 to a phonology transformer to reduce the weight of phonology in any distance calculations).

The first point definitely makes a lot of sense and will be added ASAP, but I'm not sure about the second one.
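To make point 2 concrete, here is a rough sketch of what weighting a transformer would do to a distance calculation. It is plain Python, not wordkit or sklearn code: `weighted_union`, `euclidean`, and the toy feature values are all made up for illustration, and it assumes a weight simply scales every feature a transformer produces before the blocks are concatenated.

```python
import math


def weighted_union(blocks, weights):
    """Concatenate feature blocks, scaling each block by its weight.

    blocks: dict mapping a (hypothetical) transformer name to a feature list.
    weights: dict mapping transformer names to weights; missing names get 1.0.
    """
    combined = []
    for name, vector in blocks.items():
        w = weights.get(name, 1.0)
        combined.extend(w * x for x in vector)
    return combined


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up features: the two words share orthography features
# and differ only in their phonology features.
word_a = {"orthography": [1.0, 0.0], "phonology": [0.0, 2.0]}
word_b = {"orthography": [1.0, 0.0], "phonology": [2.0, 0.0]}

weights = {"phonology": 0.5}

unweighted = euclidean(weighted_union(word_a, {}), weighted_union(word_b, {}))
weighted = euclidean(weighted_union(word_a, weights), weighted_union(word_b, weights))
```

Because the two toy words differ only in phonology, a weight of .5 on phonology halves the distance between them. When other feature blocks also differ, the effect on the total distance is smaller.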

stephantul commented 6 years ago

I've added the CorpusAugmenter class, which addresses point 1 in the post above. It adds information from one corpus to the words of another.

The most useful thing you can currently do with it is add up-to-date frequency norms from SUBTLEX to CELEX. Other useful things include adding phonological information to other corpora using CELEX.
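The kind of merge this enables can be sketched as below. This is not the CorpusAugmenter API: `augment`, the record layout, and all values are hypothetical, and the sketch assumes words are matched on their orthography and that words missing from the source corpus are dropped.

```python
def augment(target, source, fields, key="orthography"):
    """Copy `fields` from `source` records into matching `target` records.

    Records are plain dicts and are matched on `key`. Target records
    without a match in `source` are dropped (an assumption of this sketch).
    """
    lookup = {}
    for record in source:
        # Keep the first source record seen for each key.
        lookup.setdefault(record[key], record)

    augmented = []
    for record in target:
        match = lookup.get(record[key])
        if match is None:
            continue
        merged = dict(record)  # copy, so the target corpus is not mutated
        merged.update({field: match[field] for field in fields})
        augmented.append(merged)
    return augmented


# Made-up records standing in for a CELEX-like and a SUBTLEX-like corpus.
celex_like = [
    {"orthography": "cat", "phonology": "kat", "frequency": 10},
    {"orthography": "dog", "phonology": "dog", "frequency": 7},
    {"orthography": "yak", "phonology": "jak", "frequency": 1},
]
subtlex_like = [
    {"orthography": "cat", "frequency": 66},
    {"orthography": "dog", "frequency": 192},
]

result = augment(celex_like, subtlex_like, fields=["frequency"])
```

Here `result` keeps the phonology of the CELEX-like records but takes its frequency values from the SUBTLEX-like records. Swapping the arguments and passing `fields=["phonology"]` would go the other way and add phonological information to the second corpus.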

Point 2 is no longer relevant, so I'm dropping it.