Closed stephantul closed 6 years ago
I've added the `CorpusAugmenter` class, which is the solution to option 1 in the post above. It can successfully add information to other corpora.
The most useful thing you can currently do with it is add up-to-date frequency norms from SUBTLEX to CELEX. Other useful things include adding phonological information to other corpora using CELEX.
Point 2. is no longer relevant, so I'm dropping that one.
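For readers who want a feel for what this kind of augmentation involves, here is a minimal sketch in plain Python. It is not the `CorpusAugmenter` implementation or its API: the `augment` function, the `orthography` key, the choice to drop words missing from the source, and the toy records (including the frequency values) are all assumptions made up for illustration.

```python
def augment(target, source, fields, key="orthography"):
    """Copy `fields` from `source` records into matching `target` records.

    Hypothetical helper: records are matched on `key`, and target words
    that have no match in `source` are dropped (an assumption).
    """
    # Index the source corpus by word form; keep the first record per form.
    lookup = {}
    for record in source:
        lookup.setdefault(record[key], record)

    augmented = []
    for record in target:
        match = lookup.get(record[key])
        if match is None:
            continue  # no information to add for this word
        merged = dict(record)  # copy, so the input corpus is left untouched
        for field in fields:
            merged[field] = match[field]
        augmented.append(merged)
    return augmented


# Toy stand-ins for a CELEX-like and a SUBTLEX-like corpus (invented values).
celex_like = [
    {"orthography": "cat", "phonology": "kat"},
    {"orthography": "dog", "phonology": "dog"},
    {"orthography": "yurt", "phonology": "jurt"},
]
subtlex_like = [
    {"orthography": "cat", "frequency": 10.0},
    {"orthography": "dog", "frequency": 20.0},
]

result = augment(celex_like, subtlex_like, fields=["frequency"])
```

In this sketch `result` keeps the phonology of the first corpus and gains the frequency of the second; "yurt" is dropped because the second corpus has no entry for it.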
We currently use the `sklearn.pipeline.FeatureUnion` to combine different featurizers and corpora. This works great! But we want to replace it to add:

1. [...] frequency table for a word from one data source with perceptual characteristics for the same word from another corpus).
2. [...] .5 to a phonology transformer to reduce the weight of phonology in any distance calculations).

The first point definitely makes a lot of sense and will be added ASAP, but I'm not sure about the second one.
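To make the second point concrete, here is a small plain-Python sketch of what weighting one featurizer does to a distance. The function names, the block names, and the toy vectors are hypothetical and not part of this project. As an aside, scikit-learn's `FeatureUnion` already accepts a `transformer_weights` argument that multiplies each transformer's output by a constant, which is the same idea.

```python
def weighted_union(blocks, weights):
    """Concatenate named feature blocks, scaling each block by its weight.

    Hypothetical helper: blocks without an entry in `weights` get weight 1.0.
    """
    vector = []
    for name, features in blocks.items():
        weight = weights.get(name, 1.0)
        vector.extend(weight * value for value in features)
    return vector


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two toy words, each with an orthographic and a phonological block.
word_a = {"orthography": [1.0, 0.0], "phonology": [1.0, 1.0]}
word_b = {"orthography": [0.0, 0.0], "phonology": [0.0, 0.0]}

unweighted = squared_euclidean(
    weighted_union(word_a, {}), weighted_union(word_b, {})
)
weighted = squared_euclidean(
    weighted_union(word_a, {"phonology": 0.5}),
    weighted_union(word_b, {"phonology": 0.5}),
)
```

With a weight of .5, the phonological block contributes a quarter of its original share to the squared distance, while the orthographic contribution is unchanged.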