Closed lmcinnes closed 2 years ago
Do we want to include the data frame distribution vectorizer in this PR? I got the impression that it didn't work properly with pomegranate.
The framework is useful, but transitioning to something other than pomegranate will be required. Having the skeleton in place for now probably doesn't hurt us.
Cool. I have no problem adding it - just thought I'd ask.
Add Lempel-Ziv and Byte Pair Encoding based vectorizers, allowing for vectorization of non-tokenized strings.
Also includes a basic outline of a distribution vectorizer, but this may require something more powerful than pomegranate.
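For readers unfamiliar with the idea, here is a minimal sketch of how a Lempel-Ziv style parse can turn a raw, non-tokenized string into token counts. It is an illustration only, not the code in this PR: it uses a plain LZ78 parse, and the names `lz78_phrases` and `lz_count_vector` are hypothetical.

```python
from collections import Counter


def lz78_phrases(text):
    """Split text into LZ78 phrases.

    Each phrase is the shortest prefix of the remaining input that has
    not been seen as a phrase before. (Hypothetical helper, not part of
    the PR.)
    """
    seen = set()
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            # New phrase: record it and start a fresh one.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Input ended in the middle of an already-known phrase.
        phrases.append(current)
    return phrases


def lz_count_vector(text):
    """Bag-of-phrases counts for one string (hypothetical helper)."""
    return Counter(lz78_phrases(text))


if __name__ == "__main__":
    print(lz78_phrases("abababab"))
    print(lz_count_vector("abababab"))
```

The phrases play the role that tokens play in an ordinary count vectorizer, so no tokenizer is needed up front. A Byte Pair Encoding variant would instead build its vocabulary by repeatedly merging the most frequent adjacent pair of symbols.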