TutteInstitute / vectorizers

Vectorizers for a range of different data types
BSD 3-Clause "New" or "Revised" License

Add Compression Vectorizers #87

Closed lmcinnes closed 2 years ago

lmcinnes commented 2 years ago

Add Lempel-Ziv and Byte Pair Encoding based vectorizers, allowing for vectorization of non-tokenized strings.
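For readers unfamiliar with the idea, here is a minimal sketch of how a Byte Pair Encoding based vectorizer could turn raw, non-tokenized strings into count vectors. It is illustrative only and is not the code in this PR: the function names (`learn_bpe_merges`, `apply_merge`, `bpe_encode`, `bpe_count_vectors`) are hypothetical, it works on characters rather than bytes, and it returns dense Python lists where a real vectorizer would presumably use sparse matrices.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged token."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe_merges(strings, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    sequences = [list(s) for s in strings]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, best_count = pair_counts.most_common(1)[0]
        if best_count < 2:
            break  # no pair repeats, so further merges add nothing
        merges.append(best_pair)
        sequences = [apply_merge(seq, best_pair) for seq in sequences]
    return merges


def bpe_encode(s, merges):
    """Tokenize a raw string by applying the learned merges in order."""
    seq = list(s)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def bpe_count_vectors(strings, merges):
    """Return (vocabulary, count vectors) for the BPE tokens of each string."""
    encoded = [bpe_encode(s, merges) for s in strings]
    vocabulary = sorted({tok for seq in encoded for tok in seq})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    vectors = []
    for seq in encoded:
        row = [0] * len(vocabulary)
        for tok in seq:
            row[index[tok]] += 1
        vectors.append(row)
    return vocabulary, vectors
```

The point of the sketch is that the learned merges stand in for a tokenizer: frequently repeated substrings become single tokens, so strings with no natural word boundaries can still be counted into a bag-of-tokens representation. A Lempel-Ziv based variant would follow the same pattern, with the token dictionary built by LZ-style phrase parsing instead of pair merging.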

Also includes a basic outline of a distribution vectorizer, but this may require something more powerful than pomegranate.

cjweir commented 2 years ago

Do we want to include the data frame distribution vectorizer in this PR? I got the impression that it didn't work properly with pomegranate.

lmcinnes commented 2 years ago

The framework is useful, but transitioning to something other than pomegranate will be required. Having the skeleton in place for now probably doesn't hurt us.

cjweir commented 2 years ago

Cool. I have no problem adding it - just thought I'd ask.