TutteInstitute / vectorizers

Vectorizers for a range of different data types
BSD 3-Clause "New" or "Revised" License

EMTokenCooccurrence update #53

Closed cjweir closed 3 years ago

cjweir commented 3 years ago

Updated the EM TCV. A few of the changes affect other code: removing the triangular kernel, renaming the negative binomial kernel to geometric (because that is what it really is), and updating all the kernel params to remove the 'expected window size' that the triangular kernel needed. The kernel functions now also return float64; it's faster and cleaner to work with them and just cast to float32 at the end when building the matrices. TODOs after this: the EM TCV still needs n-grams added, and I need to update the skip-gram vectorizer to use the new and improved processing before I can replace the TCV with the EM TCV and have only one.
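For readers following along, here is a rough sketch of the two ideas above: a negative binomial distribution with `r = 1` reduces to a geometric one, and kernel weights can stay in float64 until the final cast to float32. The names `negative_binomial_pmf` and `geometric_kernel_sketch` and the parameter `p` are hypothetical and made up for illustration; this is not the library's actual kernel code or signature.

```python
from math import comb

import numpy as np


def negative_binomial_pmf(k, r, p):
    # Standard negative binomial pmf: C(k + r - 1, k) * (1 - p)^k * p^r
    return comb(k + r - 1, k) * (1 - p) ** k * p ** r


def geometric_kernel_sketch(window_size, p=0.1):
    # Hypothetical kernel: the weight at window offset k is p * (1 - p)^k,
    # computed and returned as float64.
    k = np.arange(window_size, dtype=np.float64)
    return p * (1.0 - p) ** k


# With r = 1 the negative binomial weights match the geometric ones.
weights = geometric_kernel_sketch(5, p=0.1)
nb_weights = np.array([negative_binomial_pmf(k, 1, 0.1) for k in range(5)])
assert np.allclose(weights, nb_weights)

# Accumulate in float64, then cast once when the matrix row is built.
accumulated = np.zeros(5, dtype=np.float64)
accumulated += weights
matrix_row = accumulated.astype(np.float32)
```

Casting once at the end means intermediate sums are not rounded to float32 at every step, which is one plausible reading of the "cleaner" point above.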