TutteInstitute / vectorizers

Vectorizers for a range of different data types
BSD 3-Clause "New" or "Revised" License

Count feature compressor #72

Closed lmcinnes closed 3 years ago

lmcinnes commented 3 years ago

It turns out that our data-prep tricks applied before and after SVD for count-based data are generically useful. I tried applying them to an info-weighted bag-of-words on 20-newsgroups instead of just a straight SVD and ...

[image]

I decided this is going to be too generically useful not to turn into a standard transformer. In due course we can potentially use this as part of a pipeline for word vectors instead of the reduce_dimension method we have now.
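The comment doesn't spell out the pre- and post-SVD processing, but the general shape of such a "count feature compressor" can be sketched as follows. This is a hypothetical illustration using scikit-learn, not the actual implementation in vectorizers; the function name and the specific normalization (row-normalize, then square-root, a Hellinger-style transform) are assumptions.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

def compress_count_features(X, n_components=2, random_state=42):
    """Sketch of a count feature compressor: row-normalize counts,
    take square roots (a Hellinger-style transform), then reduce
    with a truncated SVD. Details are illustrative only."""
    X = sp.csr_matrix(X, dtype=np.float64)
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0  # guard against empty rows
    X_norm = sp.diags(1.0 / row_sums) @ X
    X_norm.data = np.sqrt(X_norm.data)  # Hellinger-style sqrt transform
    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    return svd.fit_transform(X_norm)

# Toy sparse count-like matrix: 20 "documents" over 50 "features"
counts = sp.random(20, 50, density=0.2, format="csr", random_state=0) * 10
embedding = compress_count_features(counts, n_components=2)
print(embedding.shape)  # (20, 2)
```

Packaged as a scikit-learn transformer, a step like this could slot into a pipeline after a count-based vectorizer, in place of a bare TruncatedSVD.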

codecov-commenter commented 3 years ago

Codecov Report

Merging #72 (c57e097) into master (d4810ee) will increase coverage by 0.08%. The diff coverage is 93.68%.


@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   89.90%   89.98%   +0.08%     
==========================================
  Files          19       19              
  Lines        3498     3576      +78     
  Branches      658      667       +9     
==========================================
+ Hits         3145     3218      +73     
- Misses        298      303       +5     
  Partials       55       55              
Impacted Files                                  Coverage Δ
vectorizers/token_cooccurrence_vectorizer.py    88.70% <85.18%> (-0.30%) ↓
vectorizers/transformers.py                     93.86% <95.91%> (+0.38%) ↑
vectorizers/tests/test_transformers.py         100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d4810ee...c57e097.

lmcinnes commented 3 years ago

For reference, here is how it compares with the straight SVD version (and TFIDF). The "+" versions use, effectively, this transformer. Also note that a large S-BERT model was used for this run, so S-BERT does a lot better here.

[image]

jc-healy commented 3 years ago

Wow, infoWeight+ beats supervised infoWeight. That's impressive.

Nice job Leland.
