Closed: lmcinnes closed this pull request 3 years ago.
Merging #72 (c57e097) into master (d4810ee) will increase coverage by 0.08%. The diff coverage is 93.68%.
```diff
@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   89.90%   89.98%   +0.08%
==========================================
  Files          19       19
  Lines        3498     3576      +78
  Branches      658      667       +9
==========================================
+ Hits         3145     3218      +73
- Misses        298      303       +5
  Partials       55       55
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| vectorizers/token_cooccurrence_vectorizer.py | 88.70% <85.18%> (-0.30%) | :arrow_down: |
| vectorizers/transformers.py | 93.86% <95.91%> (+0.38%) | :arrow_up: |
| vectorizers/tests/test_transformers.py | 100.00% <100.00%> (ø) | |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

Powered by Codecov. Last update d4810ee...c57e097.
For reference, here is how it compares with the straight SVD version (and TFIDF). The "+" versions use, effectively, this transformer. Also note that a large S-BERT model was used for this run, so S-BERT does considerably better here.

[image: comparison plot] https://user-images.githubusercontent.com/11962885/127713401-e6a872df-18b3-4ede-a162-b3a4fa3b0c44.png
Wow, infoWeight+ beats supervised infoWeight. That's impressive.
Nice job, Leland.
It turns out that the data-prep tricks we apply before and after SVD for count-based data are generically useful. I tried applying them to an info-weighted bag-of-words representation of 20-newsgroups instead of just a straight SVD and ...
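To make the "before and after" idea concrete, here is a minimal sketch of the general pattern. This is not the PR's actual code: the Hellinger-style row normalization and the singular-value rescaling are illustrative choices of pre- and post-SVD tricks, and the count matrix is synthetic.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Synthetic stand-in for a bag-of-words count matrix.
rng = np.random.default_rng(42)
counts = sp.csr_matrix(rng.poisson(0.3, size=(1000, 500)).astype(np.float64))

# "Before" trick (illustrative): a Hellinger-style transform --
# L1-normalize rows so documents of different lengths are comparable,
# then take square roots to damp heavy-tailed counts.
X = normalize(counts, norm="l1").sqrt()

# The SVD itself.
svd = TruncatedSVD(n_components=50, random_state=42)
embedding = svd.fit_transform(X)

# "After" trick (illustrative): divide out the singular values so no
# single component dominates downstream distance computations.
embedding = embedding / svd.singular_values_
```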
I decided this is going to be too generically useful not to turn into a standard transformer. In due course we can potentially use this as part of a pipeline for word vectors instead of the `reduce_dimension` method we have now.
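As a hedged sketch of what such a pipeline might look like: TokenCooccurrenceVectorizer is the library's actual class (it appears in the impacted files above), but since this thread doesn't name the new transformer's class, a plain TruncatedSVD stands in where it would go, and the toy token sequences are made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from vectorizers import TokenCooccurrenceVectorizer

# Toy corpus: a list of token sequences (defaults may need tuning on
# data this small; this only shows the shape of the pipeline).
token_sequences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

word_vector_pipeline = make_pipeline(
    TokenCooccurrenceVectorizer(),  # rows are tokens, columns are cooccurrence counts
    TruncatedSVD(n_components=2),   # swap in the new transformer here
)
word_vectors = word_vector_pipeline.fit_transform(token_sequences)
```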