TutteInstitute / vectorizers

Vectorizers for a range of different data types
BSD 3-Clause "New" or "Revised" License
93 stars 23 forks source link

Overloaded addition for NgramVectorizer for easy batch processing #110

Closed jc-healy closed 1 year ago

jc-healy commented 1 year ago

When adding two NgramVectorizers together we align their column dictionaries, reindex the second _train_matrix, reshape the matrices appropriately and stack them together. This should let us more easily perform NgramVectorization in sequential batch jobs.

codecov-commenter commented 1 year ago

Codecov Report

Merging #110 (b251b38) into master (51af4a7) will increase coverage by 0.08%. The diff coverage is 100.00%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##           master     #110      +/-   ##
==========================================
+ Coverage   91.09%   91.17%   +0.08%     
==========================================
  Files          34       34              
  Lines        4774     4829      +55     
==========================================
+ Hits         4349     4403      +54     
- Misses        425      426       +1     
Impacted Files Coverage Δ
vectorizers/_version.py 100.00% <100.00%> (ø)
vectorizers/ngram_vectorizer.py 92.94% <100.00%> (+2.03%) :arrow_up:
vectorizers/tests/test_common.py 99.63% <100.00%> (-0.18%) :arrow_down:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

jc-healy commented 1 year ago

This PR was coded via pair programming with @hamelin

lmcinnes commented 1 year ago

You should, in fact, be able to to fit this into a dask distributed model that creates a list of dask delayed outputs of fit (run on different partitions on different nodes) and then sums those all together (providing you have a suitable NgramVectorizer as a start value).

jc-healy commented 1 year ago

Good point Leland. That should be a fun addition to the toolkit and should speed up a bunch of our pipelines which the vectorization is one of the more expensive steps.

jc-healy commented 1 year ago

I should change the behaviour when trying to add an unfitted model to fitted data to simply return the fitted data instead of throwing a "model not fitted error".

That will make for easier fitting loops as well make the process more robust in the cases where an empty document list might show up mid process.