Closed jc-healy closed 1 year ago
Merging #110 (b251b38) into master (51af4a7) will increase coverage by
0.08%
. The diff coverage is100.00%
.
:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more
@@ Coverage Diff @@
## master #110 +/- ##
==========================================
+ Coverage 91.09% 91.17% +0.08%
==========================================
Files 34 34
Lines 4774 4829 +55
==========================================
+ Hits 4349 4403 +54
- Misses 425 426 +1
Impacted Files | Coverage Δ | |
---|---|---|
vectorizers/_version.py | 100.00% <100.00%> (ø) |
|
vectorizers/ngram_vectorizer.py | 92.94% <100.00%> (+2.03%) |
:arrow_up: |
vectorizers/tests/test_common.py | 99.63% <100.00%> (-0.18%) |
:arrow_down: |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
This PR was coded via pair programming with @hamelin
You should, in fact, be able to to fit this into a dask distributed model that creates a list of dask delayed outputs of fit (run on different partitions on different nodes) and then sums those all together (providing you have a suitable NgramVectorizer as a start value).
Good point Leland. That should be a fun addition to the toolkit and should speed up a bunch of our pipelines which the vectorization is one of the more expensive steps.
I should change the behaviour when trying to add an unfitted model to fitted data to simply return the fitted data instead of throwing a "model not fitted error".
That will make for easier fitting loops as well make the process more robust in the cases where an empty document list might show up mid process.
When adding two NgramVectorizers together we align their column dictionaries, reindex the second _train_matrix, reshape the matrices appropriately and stack them together. This should let us more easily perform NgramVectorization in sequential batch jobs.