-
**Describe the bug**
The `str.character_ngrams` function produces token `` for strings that are shorter than the provided `n` (shown in the image for the case of bigrams).
![result output](https://githu…
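
For reference, here is a minimal pure-Python sketch of the behaviour one might expect. It is not the library's implementation, and `char_ngrams` is a hypothetical helper introduced only for illustration. Under this definition, a string shorter than `n` has no n-grams at all, so no token is emitted for it.

```python
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all contiguous character n-grams of `text` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # range() is empty when len(text) < n, so short strings yield no tokens.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("abc", 2))  # ['ab', 'bc']
print(char_ngrams("a", 2))    # []
```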
-
Hi,
Not sure if this is intended behaviour or not. If it IS intended, I think the documentation is misleading.
The termExtraction function has a "remove.terms" argument with the following descr…
-
I am curious about the rationale for replacing consecutive whitespace characters with a single space character for [`CountVectorizer(analyzer='char')`](https://github.com/scikit-learn/scikit-learn/blob/51a765a/…
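
To make the question concrete, here is a small pure-Python sketch of the effect being asked about. It does not call scikit-learn. The regular expression `\s\s+` and the helper name `char_ngrams_ws` are assumptions made for illustration, mimicking a normalisation step that replaces runs of two or more whitespace characters with one space before the n-grams are extracted.

```python
import re
from typing import List


def char_ngrams_ws(text: str, n: int, collapse_whitespace: bool = True) -> List[str]:
    """Character n-grams, optionally collapsing whitespace runs first (hypothetical helper)."""
    if collapse_whitespace:
        # Assumed normalisation: two or more whitespace characters become one space.
        text = re.sub(r"\s\s+", " ", text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams_ws("a  b", 2))                             # ['a ', ' b']
print(char_ngrams_ws("a  b", 2, collapse_whitespace=False))  # ['a ', '  ', ' b']
```

With collapsing enabled, the bigram made of two spaces never appears, so any signal carried by repeated whitespace is lost.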
-
Post your screenshots and discuss your findings about pcc.txt here!
-
Hi @ParticularMiner,
Hope you are doing well.
I got to work on the same project again and have a question / suggestion - would it be possible to use multiple n-grams to get more features? Like …
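
One way to read this suggestion, sketched in plain Python: gather character n-grams for several sizes at once and use the combined list as features. This is only an illustration under that assumption, not the package's API. `multi_ngrams`, `n_min`, and `n_max` are hypothetical names.

```python
from typing import List


def multi_ngrams(text: str, n_min: int, n_max: int) -> List[str]:
    """Collect character n-grams for every n in [n_min, n_max] (hypothetical helper)."""
    if n_min < 1 or n_max < n_min:
        raise ValueError("require 1 <= n_min <= n_max")
    features: List[str] = []
    for n in range(n_min, n_max + 1):
        # Sizes larger than the string contribute nothing.
        features.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return features


print(multi_ngrams("abcd", 2, 3))  # ['ab', 'bc', 'cd', 'abc', 'bcd']
```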
-
Many [strings APIs in libcudf](https://docs.rapids.ai/api/libcudf/stable/group__strings__apis.html) use thread-per-string parallelism in their implementation. This approach works great for processing …
-
### Discussed in https://github.com/scikit-learn/scikit-learn/discussions/22195
Originally posted by **Pruthwik** January 12, 2022
For whitespace-sensitive char n-gram tokenization, TFIDF vect…
-
The current setup is rather complicated.
- tatodetect requires an obscure cppcms and doesn't build in the latest version. The ngrams.db generator is written in Python and rarely updated.
- nihong…
-
- Using character n-grams in the TfIdf vectorizer has yielded improvements in some models.
- SadedeGel's TfIdf vectorizer should have an `analyzer='char'` option similar to `sklearn`'s.
- It is open to disc…
-
Hi @dirkgr! Here is a feature that would be very desirable for decontamination, but I'm not sure how difficult it would be to implement in BFF:
The essential part of the feature would be to …