character-ngrams Search Results

rapidsai/cudf #14684

[BUG] `str.character_ngrams` produces <NA> with strings < ng…

**Describe the bug** The `str.character_ngrams` function produces the token `<NA>` for strings that are shorter than the provided `n` (shown in image for the case of bigrams). ![result output](https://githu…

Vortexx2 updated 8 months ago

massimoaria/bibliometrix #479

termExtraction function: Misleading documentation / bug on "…

Hi, Not sure if this is intended behaviour or not. If it IS intended, I think the documentation is misleading. The termExtraction function has a "remove.terms" argument with the following descr…

kdmaclean updated 2 months ago

scikit-learn/scikit-learn #7475

Why normalize whitespaces for CountVectorizer(analyzer='char…

I am curious about the rationale for replacing consecutive whitespaces with just a single space character for [`CountVectorizer(analyzer='char')`](https://github.com/scikit-learn/scikit-learn/blob/51a765a/…

yxtay updated 2 years ago

newtfire/introDH-Hub #105

Mystery Text Discussion: pcc.txt

Post your screenshots and discuss your findings about pcc.txt here!

ebeshero updated 3 weeks ago

ParticularMiner/red_string_grouper #4

Question / suggestion to use multiple n-grams to get more fe…

Hi @ParticularMiner, Hope you are doing good. I got to work on the same project again and have a question / suggestion - would it be possible to use multiple n-grams to get more features? Like …

iibarant updated 3 years ago

rapidsai/cudf #13048

[FEA] Story - Improve performance with long strings

Many [strings APIs in libcudf](https://docs.rapids.ai/api/libcudf/stable/group__strings__apis.html) use thread-per-string parallelism in their implementation. This approach works great for processing …

GregoryKimball updated 7 months ago

scikit-learn/scikit-learn #22196

Error in TFIDF vectorizer in "char_wb" analyzer

### Discussed in https://github.com/scikit-learn/scikit-learn/discussions/22195 Originally posted by **Pruthwik** January 12, 2022 For Whitespace sensitive char-n-gram tokenization, TFIDF vect…

Pruthwik updated 2 years ago

halfdan/tatoeba2 #5

Investigate reimplementation of nihongoparserd / suggestd in…

The current setup is rather complicated. - tatodetect requires an obscure cppcms and doesn't build in the latest version. The ngrams.db generator is written in Python and rarely updated. - nihong…

halfdan updated 7 years ago

GlobalMaksimum/sadedegel #251

Character ngram option for TfIdfVectorizer

- Using character ngrams for the TfIdf vectorizer has yielded improvement in some models. - SadedeGel TfIdf vectorizer should have `analyzer='char'` option similar to `sklearn`'s. - It is open to disc…

dafajon updated 3 years ago

allenai/bff #3

Ngram instead of paragraph removal?

Hi @dirkgr! Here is a feature that would be very much desirable for decontamination, but I'm not sure how difficult it would be to implement into BFF: The essential part of the feature would be to …

IanMagnusson updated 1 year ago

425 results for character-ngrams
