Character ngram option for TfIdfVectorizer

dafajon commented 3 years ago

Using character n-grams with a TfIdf vectorizer has yielded improvements in some models.
SadedeGel's TfIdf vectorizer should have an analyzer='char' option similar to sklearn's.
Whether it needs idf is open to discussion. I will report how it affects my results.
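
To make the idf question concrete, here is a minimal pure-Python sketch of character n-gram weighting with idf as a switch. The names `char_ngrams`, `char_tfidf`, and `use_idf` are hypothetical and are not part of SadedeGel or sklearn. The smoothed idf formula is one common variant, assumed here only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max (hypothetical helper)."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def char_tfidf(docs, n_min=2, n_max=3, use_idf=True):
    """Weight each document's character n-grams by tf, optionally times idf."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    rows = []
    for c in counts:
        row = {}
        for gram, tf in c.items():
            # Smoothed idf (an assumption); without idf the weight is the raw count.
            idf = math.log((1 + n_docs) / (1 + df[gram])) + 1 if use_idf else 1.0
            row[gram] = tf * idf
        rows.append(row)
    return rows
```

With `use_idf=False` the weights are plain n-gram counts, so the two settings can be compared on the same model.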

dafajon commented 3 years ago

I have an idea influenced by the hashing vectorizer. Prefixes are n-grams of the tokens. Hashing them also removes the dependency on a vocabulary if idf is not used, and given a large enough feature space it is assumed that all unique tokens get unique vectors. My first implementation of n-grams will be based on HashingVectorizer: after extracting the n-grams in a document, I will hash them. The name will be CharHashVectorizer.
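
A rough sketch of the idea, assuming whitespace tokenization and `zlib.crc32` as the hash function. Both are placeholders, and `token_ngrams` and `hash_char_ngrams` are hypothetical names, not the planned CharHashVectorizer API. Collisions are still possible, so uniqueness only holds approximately when `n_features` is large.

```python
import zlib


def token_ngrams(token, n_min=2, n_max=3):
    """Character n-grams inside a single token (hypothetical helper)."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def hash_char_ngrams(text, n_features=2 ** 18, n_min=2, n_max=3):
    """Map a document to sparse {bucket index: count} without a vocabulary."""
    vec = {}
    for token in text.split():  # whitespace tokenization is an assumption
        for gram in token_ngrams(token, n_min, n_max):
            # crc32 is stable across runs, unlike Python's built-in str hash.
            idx = zlib.crc32(gram.encode("utf-8")) % n_features
            vec[idx] = vec.get(idx, 0) + 1
    return vec
```

Because no vocabulary is fitted, the same function can be applied to unseen documents directly. The cost is that bucket indices cannot be mapped back to n-grams.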

husnusensoy commented 3 years ago

I also have lots of ideas :) Can you show that it improves any of the existing models with statistical significance?

GlobalMaksimum / sadedegel

Character ngram option for TfIdfVectorizer #251