GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Character ngram option for TfIdfVectorizer #251

Open dafajon opened 3 years ago

dafajon commented 3 years ago

I have an idea influenced by the hashing vectorizer. Prefixes are character n-grams within tokens. Hashing them also removes the dependency on a vocabulary when idf is not used, and given a large enough feature space it can be assumed that all unique tokens map to unique vectors. My first n-gram implementation will be based on HashingVectorizer: after extracting the n-grams in a document, I will hash them. CharHashVectorizer will be the name.
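The vocabulary-free idea above can be sketched in plain Python (function names and parameters here are illustrative, not the actual sadedegel API): extract character n-grams per token, then hash each n-gram directly into a fixed-size index space, as scikit-learn's HashingVectorizer does.

```python
import hashlib
from collections import Counter

N_FEATURES = 2 ** 18  # fixed feature space; larger -> fewer hash collisions


def char_ngrams(token: str, n_min: int = 2, n_max: int = 4):
    """Yield character n-grams (including prefixes) of a single token."""
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            yield token[i:i + n]


def hash_vectorize(doc: str) -> Counter:
    """Map a document to sparse {feature_index: count} via hashed char n-grams.

    No vocabulary is stored: each n-gram is hashed straight into a
    fixed-size index space, so transform-time behaviour does not depend
    on having seen the training corpus.
    """
    counts = Counter()
    for token in doc.split():
        for ngram in char_ngrams(token):
            digest = hashlib.md5(ngram.encode("utf-8")).digest()
            counts[int.from_bytes(digest[:8], "big") % N_FEATURES] += 1
    return counts


vec = hash_vectorize("karakter ngram tabanlı vektörleştirme")
print(len(vec))  # number of distinct hashed n-gram features
```

With scikit-learn this corresponds roughly to `HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2**18)`, which restricts n-grams to word boundaries.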

husnusensoy commented 3 years ago

I also have lots of ideas :) Can you prove that it improves any of the existing models with statistical significance?
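One common way to answer this request would be a paired test over per-fold scores of the two models on the same cross-validation splits. A minimal sketch (the scores below are made-up placeholders, not real benchmark results):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold F1 scores on the SAME 5 CV folds:
# baseline TfIdfVectorizer model vs. the proposed char-ngram hashing model.
baseline = np.array([0.71, 0.69, 0.73, 0.70, 0.72])
candidate = np.array([0.74, 0.72, 0.75, 0.73, 0.74])

# Paired t-test: is the mean per-fold difference significantly
# different from zero? Pairing by fold controls for split difficulty.
t_stat, p_value = stats.ttest_rel(candidate, baseline)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```

A non-parametric alternative such as `scipy.stats.wilcoxon` avoids the normality assumption when the number of folds is small.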

