Open dafajon opened 3 years ago
I have an idea influenced by the hashing vectorizer. Prefixes are n-grams within tokens. Hashing them also removes the dependency on a vocabulary (as long as idf is not used), and given a large enough feature space it can be assumed that all unique tokens map to unique vectors. My first implementation on n-grams will be based on HashingVectorizer: after extracting the n-grams in a document, I will hash them. CharHashVectorizer will be the name.
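A minimal sketch of the proposed idea, assuming the implementation delegates n-gram extraction and hashing to sklearn's HashingVectorizer (the CharHashVectorizer name is from the proposal, not an existing class):

```python
# Hashed character n-grams: no vocabulary is stored, and with a large
# enough feature space, hash collisions between distinct n-grams are rare.
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    analyzer="char",      # extract character n-grams
    ngram_range=(2, 4),   # 2- to 4-character n-grams (prefixes included)
    n_features=2 ** 18,   # large hashed feature space -> few collisions
)

docs = ["merhaba dünya", "merhaba sadedegel"]
X = vectorizer.transform(docs)  # no fit needed: hashing is stateless
print(X.shape)  # (2, 262144)
```

Because hashing is stateless, the same token always maps to the same vector across documents, which is exactly the vocabulary independence described above.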
I also have lots of ideas :) Can you show that it improves any of the existing models with statistical significance?
Hüsnü Şensoy / VLDB Expert @.***
Used the analyzer='char' option, similar to sklearn's idf. I will report how it affects my results.