Closed dafajon closed 2 years ago
Expected performance has been achieved.
Tokenization time (s) (on extended raw dataset, ICU tokenizer): Old: 790, New: 38 (roughly 21x faster)
TF-IDF generation time (s): Old: 1074, New: 176 (roughly 6x faster)
No issues have been found, except that there are now two private variables, `_sents` and `_sentences`, which might be confusing later on. EDIT: It might be preferable to keep it this way for backward compatibility.
Excited! Will be testing and merging ASAP. Hüsnü Şensoy
(Reply to Askar Bozcan's comment of Tue, Oct 26, 2021, 10:59 PM.)
`DocBuilder` `__call__` performs sentence splitting using `sbd`. This slows down tokenization and vectorization of documents that do not require sentence splitting beforehand.

- Move sentence splitting into a `cached_property` method of `Document`. Whenever a `Sentence` or `Span` list is requested through the overloaded methods, access the property.
- `_spans` with `spans` `cached_property` method. Rename `self.spans` to `self._spans` accordingly.
- `_sents` attribute to `_sentences`.
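
The lazy-splitting idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual sadedegel code: the `Document` class here is a hypothetical stand-in, the regex splitter is a placeholder for the `sbd` model, and `split_calls` exists only to show how often splitting runs.

```python
import re
from functools import cached_property


class Document:
    """Hypothetical stand-in for the library's Document (assumption)."""

    def __init__(self, raw: str):
        self.raw = raw          # construction does no sentence splitting
        self.split_calls = 0    # instrumentation for this sketch only

    @cached_property
    def _sents(self) -> list:
        # Placeholder splitter (assumption): the real code would call sbd here.
        self.split_calls += 1
        parts = re.split(r"(?<=[.!?])\s+", self.raw.strip())
        return [p for p in parts if p]

    # Overloaded methods touch the cached property, so splitting runs
    # at most once and only when sentences are actually needed.
    def __len__(self) -> int:
        return len(self._sents)

    def __getitem__(self, index):
        return self._sents[index]

    def __iter__(self):
        return iter(self._sents)


doc = Document("Merhaba dünya. Bugün hava güzel! Nasılsın?")
# Token- or vector-level work that only needs doc.raw never pays for splitting.
first_sentence = doc[0]   # first access triggers the split
sentence_count = len(doc) # later accesses reuse the cached list
```

Because `functools.cached_property` stores the computed value on the instance, callers that never index or iterate over the document skip sentence splitting entirely, which would be consistent with the speedups reported in the review comment.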