JuliaText / TextAnalysis.jl

Julia package for text analysis

Restricting dtm/tf_idf creation to only the top N features from the lexicon #71

Open pazzo83 opened 6 years ago

pazzo83 commented 6 years ago

I am working with a corpus of 100k+ documents, so the number of features in the lexicon is extremely high, and I'm running into memory issues as a result. I know that in scikit-learn's TfidfVectorizer and similar approaches you can limit the number of features, so that you are only dealing with the top N features when working with the dtm and tf_idf matrices. Is there some way to do that with this package, or are there plans to add such a feature?

Thanks!

aviks commented 6 years ago

Not yet, but might be worth adding.
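
For readers landing on this thread, here is a minimal sketch of the "top N features" idea being requested, written in plain Python with only the standard library. It is not TextAnalysis.jl code and not how any library implements it. The function names top_n_vocabulary and sparse_dtm are hypothetical, and ranking terms by total count across the corpus is an assumption modeled on how scikit-learn describes its max_features option.

```python
from collections import Counter


def top_n_vocabulary(tokenized_docs, n):
    """Map the n most frequent corpus terms to column indices (hypothetical helper)."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    # Rank by descending corpus count; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(term for term, _ in ranked[:n])
    return {term: col for col, term in enumerate(kept)}


def sparse_dtm(tokenized_docs, vocabulary):
    """Build a document-term matrix as {(row, col): count}.

    Terms outside the restricted vocabulary are skipped, so the matrix
    never has more than len(vocabulary) columns.
    """
    dtm = {}
    for row, tokens in enumerate(tokenized_docs):
        for token in tokens:
            col = vocabulary.get(token)
            if col is not None:
                dtm[(row, col)] = dtm.get((row, col), 0) + 1
    return dtm


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vocab = top_n_vocabulary(docs, n=3)
dtm = sparse_dtm(docs, vocab)
```

Because the vocabulary is fixed before the matrix is built, memory grows with the number of non-zero entries in at most N columns rather than with the full lexicon. A tf_idf weighting could then be computed over the same restricted columns.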