I'm working with a corpus of 100k+ documents, so the lexicon has a very large number of features and I'm running into memory issues. In scikit-learn's TfidfVectorizer and similar tools, you can cap the vocabulary (via max_features) so that the dtm and tf_idf matrices only contain the top N features. Is there a way to do that with this package, or are there plans to add such a feature?
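To make the request concrete, here is a rough sketch in plain Python of the behavior I mean. It does not use this package's API; `top_n_vocabulary` and `count_matrix` are hypothetical helper names, and whitespace tokenization is only a stand-in for a real tokenizer. The idea is to rank terms by their total count across the corpus, keep the top N, and build the matrix over that reduced vocabulary only.

```python
from collections import Counter


def top_n_vocabulary(docs, n):
    """Return the n most frequent terms across all docs (hypothetical helper)."""
    # Count every token in the corpus; whitespace splitting is an assumption.
    counts = Counter(token for doc in docs for token in doc.lower().split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def count_matrix(docs, vocab):
    """Build a dense document-term count matrix restricted to vocab."""
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.lower().split():
            # Terms outside the capped vocabulary are simply dropped.
            if token in index:
                row[index[token]] += 1
        rows.append(row)
    return rows


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vocab = top_n_vocabulary(docs, 3)
dtm = count_matrix(docs, vocab)
```

With the cap in place, the matrix has N columns regardless of how many distinct terms the corpus contains, which is what would keep memory manageable at 100k+ documents.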
Thanks!