Dimensionality reduction

Thanks!

Right now there's no way to exclude terms during corpus construction. However, after the corpus is constructed, you can easily remove outlying terms. For example:

import numpy as np

# Remove bigrams from corpus.
corpus = corpus.get_unigram_corpus()

# Create a pandas Series indexed on words containing their frequencies
term_frequencies = corpus.get_term_freq_df().sum(axis=1)

# Get the terms in the 99th and 1st percentiles
terms_99th_pctl = term_frequencies[term_frequencies >= np.percentile(term_frequencies, 99)].index
terms_1st_pctl = term_frequencies[term_frequencies <= np.percentile(term_frequencies, 1)].index

# Remove them from the corpus (Index.union, since `|` is no longer a set union on a pandas Index in pandas 2.x)
reduced_corpus = corpus.remove_terms(terms_99th_pctl.union(terms_1st_pctl))
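
One thing worth checking before applying those cutoffs: word counts are usually very skewed, so the 1st-percentile threshold often equals the minimum frequency, and `<=` then drops every term that occurs that rarely. Here's a rough, Scattertext-free sketch with made-up counts to show the effect. The `percentile` helper is hypothetical (not part of Scattertext); it is just assumed to follow the same linear-interpolation rule that `np.percentile` uses by default.

```python
from collections import Counter


def percentile(values, q):
    """Hypothetical helper: q-th percentile with linear interpolation."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Toy term frequencies (invented for illustration): one very common term,
# three mid-frequency terms, and 96 terms that occur exactly once.
term_frequencies = Counter({"the": 500, "data": 40, "model": 35, "plot": 20})
term_frequencies.update({f"rare{i}": 1 for i in range(96)})

counts = list(term_frequencies.values())
low_cutoff = percentile(counts, 1)
high_cutoff = percentile(counts, 99)

# Same comparisons as above: <= for the low tail, >= for the high tail.
terms_low = {t for t, c in term_frequencies.items() if c <= low_cutoff}
terms_high = {t for t, c in term_frequencies.items() if c >= high_cutoff}
kept = set(term_frequencies) - terms_low - terms_high

print(f"low cutoff={low_cutoff}, high cutoff={high_cutoff}")
print(f"dropped low={len(terms_low)}, dropped high={len(terms_high)}")
print(f"kept={sorted(kept)}")
```

In this toy vocabulary the low cutoff is 1, so all 96 singleton terms go and only three terms survive. If that's more aggressive than you want, comparing `len(terms_1st_pctl)` against the vocabulary size before calling `remove_terms` is a cheap sanity check.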

Hope this helps!

JasonKessler / Scattertext-PyData

Dimensionality reduction #2

Can we discard the 1st and 99th percentiles of words here?