JasonKessler / Scattertext-PyData

Notebooks for the Seattle PyData 2017 talk on Scattertext

Dimensionality reduction #2

Open ebaggott opened 6 years ago

ebaggott commented 6 years ago

This is great! How can one incorporate dimensionality reduction into the pipeline? For substantive and speed reasons, I'd like to exclude the most and least common words:

corpus = st.CorpusFromPandas(df, category_col='country', text_col='text', nlp=nlp,
                             # can we discard 1st and 99th percentile of words here?
                             ).build()

JasonKessler commented 6 years ago

Thanks!

Right now there's no way to exclude terms during corpus construction. However, after the corpus is constructed, you can easily remove outlying terms. For example:

import numpy as np

# Remove bigrams from the corpus, keeping only unigrams.
corpus = corpus.get_unigram_corpus()

# Create a pandas Series, indexed by term, holding each term's total frequency across categories
term_frequencies = corpus.get_term_freq_df().sum(axis=1)

# Get the terms at or above the 99th percentile and at or below the 1st percentile
terms_99th_pctl = term_frequencies[term_frequencies >= np.percentile(term_frequencies, 99)].index
terms_1st_pctl = term_frequencies[term_frequencies <= np.percentile(term_frequencies, 1)].index

# Remove them from the corpus. Index.union is used because `|` on a pandas Index
# is no longer a set union in recent pandas versions.
reduced_corpus = corpus.remove_terms(terms_99th_pctl.union(terms_1st_pctl))
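
To see what these percentile cutoffs do in practice, here is a minimal standard-library sketch on a toy word-count table. It does not use Scattertext: `percentile_outliers`, `docs`, and the toy counts are hypothetical names and data made up for illustration, and `statistics.quantiles(..., method="inclusive")` stands in for `np.percentile` (both interpolate linearly between sorted data points).

```python
from collections import Counter
from statistics import quantiles


def percentile_outliers(term_counts, low_pct=1, high_pct=99):
    """Return terms whose count is <= the low percentile or >= the high percentile.

    Hypothetical helper for illustration only; not part of Scattertext.
    """
    counts = sorted(term_counts.values())
    # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = quantiles(counts, n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    return {term for term, count in term_counts.items()
            if count <= low or count >= high}


# Toy data: three tiny "documents" split on whitespace.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]
term_counts = Counter(word for doc in docs for word in doc.split())

outliers = percentile_outliers(term_counts)
kept = {term: count for term, count in term_counts.items()
        if term not in outliers}

print(sorted(outliers))
print(sorted(kept))
```

In this toy table the 1st-percentile cutoff equals 1, so the `<=` comparison drops every word that occurs exactly once, not just 1% of the vocabulary. Word frequencies in real text tend to have a long tail of such words, so it may be worth checking how many terms the low cutoff actually removes before relying on it.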

Hope this helps!