JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Simple Example uses non-existent PMI argument #122

Open polm opened 1 year ago

polm commented 1 year ago

Thanks for working on this package. I was updating the entry in the spaCy Universe (https://github.com/explosion/spaCy/pull/11937#pullrequestreview-1208010525) and we noticed the sample here uses an argument that doesn't seem to work with the latest release.

https://github.com/JasonKessler/scattertext/blob/8ddff82f670aa2ed40312b2cdd077e7f0a98a873/simple.py#L19

JasonKessler commented 1 year ago

Thanks for pointing this out and for including Scattertext in the spaCy Universe. I'm preparing to deprecate the produce_scattertext_html function, and I think it would be best if the spaCy Universe page included an example that uses more of Scattertext's features and renders a more interactive UI. For example:

import scattertext as st
import spacy

nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(nlp)
)

corpus = st.CorpusFromParsedDocuments(
    df, 
    category_col='party', 
    parsed_col='parse'
).build().get_stoplisted_unigram_corpus().compact(st.AssociationCompactor(2000))

html = st.produce_scattertext_explorer(
    corpus,
    category='democrat', 
    category_name='Democratic', 
    not_category_name='Republican',
    minimum_term_frequency=0, 
    pmi_threshold_coefficient=0,
    width_in_pixels=1000, 
    metadata=lambda corpus: corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank
)
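# Write as UTF-8 so any non-ASCII characters in the output are saved consistently across platforms
with open('./demo_compact.html', 'w', encoding='utf-8') as of:
    of.write(html)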

Regardless, I'll update the package to ensure the pmi_filter_thresold argument still works.
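
For readers wondering what keeping an old argument working can look like, here is a minimal, generic sketch of accepting a deprecated keyword as an alias. It is not Scattertext's actual implementation: the function `render_html_sketch`, its default value, and the choice to map `pmi_filter_thresold` onto `pmi_threshold_coefficient` are all assumptions made for illustration.

```python
import warnings


def render_html_sketch(category, pmi_threshold_coefficient=8, **kwargs):
    """Hypothetical stand-in for a rendering function; not part of Scattertext.

    The default of 8 is arbitrary and chosen only for this example.
    """
    # Accept the old keyword as an alias, but warn callers that it is deprecated.
    if 'pmi_filter_thresold' in kwargs:
        warnings.warn(
            "'pmi_filter_thresold' is deprecated; "
            "use 'pmi_threshold_coefficient' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        pmi_threshold_coefficient = kwargs.pop('pmi_filter_thresold')

    # Any other unknown keyword is still an error, as it would be without **kwargs.
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")

    # Return the resolved settings instead of real HTML to keep the sketch small.
    return {
        'category': category,
        'pmi_threshold_coefficient': pmi_threshold_coefficient,
    }


if __name__ == '__main__':
    # The old spelling still works, but triggers a DeprecationWarning.
    print(render_html_sketch('democrat', pmi_filter_thresold=4))
```

With this pattern, existing callers keep working while the warning points them to the newer name; the alias can then be dropped in a later release.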

polm commented 1 year ago

Ah, thanks for the info about the example! We've already merged the PR I linked to, but if you'd like to update the Universe entry we'd be happy to look at a PR any time. (That said, we're currently working on our website backend, so any updates in the immediate future won't go live for a bit.)