JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Extracting Charts From Scattertext Explorer #59

Closed fjubair closed 4 years ago

fjubair commented 4 years ago

I am using the Scattertext explorer on a corpus of millions of tweets. The HTML produced by st.produce_scattertext_explorer is too big. Is there a way to extract or show only the chart from the HTML? Thank you.

JasonKessler commented 4 years ago

Two things:

  1. In produce_scattertext_explorer (or whatever function you're using to generate the chart), pass in max_docs_per_category=0. That will make sure that no documents are included in the chart.

  2. Make sure that no more than 4,000 terms are being used to create the chart. The more terms you render, the more circles your browser has to render and the more labels Scattertext has to assign. One way to do this is to run your corpus through an AssociationCompactor, as in the example at the top of the Readme.

Below, only the 2000 words which are most associated with either category are kept.

corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))
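
To make the idea of compaction concrete, here is a minimal, library-free sketch of pruning a vocabulary to the terms that most distinguish two categories. It is only an illustration: the function name top_associated_terms, the toy documents, and the frequency-difference score are all assumptions made for this example, and AssociationCompactor uses its own scoring rather than this one.

```python
from collections import Counter


def top_associated_terms(docs_by_category, n_terms):
    """Keep the n_terms unigrams whose relative frequency differs most
    between two categories.

    Hypothetical stand-in for the idea behind compaction; this is NOT
    how Scattertext's AssociationCompactor scores terms.
    """
    # Expects exactly two categories, e.g. {'democrat': [...], 'republican': [...]}.
    (_, docs_a), (_, docs_b) = docs_by_category.items()
    counts_a = Counter(word for doc in docs_a for word in doc.lower().split())
    counts_b = Counter(word for doc in docs_b for word in doc.lower().split())
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    vocab = set(counts_a) | set(counts_b)
    # Score each term by the absolute gap in relative frequency.
    score = {
        term: abs(counts_a[term] / total_a - counts_b[term] / total_b)
        for term in vocab
    }
    # Highest score first; break ties alphabetically so output is deterministic.
    ranked = sorted(vocab, key=lambda term: (-score[term], term))
    return ranked[:n_terms]


# Toy data, invented for this example.
docs = {
    "democrat": ["jobs and health care", "health care for all"],
    "republican": ["tax cuts and jobs", "lower tax rates"],
}
kept = top_associated_terms(docs, n_terms=3)
print(kept)
```

Terms shared about equally by both categories (here "and" and "jobs") score near zero and are dropped first, which is why compaction reduces the number of points the browser has to draw without losing the most distinctive terms.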

fjubair commented 4 years ago

Thank you very much for your help

fjubair commented 4 years ago

Hello Jason,

What about an Empath visualization corpus? For example, adding AssociationCompactor to the code below gives me an error. Is AssociationCompactor not actually needed in this case? Again, my goal is to keep the HTML file generated by produce_scattertext_explorer small enough for the browser to render. Thank you very much.

empath_corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='party',
    feats_from_spacy_doc=feat_builder,
    parsed_col='text'
).build()

On Mon, May 4, 2020 at 1:10 AM Jason S. Kessler notifications@github.com wrote:

Closed #59 https://github.com/JasonKessler/scattertext/issues/59.

JasonKessler commented 4 years ago

Hmm. I’ll have to look into why the compactor’s not working with Empath, but it shouldn’t be necessary. Just make sure your max_docs_per_category is 0.