JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

empath visualisation doesn't work with non binary categories #31

Open swartchris8 opened 6 years ago

swartchris8 commented 6 years ago

I can't click on nodes in the empath visualisation to see the relevant text. Clicking a node produces the error below (the property number differs from node to node), and no text is rendered under the visualisation.

Browser error:

Billingpayment-Visualization.html:4484 Uncaught TypeError: Cannot read property '14' of undefined
    at searchInExtraFeatures (Billingpayment-Visualization.html:4484)
    at gatherTermContexts (Billingpayment-Visualization.html:4453)
    at SVGTextElement.<anonymous> (Billingpayment-Visualization.html:5027)
    at SVGTextElement.<anonymous> (d3.min.js:2)

Python code to generate visualisation:


import scattertext as st
from IPython.display import IFrame

convention_df = st.SampleCorpora.ConventionData2012.get_data()
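# Relabel a few speeches with a single .iloc call (avoids chained assignment).
party_col = convention_df.columns.get_loc("party")
convention_df.iloc[3, party_col] = "liberal"
convention_df.iloc[4, party_col] = "republican"
convention_df.iloc[5, party_col] = "liberal"
convention_df.iloc[6, party_col] = "republican"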

empath_corpus = st.CorpusFromParsedDocuments(convention_df.iloc[:15],
                                             category_col="party",
                                             feats_from_spacy_doc=st.FeatsFromOnlyEmpath(),
                                             parsed_col="text").build()

html = st.produce_scattertext_explorer(empath_corpus,
    category = 'democrat',
    category_name = 'democrat',
    not_category_name = "Not democrat",
    width_in_pixels=1000,
    use_non_text_features=True,
    use_full_doc=True)

file_name = 'democrat.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)


swartchris8 commented 6 years ago

It seems the issue isn't with multiple categories but with the empath visualisation itself. The following snippet with 2 categories still fails:

import scattertext as st
from IPython.display import IFrame

convention_df = st.SampleCorpora.ConventionData2012.get_data()
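# Relabel a few speeches with a single .iloc call (avoids chained assignment).
party_col = convention_df.columns.get_loc("party")
convention_df.iloc[3, party_col] = "liberal"
convention_df.iloc[4, party_col] = "republican"
convention_df.iloc[5, party_col] = "liberal"
convention_df.iloc[6, party_col] = "republican"
# Collapse everything that is not "democrat" into one category.
# .loc with a boolean mask modifies the frame itself; df[mask]["party"] = ... only changes a copy.
convention_df.loc[convention_df["party"] != "democrat", "party"] = "not democrat"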

empath_corpus = st.CorpusFromParsedDocuments(convention_df[:14],
                                             category_col="party",
                                             feats_from_spacy_doc=st.FeatsFromOnlyEmpath(),
                                             parsed_col="text").build()

html = st.produce_scattertext_explorer(empath_corpus,
    category = 'democrat',
    category_name = 'democrat',
    not_category_name = "Not democrat",
    width_in_pixels=1000,
    use_non_text_features=True,
    use_full_doc=True)

file_name = 'democrat.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

JasonKessler commented 6 years ago

Thanks for the bug report.

I just made some significant improvements to the topic modeling component in Scattertext. You can not only view documents that match an empath category, but if you add

topic_model_term_lists=st.FeatsFromOnlyEmpath().get_top_model_term_lists()

as a parameter to produce_scattertext_explorer, it will bold the terms associated with the empath category. Please see https://github.com/JasonKessler/scattertext#visualizing-topic-models for more information.
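
For reference, here is a minimal sketch of how that parameter might be wired into the snippets above, assuming scattertext and its empath dependency are installed. The helper names `build_empath_explorer_html` and `save_html` are hypothetical and not part of scattertext; the scattertext calls are the ones already used in this thread, with a single `FeatsFromOnlyEmpath` instance shared between the corpus and the term lists.

```python
from pathlib import Path


def build_empath_explorer_html():
    """Build the empath explorer HTML with topic terms bolded (hypothetical helper)."""
    # Imported here so that save_html below can be used without scattertext installed.
    import scattertext as st

    convention_df = st.SampleCorpora.ConventionData2012.get_data()

    # Reuse one feature builder for both the corpus and the term lists.
    feat_builder = st.FeatsFromOnlyEmpath()
    empath_corpus = st.CorpusFromParsedDocuments(
        convention_df,
        category_col="party",
        feats_from_spacy_doc=feat_builder,
        parsed_col="text",
    ).build()

    return st.produce_scattertext_explorer(
        empath_corpus,
        category="democrat",
        category_name="democrat",
        not_category_name="Not democrat",
        width_in_pixels=1000,
        use_non_text_features=True,
        use_full_doc=True,
        # The extra parameter suggested in the comment above.
        topic_model_term_lists=feat_builder.get_top_model_term_lists(),
    )


def save_html(html, file_name):
    """Write the explorer HTML as UTF-8 and return the path (hypothetical helper)."""
    path = Path(file_name)
    path.write_text(html, encoding="utf-8")
    return path
```

Calling `save_html(build_empath_explorer_html(), "democrat.html")` should write a page that can then be displayed with `IFrame`, as in the snippets above.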