JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Labeling Issue during visualization of Scaled F-Score #128

Closed fatihbozdag closed 1 year ago

fatihbozdag commented 1 year ago

Greetings all,

I've applied the Scaled F-Score for term associations; however, the output shows 'Top NONE' rather than the actual name of the category. There is no such issue with the 'not category' label, nor with visualization through dispersion and frequency.

def create_corpus(df, category_col, text_col, nlp):
    return (
        st.CorpusFromPandas(df, category_col=category_col, text_col=text_col, nlp=nlp)
        .build()
        .remove_terms(_stop_words.ENGLISH_STOP_WORDS, ignore_absences=True)
        .get_unigram_corpus()
        .compact(st.AssociationCompactor(2000))
    )

corpus = create_corpus(documents, 'Topic', 'text_field', nlp)

dispersion = st.Dispersion(corpus)

dispersion_df = dispersion.get_df()

html = st.produce_scattertext_explorer(corpus,
                                       category='Common_Topic',
                                       category_name='Common Topic Terms',
                                       not_category_name='Other Topics',
                                       width_in_pixels=1000,
                                       metadata=documents['Native_Language'],
                                       jitter=0.1,
                                       minimum_term_frequency=5,
                                       transform=st.Scalers.percentile)
open("EFL_Learners_Visualization.html", 'wb').write(html.encode('utf-8'))
[Screenshot (2023-04-21): visualization with the category label shown correctly]

Here the labeling in the upper right is correct, reading 'Common Topic Terms'. However:

from scipy.stats import hmean
term_freq_df = corpus.get_unigram_corpus().get_term_freq_df()[['Common_Topic freq', 'Other_Topics freq']]
term_freq_df = term_freq_df[term_freq_df.sum(axis=1) > 0]
term_freq_df['pos_precision'] = (term_freq_df['Common_Topic freq'] * 1. /
                                 (term_freq_df['Common_Topic freq'] + term_freq_df['Other_Topics freq']))
term_freq_df['pos_freq_pct'] = (term_freq_df['Common_Topic freq'] * 1.
                                / term_freq_df['Common_Topic freq'].sum())
term_freq_df['pos_hmean'] = (term_freq_df
                             .apply(lambda x: (hmean([x['pos_precision'], x['pos_freq_pct']])
                                               if x['pos_precision'] > 0 and x['pos_freq_pct'] > 0
                                               else 0), axis=1))
freq = term_freq_df.pos_freq_pct.values
prec = term_freq_df.pos_precision.values
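For intuition, the pos_hmean score computed above is the harmonic mean of a term's precision and its share of category frequency, which penalizes terms that are low on either axis. A minimal pure-Python check (the example values are made up for illustration; scipy.stats.hmean computes the same quantity):

```python
# Harmonic mean of two rates, as used for the pos_hmean score above.
def harmonic_mean(a, b):
    if a <= 0 or b <= 0:
        return 0.0  # mirrors the zero guard in the apply() above
    return 2 * a * b / (a + b)

# A very precise but rare term scores near zero; a balanced term scores higher.
print(harmonic_mean(0.9, 0.001))  # rare term -> ~0.002
print(harmonic_mean(0.5, 0.4))    # balanced term -> ~0.444
```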
html = st.produce_scattertext_explorer(
    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
    category='Common_Topic',
    not_category_name='Other_Topics',
    not_categories=['Other_Topics'],

    x_label='Portion of words used in common topic',
    original_x=freq,
    x_coords=(freq - freq.min()) / freq.max(),
    x_axis_values=[int(freq.min() * 1000) / 1000.,
                   int(freq.max() * 1000) / 1000.],

    y_label='Portion of documents containing words that are included in Common Topics',
    original_y=prec,
    y_coords=(prec - prec.min()) / prec.max(),
    y_axis_values=[int(prec.min() * 1000) / 1000.,
                   int((prec.max() / 2.) * 1000) / 1000.,
                   int(prec.max() * 1000) / 1000.],
    scores=term_freq_df.pos_hmean.values,

    sort_by_dist=False,
    show_characteristic=False
)
open("EFL_Learners_Visualization_F_Score.html", 'wb').write(html.encode('utf-8'))
[Screenshot (2023-04-21): visualization showing 'Top None' in the upper right corner]

It says 'Top None' in the upper right corner. What am I missing or doing wrong?

JasonKessler commented 1 year ago

You need to set category_name = 'Common_topic' in produce_scattertext_explorer
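The reason the header reads 'Top None' is that category_name was never passed to the second call, so it stays at its default of None and is interpolated into the header text. A simplified illustration of that behavior (this is not scattertext's actual internal code, just a sketch of the default-None effect):

```python
# Simplified sketch: a header built from an optional category_name argument,
# mimicking how an unset keyword that defaults to None ends up on screen.
def header(category_name=None):
    return f"Top {category_name}"

print(header())                # -> "Top None"  (category_name not passed)
print(header("Common_Topic"))  # -> "Top Common_Topic"
```

Passing category_name explicitly in the second produce_scattertext_explorer call fixes the label.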

fatihbozdag commented 1 year ago

> You need to set category_name = 'Common_topic' in produce_scattertext_explorer

Sorry, how silly of me! I missed it. I was following the sample code on the page (here) and overlooked that argument. I've got it now.