JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

What are the default scores used in the HTML created by the scattertext.produce_scattertext_explorer function? #88

Closed. AnnaDai1001 closed this issue 3 years ago

AnnaDai1001 commented 3 years ago

Hello, I have been trying to figure out what the default "scores" are when the scattertext.produce_scattertext_explorer function creates an HTML file. I have read through the source code for hours but can't figure it out. Could anyone help me with this? I would really appreciate it. I created the HTML with the following piece of code:

    import scattertext as st

    # `corpus` is a scattertext corpus built earlier (not shown here).
    html = st.produce_scattertext_explorer(corpus,
                                           category='Positive',
                                           category_name='Positive',
                                           not_category_name='Negative',
                                           width_in_pixels=1000)

I didn't specify the scores parameter, so according to the source code it defaults to scores=None. I didn't specify term_scorer either, so it defaults to term_scorer=None. The source of produce_scattertext_explorer contains the check below, which is skipped when term_scorer is None, so I think scores will still be None. But in the HTML file the terms are ranked by some score. What are these scores? I have tried calculating different metrics for each category (e.g. frequency, scaled F-score, positive precision), but none of them matched the HTML file.

    if term_scorer:
        scores = get_term_scorer_scores(category, corpus, neutral_categories, not_categories, show_neutral, term_ranker,
                                        term_scorer, use_non_text_features)

Really appreciate your help!

JasonKessler commented 3 years ago

The RankDifference class is used. This part of the package is admittedly quite shambolic.

See https://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb for an explanation of this metric.
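For readers who want to check the numbers by hand, here is a minimal sketch of a rank-difference score. It assumes the score is the difference between each term's dense rank in the two categories, with each rank divided by the largest rank in its category. The library's actual RankDifference implementation may differ in details such as tie handling or which term ranker feeds it, so treat this as an illustration rather than the package's code. The helper names and the term counts are hypothetical.

```python
def dense_rank(values):
    """Rank values from 1 (smallest); ties share a rank and ranks have no gaps."""
    rank_of = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]


def rank_difference(cat_counts, not_cat_counts):
    """Difference of max-normalized dense ranks (assumed definition, see lead-in)."""
    cat_ranks = dense_rank(cat_counts)
    not_cat_ranks = dense_rank(not_cat_counts)
    cat_max = max(cat_ranks)
    not_cat_max = max(not_cat_ranks)
    return [c / cat_max - n / not_cat_max
            for c, n in zip(cat_ranks, not_cat_ranks)]


# Hypothetical term counts, not taken from the issue.
terms = ["great", "awful", "movie"]
positive_counts = [30, 2, 50]
negative_counts = [3, 25, 50]

scores = dict(zip(terms, rank_difference(positive_counts, negative_counts)))
for term, score in scores.items():
    print(f"{term}: {score:+.3f}")
```

Under this definition, a term that is ranked similarly in both categories scores near 0, while a term ranked much higher in the target category than in the other scores toward 1, and the reverse scores toward -1.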