JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

return_data parameter for produce_scattertext_explorer #63

Closed splevine closed 4 years ago

splevine commented 4 years ago

Thanks for a great tool!

I noticed some unexpected output that I wanted to bring to your attention. Following along with your first example, if you add the return_data=True parameter to produce_scattertext_explorer:

scatter_text_dict = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=convention_df['speaker'],
    return_data=True
)

scatter_text_dict is created with keys ['info','data','docs']

When examining scatter_text_dict['info'] the value is:

{'categories': ['democrat', 'republican'],
 'category_internal_name': 'democrat',
 'category_name': 'Democratic',
 'category_terms': ['government',
  'business',
  'better',
  'story',
  'paul',
  'success',
  'administration',
  'unemployment',
  'we need',
  'do better'],
 'extra_category_internal_names': [],
 'extra_category_name': 'Extra',
 'neutral_category_internal_names': [],
 'neutral_category_name': 'Neutral',
 'not_category_internal_names': ['republican'],
 'not_category_name': 'Republican',
 'not_category_terms': ['government',
  'business',
  'better',
  'story',
  'paul',
  'success',
  'administration',
  'unemployment',
  'we need',
  'do better']}

As you can see, the category_terms and not_category_terms values are identical: both lists contain the top terms for not_category_terms.
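A quick way to confirm the duplication programmatically is sketched below. It assumes only the structure of the 'info' dictionary shown above; the helper name shared_terms is hypothetical and not part of scattertext, and the term lists are copied from the output in this report.

```python
def shared_terms(info):
    """Return the terms listed under both 'category_terms' and 'not_category_terms'.

    Hypothetical helper for illustration; it is not part of scattertext.
    """
    not_category = set(info.get("not_category_terms", []))
    return [term for term in info.get("category_terms", []) if term in not_category]


# Term lists copied from the scatter_text_dict['info'] output shown above.
reported_terms = [
    "government", "business", "better", "story", "paul",
    "success", "administration", "unemployment", "we need", "do better",
]
info = {
    "category_terms": list(reported_terms),
    "not_category_terms": list(reported_terms),
}

overlap = shared_terms(info)
identical = info["category_terms"] == info["not_category_terms"]
print(f"{len(overlap)} shared terms; lists identical: {identical}")
```

With a real result, passing scatter_text_dict['info'] to the helper should work the same way, since it reads only those two keys. Some overlap between the lists is not necessarily a bug on its own, but two identical lists for opposing categories is a strong hint that something is off.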

Other examples (e.g., empath_features) behave similarly.

JasonKessler commented 4 years ago

Thanks a lot for the bug report. Those lists are vestigial, but version 0.0.2.66 contains an update which should make sure these lists contain the correct terms.
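To check whether an installed copy already includes that update, something along these lines may help. This is a minimal sketch: the helper names parse_version and has_fix are hypothetical, and it assumes version strings made up only of dot-separated integers, as in 0.0.2.66.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = "0.0.2.66"  # version named in the comment above


def parse_version(text):
    """Turn a purely numeric dotted version such as '0.0.2.66' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, fixed_in=FIXED_IN):
    """Return True if the installed version is at or above the fixed version."""
    return parse_version(installed) >= parse_version(fixed_in)


try:
    installed = version("scattertext")
    print(installed, "includes the fix:", has_fix(installed))
except PackageNotFoundError:
    print("scattertext is not installed")
except ValueError:
    # Raised by parse_version when a component is not a plain integer.
    print("installed version string is not purely numeric; compare it by hand")
```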