JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

[Feature Request] Adding more control during `nlp` parsing #56

Closed XiaomoWu closed 4 years ago

XiaomoWu commented 4 years ago

Hi there, thank you for the amazing tool. Is it possible to control which terms are included in the F-score computation? For example, I would like to score only terms that carry a "LOC" entity label (which can be obtained from spaCy's NER tagger).

If that's a feature that can be achieved with the current version, could you please show me how? Thanks.

JasonKessler commented 4 years ago

I've added a class called SpacyEntities to help extract spaCy named entities of particular types. If you want additional types of features or a custom set of named entities, please just subclass FeatsFromSpacyDoc.

Below is an example of extracting NAME and LOC entities and plotting them.

import scattertext as st
import spacy

# The 'en' shortcut was removed in spaCy 3; load a named model instead.
nlp = spacy.load('en_core_web_sm')

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: list(nlp.pipe(df.text))
)

corpus = st.CorpusFromParsedDocuments(
    df,
    category_col='party',
    parsed_col='parse',
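    # Entity types must match spaCy's labels; the English models tag people
    # as 'PERSON', so 'NAME' may match nothing.
    feats_from_spacy_doc=st.SpacyEntities(entity_types_to_use=['NAME', 'LOC'])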
).build()

html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank,
    max_overlapping=10
)
# Use a context manager so the file is flushed and closed.
with open('./demo_names.html', 'w') as f:
    f.write(html)
print('open ./demo_names.html in Chrome')
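
For the subclassing route mentioned above, here is a minimal sketch of the counting logic a custom feature extractor might use. It assumes (not confirmed in this thread) that a FeatsFromSpacyDoc subclass overrides a `get_feats(doc)` method and returns a `Counter` of term strings. The names `count_entities`, `LocationEntities`, and `fake_doc` are hypothetical and exist only for illustration. The stand-in document lets the sketch run without a spaCy model installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_entities(doc, entity_types):
    """Count normalized entity strings in `doc` whose label is in `entity_types`.

    `doc` only needs an `ents` iterable whose items expose `text` and `label_`,
    which is the shape of a parsed spaCy Doc.
    """
    return Counter(
        ' '.join(ent.text.split()).lower()  # collapse whitespace, lower-case
        for ent in doc.ents
        if ent.label_ in entity_types
    )


# Hypothetical wiring (an assumption, not executed here): a FeatsFromSpacyDoc
# subclass could return this Counter from get_feats, e.g.
#
#   class LocationEntities(st.FeatsFromSpacyDoc):
#       def get_feats(self, doc):
#           return count_entities(doc, {'LOC', 'GPE'})

# Stand-in for a parsed document so the sketch runs without spaCy.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text='Ohio  River', label_='LOC'),
    SimpleNamespace(text='ohio river', label_='LOC'),
    SimpleNamespace(text='Barack Obama', label_='PERSON'),
])

loc_counts = count_entities(fake_doc, {'LOC'})
print(loc_counts)
```

If the assumption about `get_feats` holds, an instance of such a subclass would be passed as `feats_from_spacy_doc=` in place of `st.SpacyEntities(...)` in the example above.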