JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Question about Phrase Association. #65

Closed: fjubair closed this issue 4 years ago

fjubair commented 4 years ago

Hello Jason,

I am visualizing phrase association (based on Handler et al. 2016) for a Twitter dataset. Below is my code:

    corpus = CorpusFromPandas(
        df,
        category_col='sentiment',
        text_col='parse',
        feats_from_spacy_doc=PhraseMachinePhrases(),
        nlp=spacy.load('en_core_web_sm', parser=False)
    ).build().compact(AssociationCompactor(4000))

I use produce_scattertext_explorer to generate an HTML page. However, the most frequent phrases include emoji sequences that partially overlap. For example, the following phrases appeared:

🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 🤣 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 😂 🤣 🤣 🤣 🤣 🤣 🤣 🤣

My question is: are these separate emoji expressions that came from unrelated tweets, or can, for example, a run of five laughing emojis actually be part of a run of six laughing emojis?

Thank you for your help.

JasonKessler commented 4 years ago

Phrase Machine identifies phrases using extraction patterns run on part-of-speech tags. This means that phrases are found in isolation, and therefore a phrase identified in one context may not be found in another.
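To make the mechanism concrete, here is a minimal toy sketch of pattern-based phrase extraction over part-of-speech tags. It is not PhraseMachine's actual grammar or scattertext's implementation: the single-letter tags, the `(?:A|N)*N` pattern, the `extract_phrases` helper, and the assumption that the tagger labels each emoji as a noun are all hypothetical. This toy emits every matching span, including nested ones; the thread does not settle whether the real extractor behaves the same way for emoji.

```python
import re
from collections import Counter

def extract_phrases(tagged_tokens, pattern=r"(?:A|N)*N", max_len=8):
    """Count every contiguous span whose tag string fully matches `pattern`."""
    tokens = [tok for tok, _ in tagged_tokens]
    # One character per token, so string offsets equal token offsets.
    tags = "".join(tag for _, tag in tagged_tokens)
    compiled = re.compile(pattern)
    counts = Counter()
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            if compiled.fullmatch(tags[start:end]):
                counts[" ".join(tokens[start:end])] += 1
    return counts

# Hypothetical tagging: A = adjective, N = noun, O = other.
# Assumption: each emoji is tagged as a noun.
tweet_a = [("so", "O"), ("funny", "A"), ("😂", "N"), ("😂", "N")]
tweet_b = [("lol", "O"), ("😂", "N"), ("😂", "N"), ("😂", "N")]

phrases_a = extract_phrases(tweet_a)
phrases_b = extract_phrases(tweet_b)
```

In this sketch, the two-emoji run and the three-emoji run end up as different keys in the counters, because each phrase is just the string of tokens covered by a matching span.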

This process leads to the unfortunate result you're seeing, where repeated sequences of the same emoji are treated as different phrases.
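One possible workaround, which is not something proposed in this thread, is to collapse runs of the same emoji in the raw tweet text before the corpus is built, so that runs of different lengths map to the same token. The sketch below is a rough illustration under stated assumptions: the `collapse_emoji_runs` helper is hypothetical, and the Unicode ranges are only an approximation that misses some emoji and does not handle multi-codepoint sequences.

```python
import re

# Rough emoji ranges (an assumption): they do not cover every emoji,
# and sequences joined with zero-width joiners are not handled.
EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"

# One emoji followed by one or more copies of itself,
# optionally separated by whitespace.
REPEATED_EMOJI = re.compile(rf"({EMOJI})(?:\s*\1)+")

def collapse_emoji_runs(text):
    """Replace each run of one repeated emoji with a single copy."""
    return REPEATED_EMOJI.sub(r"\1", text)

example = collapse_emoji_runs("lol 😂 😂 😂")
```

Note that this throws away the run length, which may itself carry sentiment information, so whether it is appropriate depends on the analysis.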

fjubair commented 4 years ago

Thank you very much for your reply.