JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

how can I use the original text in the snippets after cleaning? #54

Open mikkokotila opened 4 years ago

mikkokotila commented 4 years ago

Once I've removed stopwords using nltk or similar, I want to be able to see the original text snippets and not the ones without stopwords. How can I achieve that?

JasonKessler commented 4 years ago

The preferred way to remove stopwords in Scattertext is to pass the full documents into a Corpus factory and then use the Corpus.remove_terms method to create a corpus free of stopwords. You'll still be able to view the original documents in the Scattertext explorer.

For example:

import scattertext as st

convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(
    convention_df, category_col='party', parsed_col='parse'
).build()

stoplisted_corpus = corpus.remove_terms(['a', 'the'])
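
If you want to drop a full stopword list (for example NLTK's, as in the original question) rather than a hand-picked one, a sketch along these lines should work; intersecting with the corpus vocabulary avoids trying to remove stopwords that never occur in the corpus:

from nltk.corpus import stopwords

# Remove NLTK's English stopword list from the corpus built above.
stop_set = set(stopwords.words('english'))
stoplisted_corpus = corpus.remove_terms(stop_set & set(corpus.get_terms()))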

On the other hand, you could pass an alternative_text_field parameter into produce_scattertext_explorer or another compatible function. This is the name of a column in the data frame used to create the corpus; that column is what gets searched and displayed in the Scattertext visualization. However, the alternative text field is not used to compute the plot itself.
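
As a sketch of that second approach, continuing the example above (the convention sample data keeps the original, unstoplisted speeches in its text column):

# Plot the stoplisted corpus, but search and display the original 'text'
# column of convention_df when a term is clicked.
html = st.produce_scattertext_explorer(
    stoplisted_corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    alternative_text_field='text'
)
open('convention_stoplisted.html', 'w').write(html)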

sound118 commented 4 years ago

I am actually facing the same issue as mentioned above. I am making a Scattertext plot for Chinese, and I followed your instructions by passing an alternative_text_field parameter into produce_scattertext_explorer. When I click a term in the plot, no original text shows up in the snippets; in fact, nothing shows up there at all. How do I make the original text appear?

JasonKessler commented 4 years ago

Could you upload the example which fails to show snippets?

sound118 commented 4 years ago

@JasonKessler I just uploaded the example that can reproduce the issue, please see https://github.com/sound118/Scatter-text-for-Chinese

I used the "jieba" package to remove a stopword list and to load a user-defined dictionary that guards against incorrect Chinese term segmentation, and then applied your chinese_nlp: df['parsed_text'] = df['parsed_text'].apply(chinese_nlp)
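
(For reference, a rough sketch of that preprocessing, with hypothetical file paths and column names, since the exact script lives in the linked repository:)

import jieba

# Hypothetical paths: user-defined dictionary and stopword list.
jieba.load_userdict('user_dict.txt')
with open('stopwords.txt', encoding='utf-8') as f:
    stop_set = set(line.strip() for line in f)

def remove_stopwords(text):
    # Segment with jieba, drop stopwords, and re-join with spaces.
    return ' '.join(tok for tok in jieba.cut(text) if tok not in stop_set)

df['parsed_text'] = df['text'].apply(remove_stopwords)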

You can change the file paths to run the program on your local machine and reproduce the issue.

Thanks.

JasonKessler commented 4 years ago

I think the issue is that the alternative text field has to be whitespace-tokenized for the matcher to work.

sound118 commented 4 years ago

@JasonKessler, thanks for the hint. It works after adding df['text'] = df['text'].apply(chinese_nlp) to the uploaded program. At least the whitespace-tokenized alternative text field is still readable, as opposed to the parsed documents in Chinese. It would be even better if Scattertext could show the original, unsegmented Chinese documents in the snippets, should such a feature be added to the package. Nevertheless, it's elegant enough.

JasonKessler commented 4 years ago

Glad to hear it works.

It would be a good feature for someone in the community to pick up and build.

MastafaF commented 3 years ago

Hi @JasonKessler ,

I have the same issue here, and I could not solve it with your suggestion. Basically, I'm using the following code:


import scattertext as st
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

data = data.loc[:, ['id', 'language', 'ProcessedText', 'OriginalText']]

data['parse'] = data['ProcessedText'].apply(st.whitespace_nlp_with_sentences)

unigram_corpus = (st.CorpusFromParsedDocuments(data,
                                               category_col='language',
                                               parsed_col='parse')
                  .build().get_stoplisted_unigram_corpus())

html = st.produce_scattertext_explorer(
    unigram_corpus,
    category='French', category_name='French', not_category_name='German',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
    alternative_text_field='OriginalText',
    transform=st.Scalers.dense_rank
)

What I expect is to see the full text from the OriginalText column after clicking on a given word in the chart. However, at the moment I only see a chunk of that text.

For example, when clicking on the word 'thank', I would see something like the following:

Thank you! 

Whereas I expect to see the following instead:

This was a great moment. Thank you!

Basically, I do not want the chunking when searching for a given word in my text column. Can we achieve that? 😄

JasonKessler commented 3 years ago

Try adding use_full_doc=True as an argument to produce_scattertext_explorer. If that doesn't work, could you please post an independently runnable example which demonstrates the problem?
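
(For reference, that would be the same call as above with one extra argument; a sketch assuming the column names from the code above:)

# use_full_doc=True shows the entire OriginalText document in the snippet
# panel instead of a sentence-level chunk.
html = st.produce_scattertext_explorer(
    unigram_corpus,
    category='French', category_name='French', not_category_name='German',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=unigram_corpus.get_df()['language'],
    alternative_text_field='OriginalText',
    use_full_doc=True,
    transform=st.Scalers.dense_rank
)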

MastafaF commented 3 years ago

Works great @JasonKessler! Thanks 😄