JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Chinese scattertext #55

Open sound118 opened 4 years ago

sound118 commented 4 years ago

In your demo code, it looks like developers can use "chinese_nlp" directly from the scattertext package. For plotting Chinese scattertext, could we add a list of user-defined stopwords, and perhaps a user-defined dictionary specific to a particular Chinese context, then use jieba for word segmentation and feed the cleaned results into your demo program?

Thanks

JasonKessler commented 4 years ago

You can apply a stop list after tokenization by calling corpus.remove_terms(...). Otherwise, feel free to modify AsianNLP.py to fit your use case. It just duck-types spaCy's interface.
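
For anyone landing here later, a minimal sketch of both suggestions combined, not the library's own implementation. It assumes a tokenizer only needs to return an object with .sents (lists of tokens) whose tokens carry spaCy-style attributes such as lower_, lemma_, pos_, ent_type_ and tag_. That attribute set is an assumption, so check AsianNLP.py in your installed version for what is actually read. The names Tok, Doc, make_chinese_nlp and build_corpus are hypothetical helpers, as are the "category" and "text" column names and the userdict_path argument.

```python
import re


class Tok:
    """Minimal stand-in for a spaCy token (attribute set is an assumption)."""

    def __init__(self, text, pos="", ent_type="", tag=""):
        self.orth_ = text
        self.lower_ = text.lower()
        self.lemma_ = text.lower()
        self.pos_ = pos
        self.ent_type_ = ent_type
        self.tag_ = tag

    def __str__(self):
        return self.orth_


class Doc:
    """Minimal stand-in for a spaCy document: sentences are lists of Tok."""

    def __init__(self, sents, raw):
        self.sents = sents
        self.text = raw
        self.string = raw

    def __iter__(self):
        for sent in self.sents:
            yield from sent

    def __len__(self):
        return sum(len(sent) for sent in self.sents)

    def __str__(self):
        return self.text


# Naive sentence splitter: text up to and including a sentence-final mark.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def make_chinese_nlp(segment, stopwords=()):
    """Build an nlp(text) callable from a segmenter such as jieba.lcut."""
    stop = set(stopwords)

    def nlp(text):
        sents = []
        for sent_text in _SENTENCE.findall(text):
            # Drop whitespace-only tokens and stopwords at tokenization time.
            toks = [Tok(w) for w in segment(sent_text)
                    if w.strip() and w not in stop]
            if toks:
                sents.append(toks)
        return Doc(sents, text)

    return nlp


def build_corpus(df, stopwords, userdict_path=None):
    """Hypothetical wiring; needs jieba, pandas and scattertext installed."""
    import jieba
    import scattertext as st

    if userdict_path:
        jieba.load_userdict(userdict_path)  # domain-specific vocabulary
    nlp = make_chinese_nlp(jieba.lcut, stopwords)
    corpus = st.CorpusFromPandas(
        df, category_col="category", text_col="text", nlp=nlp
    ).build()
    # Post-tokenization stop list, as suggested above.
    return corpus.remove_terms(stopwords, ignore_absences=True)
```

Filtering inside the tokenizer and calling remove_terms afterwards overlap, so either one alone may be enough. One difference to keep in mind: if your feature extraction builds multi-word terms, dropping tokens before the corpus is built means those terms are formed across the removed words. Punctuation is not treated specially here, so add it to the stop list if you do not want it plotted.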