JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

AttributeError in plotting scattertext #101

Closed states786 closed 3 years ago

states786 commented 3 years ago

Hi,

I am learning scattertext. I have a dataset of English documents that are categorized and labeled D1, D2, D3, etc. A category such as D1 can contain multiple documents. The following is a sketch of the dataset:

Category; Text
D1; abc sdf.......
D1; jhs dgf....
D2; sdf dfh.....
. . . . . .
DN; xyz jha....

I would like to plot the corpus content with scattertext, but when I run the following code I get this error:

{AttributeError}("'numpy.str_' object has no attribute 'sents'", 'occurred at index 0')

import scattertext as st
import pandas as pd
import numpy as np

def readFile(filename):
   text_file = open(filename, 'r',encoding="utf8")
   data = text_file.read()
   corpuslist = data.split("\n")
   return corpuslist

category=readFile('categoryCorpus.txt')
textCorpus=readFile('textCorpus.txt')

categoryNp = np.array(category)
textCorpusNp = np.array(textCorpus)

fdf = pd.DataFrame({'category': categoryNp, 'text': textCorpusNp}, columns=['category', 'text'])

fcorpus = st.CorpusFromParsedDocuments(
   fdf, category_col='category', parsed_col='text'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))

fhtml = st.produce_scattertext_explorer(
   fcorpus,
   minimum_term_frequency=0, pmi_threshold_coefficient=0,
   width_in_pixels=1000,
   transform=st.Scalers.dense_rank
)

open('demo_compact.html', 'w').write(fhtml)

Could you please guide me on how to resolve it?

JasonKessler commented 3 years ago

The parsed_col argument must name a column in the data frame that contains spaCy Doc objects or something equivalent, not raw strings. Please refer to the first example in the readme.
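
For readers hitting the same error, here is a minimal sketch of one way to follow that advice. It is not code from the thread. The helper names add_parsed_column and build_corpus and the column name 'parse' are hypothetical, chosen only for illustration. The sketch assumes that st.whitespace_nlp_with_sentences, scattertext's lightweight whitespace-based parser, is an acceptable stand-in for a full spaCy pipeline. If a spaCy model is installed, a loaded nlp object could be passed as the parser instead.

```python
import pandas as pd


def add_parsed_column(df, parser, text_col="text", parsed_col="parse"):
    """Return a copy of df with an extra column of parsed documents.

    Hypothetical helper: `parser` is any callable that turns a string into
    a Doc-like object exposing `.sents` (for example a spaCy pipeline).
    """
    # Drop blank rows, which can appear when a file ends with a newline
    # and is split on "\n".
    non_empty = df[df[text_col].str.strip() != ""]
    return non_empty.assign(**{parsed_col: non_empty[text_col].apply(parser)})


def build_corpus(df, category_col="category", text_col="text"):
    """Build a scattertext corpus from raw text (hypothetical helper)."""
    # Imported here so add_parsed_column can be used without scattertext.
    import scattertext as st

    # Assumption: the whitespace parser is good enough for this data.
    parsed = add_parsed_column(
        df, st.whitespace_nlp_with_sentences, text_col=text_col
    )
    # parsed_col now points at Doc-like objects rather than raw strings.
    return st.CorpusFromParsedDocuments(
        parsed, category_col=category_col, parsed_col="parse"
    ).build()
```

With the data frame from the question, build_corpus(fdf) would take the place of the st.CorpusFromParsedDocuments(...).build() call, and the remaining .get_unigram_corpus().compact(...) chain could follow unchanged.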