JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

issues with my data #52

Closed vaidyan5 closed 4 years ago

vaidyan5 commented 4 years ago
----> 1 print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

~/anaconda3/lib/python3.7/site-packages/scattertext/TermDocMatrix.py in get_scaled_f_scores_vs_background(self, scaler_algo, beta)
    922             pd.DataFrame of scaled_f_score scores compared to background corpus
    923         '''
--> 924         df = self.get_term_and_background_counts()
    925         df['Scaled f-score'] = ScaledFScore.get_scores_for_category(
    926             df['corpus'], df['background'], scaler_algo, beta

~/anaconda3/lib/python3.7/site-packages/scattertext/TermDocMatrix.py in get_term_and_background_counts(self)
    879         '''
    880         background_df = self._get_background_unigram_frequencies()
--> 881         term_freq_df = self.get_term_freq_df()
    882         corpus_freq_df = pd.DataFrame({'corpus': term_freq_df.sum(axis=1)})
    883         corpus_unigram_freq = self._get_corpus_unigram_freq(corpus_freq_df)

~/anaconda3/lib/python3.7/site-packages/scattertext/TermDocMatrix.py in get_term_freq_df(self, label_append)
    160         return pd.DataFrame(mat,
    161                             index=pd.Series(self.get_terms(), name='term'),
--> 162                             columns=[c + label_append for c in self.get_categories()])
    163
    164     def get_term_freq_mat(self):

~/anaconda3/lib/python3.7/site-packages/scattertext/TermDocMatrix.py in <listcomp>(.0)
    160         return pd.DataFrame(mat,
    161                             index=pd.Series(self.get_terms(), name='term'),
--> 162                             columns=[c + label_append for c in self.get_categories()])
    163
    164     def get_term_freq_mat(self):

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Your Environment

  • Operating System: macOS Catalina
  • Python Version Used: 3.7
  • Scattertext Version Used: scattertext==0.0.2.28
  • Environment Information: jupytemplate==0.3.0 jupyter==1.0.0 jupyter-client==5.3.1 jupyter-console==6.0.0 jupyter-contrib-core==0.3.3 jupyter-contrib-nbextensions==0.5.1 jupyter-core==4.5.0 jupyter-highlight-selected-word==0.2.0 jupyter-latex-envs==1.4.6 jupyter-nbextensions-configurator==0.4.1 jupyterlab==1.0.2 jupyterlab-server==1.0.0

df_test.csv.zip

JasonKessler commented 4 years ago

Thanks for linking to your data. What’s needed is a code snippet that I can run on the latest version of Scattertext (0.0.2.59) that reproduces this error.

vaidyan5 commented 4 years ago

I believe that I had included the code snippet with the GitHub post.

vaidyan5 commented 4 years ago

Jason - here is the code snippet that caused the error

Thanks...

import scattertext as st
import spacy
from pprint import pprint

nlp = spacy.load('en')

corpus = st.CorpusFromPandas(df_test, category_col='loc_k', text_col='message', nlp=nlp).build()
print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

vaidyan5 commented 4 years ago

Just posted the code snippet.

Thanks

JasonKessler commented 4 years ago

Thanks! Could you please include the line(s) which create df_test?

vaidyan5 commented 4 years ago

Here you go:

df_test.to_csv("df_test.csv", index = False)

vaidyan5 commented 4 years ago

Just posted the line. Thanks.

JasonKessler commented 4 years ago

Are you sure that’s how you’re generating it?

vaidyan5 commented 4 years ago

Yes - this is how the file was generated.

Are leading spaces in the data causing the issue?

Nanda

JasonKessler commented 4 years ago

How are you reading in the data frame from the csv file?

vaidyan5 commented 4 years ago

Or is there an issue with string and integer casting?
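
On the casting question, the traceback above points at the expression `c + label_append`, which raises exactly this TypeError when a category label `c` is an int rather than a str. The following is a minimal sketch in plain Python, not Scattertext's own code. It assumes the `loc_k` values were read as integers (the thread never confirms this), and the `categories` list, the `" freq"` suffix, and the `build_column_names` helper are hypothetical stand-ins.

```python
# Hypothetical stand-ins: integer category labels, as a numeric CSV column
# might be read, and a string suffix appended to each label.
categories = [0, 1, 2]
label_append = " freq"  # assumed value; the traceback only shows the name


def build_column_names(cats, suffix):
    """Mirror the expression flagged in the traceback: c + label_append."""
    return [c + suffix for c in cats]


# With integer labels, int + str raises TypeError.
try:
    build_column_names(categories, label_append)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Casting each label to str before concatenation avoids the error.
fixed_columns = build_column_names([str(c) for c in categories], label_append)
print(fixed_columns)
```

With pandas, the comparable workaround would be to cast the category column to strings before building the corpus, for example `df_test['loc_k'] = df_test['loc_k'].astype(str)`. Whether that cast is needed depends on the Scattertext version in use.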

vaidyan5 commented 4 years ago

Using the following: df = pd.read_csv("file_name")

JasonKessler commented 4 years ago

Frankly, I just want to make sure that when I get to my computer, I have a snippet that I can copy and paste into python and reproduce the problem.

I have limited resources to address these problems, and I really don’t want to spend time figuring out what you did to cause an error.

vaidyan5 commented 4 years ago

Sure... That’s all I did. Thanks

Nanda

JasonKessler commented 4 years ago

Could you run print(st.version)?

JasonKessler commented 4 years ago

This does work for me:

>>> st.__version__
'0.0.2.59'
>>> df_test = pd.read_csv('df_test.csv')
>>> corpus = st.CorpusFromPandas(df_test,category_col='loc_k',text_col='message',nlp=nlp).build()
>>> print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))
['pahahahaha', 'lexxxxxiiiiii', 'lyanna', 'lmap', 'tonighy', 'draxxxxxxxx', 'lexxx', 'bestfriend', 'sellin', 'lmao']

vaidyan5 commented 4 years ago

Hi Jason, I ran print(st.version).

Output is [0 0 2 28]

Thanks

JasonKessler commented 4 years ago

You need to upgrade to the latest version.
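
The gap between the two versions reported in this thread (0.0.2.28 installed, 0.0.2.59 working) can be checked programmatically. The sketch below is only an illustration: `parse_version` is a hypothetical helper, not part of Scattertext, and it assumes versions are purely numeric and dot-separated, or given as a sequence of integers.

```python
def parse_version(version):
    """Turn '0.0.2.59' or [0, 0, 2, 28] into a comparable tuple of ints.

    Hypothetical helper; assumes purely numeric version components.
    """
    if isinstance(version, str):
        return tuple(int(part) for part in version.split("."))
    return tuple(int(part) for part in version)


# Versions taken from the thread: the reporter's install and the one that worked.
installed = parse_version([0, 0, 2, 28])
latest = parse_version("0.0.2.59")

# Tuples compare element by element, so (0, 0, 2, 28) < (0, 0, 2, 59).
needs_upgrade = installed < latest
print(needs_upgrade)
```

When the check reports an older install, upgrading is typically done with `pip install --upgrade scattertext`.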

vaidyan5 commented 4 years ago

Thanks a ton for the help... Appreciate it...

Nanda.