UChicago-CCA-2021 / Frequently-Asked-Questions

Repository to ask questions - please use the issues page to ask your questions.

TypeError: object of type 'float' has no len() when running lucem_illud.word_tokenize(s) and lucem_illud.sent_tokenize(x) #31

Closed jinfei1125 closed 3 years ago

jinfei1125 commented 3 years ago

Hi, when I try to tokenize my corpus following the example code, I get an unexpected TypeError. Everything is the same except for the data frame name, and the pf_df['text'] column holds strings, just like senReleasesDF['text']. I really don't understand why my code runs into this error...

Sample code:

#Apply our functions, notice each row is a list of lists now
senReleasesDF['tokenized_sents'] = senReleasesDF['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
senReleasesDF['normalized_sents'] = senReleasesDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])

My code (to avoid introducing typos, I copy-pasted the example and only changed the data frame name):

pf_df['tokenized_sents'] = pf_df['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
pf_df['normalized_sents'] = pf_df['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])

Looking at the contents of the two ['text'] columns, they appear to have the same data structure.



I also tried applying the functions manually to individual values instead of using the dataframe.apply() method with a lambda, and that actually works. So what's wrong with the code above?

The error message I got:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-6d3ca6599923> in <module>
      1 #Apply our functions, notice each row is a list of lists now
----> 2 pf_df['tokenized_sents'] = pf_df['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
      3 #senReleasesDF['normalized_sents'] = senReleasesDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, lemma=False) for s in x])
      4 pf_df['normalized_sents'] = pf_df['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])
      5 

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4106             else:
   4107                 values = self.astype(object)._values
-> 4108                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4109 
   4110         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-84-6d3ca6599923> in <lambda>(x)
      1 #Apply our functions, notice each row is a list of lists now
----> 2 pf_df['tokenized_sents'] = pf_df['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
      3 #senReleasesDF['normalized_sents'] = senReleasesDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, lemma=False) for s in x])
      4 pf_df['normalized_sents'] = pf_df['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])
      5 

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\lucem_illud\proccessing.py in sent_tokenize(word_list, model)
     83 
     84 def sent_tokenize(word_list, model=nlp):
---> 85     doc = model(word_list)
     86     sentences = [sent.string.strip() for sent in doc.sents]
     87     return sentences

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\spacy\language.py in __call__(self, text, disable, component_cfg)
    435         DOCS: https://spacy.io/api/language#call
    436         """
--> 437         doc = self.make_doc(text)
    438         if component_cfg is None:
    439             component_cfg = {}

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\spacy\language.py in make_doc(self, text)
    461 
    462     def make_doc(self, text):
--> 463         if len(text) > self.max_length:
    464             raise ValueError(
    465                 Errors.E088.format(length=len(text), max_length=self.max_length)

TypeError: object of type 'float' has no len()

Thanks for any help!

jinfei1125 commented 3 years ago

Problem solved... It turns out there were NaN values in my data frame, and dropna() saved my life. Thanks to Xi Cheng for the advice!
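For anyone hitting the same error: the traceback shows that sent_tokenize hands each cell to spaCy, which immediately calls len(text), so a float NaN in the column raises exactly this TypeError. A minimal sketch of the fix described above (pf_df and the 'text' column name are taken from the question; the data is invented):

```python
import pandas as pd

pf_df = pd.DataFrame({"text": ["Some speech text.", float("nan"), "More text."]})

# Drop rows with missing text before tokenizing; otherwise the float NaN
# reaches spaCy's len(text) check and raises the TypeError.
pf_df = pf_df.dropna(subset=["text"]).reset_index(drop=True)

# Every remaining cell is now a string and safe to pass to the tokenizers.
assert pf_df["text"].map(lambda x: isinstance(x, str)).all()
```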