Hi, when I tried to tokenize my corpus following the example code, I got an unexpected TypeError. Everything is the same except for the data frame name. The pf_df['text'] column holds strings in my data frame, just like senReleasesDF['text']. I really don't understand why my code runs into this error...
Sample code:
#Apply our functions, notice each row is a list of lists now
senReleasesDF['tokenized_sents'] = senReleasesDF['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
senReleasesDF['normalized_sents'] = senReleasesDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])
My code (to avoid typos, I just copy-pasted the sample code and changed the data frame name):
pf_df['tokenized_sents'] = pf_df['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
pf_df['normalized_sents'] = pf_df['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])
When I look into the content of the ['text'] column, the values appear to have the same data structure as in the example.
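For reference, this is roughly the kind of check I ran on the column (a sketch using only standard pandas calls; nothing here comes from the example notebook):
#Check what Python types are actually stored in the 'text' column
print(pf_df['text'].apply(type).value_counts())
#Count missing values, which pandas represents as float NaN
print(pf_df['text'].isna().sum())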
I also tried applying the functions manually instead of using the DataFrame.apply() method with a lambda, and that does work. So what's wrong with the code above?
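By "manually" I mean a plain loop along these lines (a sketch, assuming the same lucem_illud helpers as above):
#Loop over the rows directly instead of using Series.apply with a lambda
tokenized = []
for text in pf_df['text']:
    sents = lucem_illud.sent_tokenize(text)
    tokenized.append([lucem_illud.word_tokenize(s) for s in sents])
pf_df['tokenized_sents'] = tokenized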
The error message I got:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-6d3ca6599923> in <module>
1 #Apply our functions, notice each row is a list of lists now
----> 2 pf_df['tokenized_sents'] = pf_df['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
3 #senReleasesDF['normalized_sents'] = senReleasesDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, lemma=False) for s in x])
4 pf_df['normalized_sents'] = pf_df['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])
5
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4106 else:
4107 values = self.astype(object)._values
-> 4108 mapped = lib.map_infer(values, f, convert=convert_dtype)
4109
4110 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-84-6d3ca6599923> in <lambda>(x)
1 #Apply our functions, notice each row is a list of lists now
----> 2 pf_df['tokenized_sents'] = pf_df['text'].apply(lambda x: [lucem_illud.word_tokenize(s) for s in lucem_illud.sent_tokenize(x)])
3 #senReleasesDF['normalized_sents'] = senReleasesDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, lemma=False) for s in x])
4 pf_df['normalized_sents'] = pf_df['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s) for s in x])
5
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\lucem_illud\proccessing.py in sent_tokenize(word_list, model)
83
84 def sent_tokenize(word_list, model=nlp):
---> 85 doc = model(word_list)
86 sentences = [sent.string.strip() for sent in doc.sents]
87 return sentences
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\spacy\language.py in __call__(self, text, disable, component_cfg)
435 DOCS: https://spacy.io/api/language#call
436 """
--> 437 doc = self.make_doc(text)
438 if component_cfg is None:
439 component_cfg = {}
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\spacy\language.py in make_doc(self, text)
461
462 def make_doc(self, text):
--> 463 if len(text) > self.max_length:
464 raise ValueError(
465 Errors.E088.format(length=len(text), max_length=self.max_length)
TypeError: object of type 'float' has no len()
Thanks for any help!