estnltk / estnltk

Open source tools for Estonian natural language processing
GNU General Public License v2.0

Ner and Lemma failed with RuntimeError: CFSException: internal error with vabamorf #99

Open kyaes opened 5 years ago

kyaes commented 5 years ago

Hi,

we run NER and lemmatization from estnltk with Hadoop and Spark, and we continuously get the error RuntimeError: CFSException: internal error with vabamorf.

Environment: Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-124-generic x86_64)
Version used: estnltk in /usr/local/lib/python2.7/dist-packages (1.4.1.1)
Spark: 1.6 and 2.1.0 (with YARN, run on a cluster of 3 servers)
Python: 2.7
Example of a typical process: https://github.com/kyaes/Data-Loading/blob/master/NER_ERR.py
When it appears: shuffle write data size approx >= 10 GB of text, when we read data from the database with Spark (create a dataframe) and apply NER and lemmatization to the dataframe columns.

Everything works well on small amounts of data: with the ner/lemma functions we can process approx 10 000 rows at most. On larger amounts of data (e.g. 13 000 000 rows) it starts to fail. The text length of a single column value ranges from approx 1 000 up to 1 166 355 chars. The process behavior: it starts to process the data rows and, after a short time, fails with the vabamorf error or an interrupted socket connection:

    File "/usr/local/lib/python2.7/dist-packages/estnltk/vabamorf/morf.py", line 165, in analyze
        kwargs.get('propername', True))
    RuntimeError: CFSException: internal error with vabamorf

Function used:

from estnltk import Text as esttext

def remove_names_ee(input_text):
    text = esttext(input_text)
    remove_char_indexes = []
    for named_entity_span in text.named_entity_spans:
        remove_char_indexes.extend(range(named_entity_span[0], named_entity_span[1]))
    # A set makes the membership test below O(1) per character,
    # which matters on texts with >1 000 000 chars.
    remove_char_indexes = set(remove_char_indexes)
    input_text = "".join([char for i, char in enumerate(input_text) if i not in remove_char_indexes])
    return input_text
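The span-removal logic itself can be checked without estnltk by passing hand-made (start, end) spans. A minimal sketch, assuming spans in the same half-open form that the function above uses (the helper name and the example spans are illustrative, not estnltk API):

```python
def remove_spans(input_text, spans):
    # Collect every character index covered by a (start, end) span;
    # a set makes the membership test below O(1) per character.
    remove_char_indexes = set()
    for start, end in spans:
        remove_char_indexes.update(range(start, end))
    # Keep only the characters whose index is not covered by any span.
    return "".join(char for i, char in enumerate(input_text)
                   if i not in remove_char_indexes)

# Example: strip the name "Mari" (characters 0..3) from a sentence.
print(remove_spans("Mari laulab.", [(0, 4)]))  # -> " laulab."
```

With real data, the spans would come from text.named_entity_spans instead of being hand-made.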

def get_names_ee(input_text):
    text = esttext(input_text)
    return unicode(", ".join(i for i in text.named_entities))

Same error appears for lemmatization:

def lemmatizer_ee2(input_text):
    #input_text = str(input_text)
    input_text = unicode(input_text)
    est_text = esttext(input_text)
    #print est_text.lemmas
    lemma = []
    words, types = est_text.get.lemmas.forms.as_list
    for word, form in zip(words, types):
        if word == 'olema' and form == 'neg o':
            lemma.append('pole')
        elif word == 'ei' and form == 'neg':
            lemma.append('ei')
        else:
            lemma.append(word.split('|')[0])
    #lemma = " ".join([ll for ll in lemma if (len(ll) > 2)])
    lemma = " ".join(lemma)
    return unicode(lemma)
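Until the root cause is fixed, a defensive wrapper can at least keep one bad row from killing the whole Spark job and record which rows fail. A sketch with a generic analyzer callable and a stand-in that mimics the reported failure (the names safe_apply and fake_analyze are illustrative, not estnltk API):

```python
def safe_apply(analyze, input_text, fallback=u""):
    # Run the analyzer; on the vabamorf RuntimeError, return a fallback
    # value instead of crashing the executor, and log the failing input.
    try:
        return analyze(input_text)
    except RuntimeError as err:
        print("analysis failed (%s) on text of length %d" % (err, len(input_text)))
        return fallback

# Stand-in analyzer that mimics the failure on oversized inputs.
def fake_analyze(text):
    if len(text) > 10:
        raise RuntimeError("CFSException: internal error with vabamorf")
    return text.upper()

print(safe_apply(fake_analyze, "lumi"))              # -> LUMI
print(safe_apply(fake_analyze, "a very long text"))  # -> empty fallback
```

In the Spark job, analyze would be remove_names_ee, get_names_ee, or lemmatizer_ee2; logging len(input_text) helps confirm whether failures correlate with input size.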

If you can give us any hint on how to avoid this error, that would be great!

Best Regards,
Liubov Kyaes
Data Engineer
liubov.kyaes@ir.ee
IR.ee

soras commented 5 years ago

Thank you for the report! We have encountered a very similar problem, but unfortunately we do not have a good fix for it yet. It seems that the vabamorf component (morphological analysis) runs into internal problems when it is applied to very long sentences. This also affects NER, because NER requires morphological analysis beforehand.

In estnltk version 1.6, our current workaround is to check the sentence length before applying vabamorf: if a sentence exceeds 15 000 words, we split it into 15 000-word chunks, apply vabamorf chunk by chunk, and glue the results back together afterwards. The workaround for v1.6 is here; you could try to do something similar for v1.4.
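For v1.4, the chunking idea described above can be approximated by splitting the word list before analysis and concatenating the per-chunk results. A sketch with a stand-in analyzer (the 15 000-word limit comes from the comment above; the function names and the stand-in are illustrative, not estnltk API):

```python
MAX_WORDS = 15000  # limit mentioned above; vabamorf reportedly fails beyond it

def analyze_in_chunks(analyze, words, max_words=MAX_WORDS):
    # Apply the analyzer to at most max_words words at a time and
    # concatenate the per-chunk results in the original order.
    results = []
    for start in range(0, len(words), max_words):
        chunk = words[start:start + max_words]
        results.extend(analyze(chunk))
    return results

# Stand-in analyzer: refuses oversized input, as vabamorf is reported to.
def fake_analyze(chunk):
    if len(chunk) > MAX_WORDS:
        raise RuntimeError("CFSException: internal error with vabamorf")
    return [w.lower() for w in chunk]

words = ["Tere"] * 40000
print(len(analyze_in_chunks(fake_analyze, words)))  # -> 40000
```

Note that chunking at an arbitrary word boundary can split a real sentence in two, so analyses near the cut points may be slightly less accurate; the v1.6 workaround accepts the same trade-off.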