explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Can't train external NER model on RONEC corpus using Spacy #5433

Closed luminitavoicu closed 4 years ago

luminitavoicu commented 4 years ago

Hello,

I attempted to use the RONEC corpus with Spacy for NER and I encountered some problems while following the tutorial for using Spacy in the RONEC project: https://github.com/dumitrescustefan/ronec/tree/master/spacy

Firstly, I cloned the repository and I tried to obtain the .json train and dev files using the convert_conllubio.py script and Spacy's convert tool as shown in the tutorial:

!python3 ronec/spacy/train-local-model/convert_conllubio.py ronec/ronec/conllup/raw/ronec.conllup .

!python -m spacy convert train_ronec.conllubio . --converter conllubio

When I ran the second command, for the train data set I got this error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/convert.py", line 106, in convert
    no_print=no_print,
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 25, in conllu2json
    for i, (raw_text, tokens) in enumerate(conll_tuples):
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/converters/conllu2json.py", line 68, in read_conllx
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: too many values to unpack (expected 10)

When I looked at the train_ronec.conllubio file, I noticed that there were 11 columns on the first line instead of 10, as shown below:

1 Tot tot ADV Rp 3 advmod _ *

I found that deleting the "*" on the first line solved this problem, but I couldn't really understand why this happened.
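A quick way to find every offending line, rather than only the first one, is to count the columns per token line before running the converter. The following is a minimal sketch, not part of spaCy or RONEC: the helper name find_bad_lines is hypothetical, and it assumes the file is tab-separated, that blank lines separate sentences, and that lines starting with "#" are comments.

```python
from typing import Iterable, List, Tuple

# The traceback shows the converter unpacking exactly ten fields per token line.
EXPECTED_COLUMNS = 10


def find_bad_lines(
    lines: Iterable[str], expected: int = EXPECTED_COLUMNS
) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for token lines with the wrong width.

    Hypothetical helper; assumes tab-separated columns.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip sentence separators (blank lines) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        count = len(line.split("\t"))
        if count != expected:
            bad.append((number, count))
    return bad
```

To use it on the real data, open the file with open("train_ronec.conllubio", encoding="utf-8") and pass the file object to find_bad_lines; an empty list means every token line has ten columns.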

I moved on with the tutorial and I attempted to train the open-source BILSTM-CNN model found here: https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs with Spacy's train tool, using this command:

!python3 -m spacy train ro Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/models/ train_ronec.json dev_ronec.json -p ner

I noticed a very strange behaviour for this: the model got stuck at 36%, no matter how much time I let it run. This is the output I got:

Training pipeline: ['ner']
Starting with blank model 'ro'
Counting training words (limit=0)
Itn NER Loss NER P NER R NER F Token % CPU WPS
36% 58105/159192 [00:10<00:16, 6216.60it/s]

Since it did not return any errors, I am not sure how to debug it, or if I am using it right.

Environment

I am running this on Google Colab. Here is some information about the environment:

adrianeboyd commented 4 years ago

I think their conversion script has a few minor bugs. Here's my updated version (that also shuffles the sentences, which you may not want):

https://github.com/adrianeboyd/ronec/blob/871218057abbc20ffcffcb3f6335eeed3b6f03bd/spacy/train-local-model/convert_conllubio.py

In good news, we'll have Romanian models with vectors trained on RONEC available for spacy v2.3.0 soon!

luminitavoicu commented 4 years ago

Hi, thank you for the quick reply!

Good news: I used the updated script you provided, and the unnecessary "*" in the train file is no longer an issue. Moreover, after running the spaCy converter on the train CoNLL-U file with the command python -m spacy convert train_ronec.conllu . --converter conllu, the NER tags appeared in the train JSON as well (before they were missing), so this sounds like progress to me.
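For anyone who wants to confirm that the NER tags really made it into the converted file, here is a minimal sketch that tallies them. The helper name count_ner_tags is hypothetical, and the nested layout (documents, then "paragraphs", "sentences", "tokens", with a per-token "ner" field) is my assumption about the spaCy v2 JSON training format, so adjust the keys if your file looks different.

```python
from collections import Counter
from typing import Any, Dict, List


def count_ner_tags(docs: List[Dict[str, Any]]) -> Counter:
    """Count the "ner" value of every token in the converted training data.

    Hypothetical helper; the key names are assumed, not taken from the thread.
    Tokens without a "ner" field are counted under "<missing>".
    """
    counts: Counter = Counter()
    for doc in docs:
        for paragraph in doc.get("paragraphs", []):
            for sentence in paragraph.get("sentences", []):
                for token in sentence.get("tokens", []):
                    counts[token.get("ner", "<missing>")] += 1
    return counts
```

Load the file with json.load and pass the result to count_ner_tags. If nearly every token lands under "<missing>" or "O", the entity column was probably lost during conversion.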

Unfortunately, the model still gets stuck during training.

adrianeboyd commented 4 years ago

Try running spacy debug-data on the data to see if there are any errors or warnings (add -p ner to get it to skip the tagger/parser analysis)?

luminitavoicu commented 4 years ago

I posted a similar issue on the RONEC repository as well: https://github.com/dumitrescustefan/ronec/issues/2 because I wasn't sure if this was a spacy problem or if there was a problem with the RONEC conversion script.

Fortunately, they updated their script and all the problems are now gone. Apparently, spacy modified the converter and the compatibility with the script was affected.

Thank you for all the help!

adrianeboyd commented 4 years ago

Glad to hear it's working!
