adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
http://opensource.adobe.com/NLP-Cube/index.html
Apache License 2.0

Error while training a model #122

Closed FChrispz closed 3 years ago

FChrispz commented 3 years ago

Hi, I am trying to train my own NLP-Cube model for the Uyghur language. I carefully followed all your tutorials, and the TrainSet, DevSet and TestSet are all in CONLL-U format.

When I run the command to train the default tokenizer, I get an error message, as you can see in the tokenizer.log file (attached below). The command I run is:

python3 /home/chris/Documents/NLPCube/NLP-Cube/cube/main.py --train=tokenizer --train-file=/home/chris/Documents/NLPCube/My_Model/trainSet.conllu --dev-file=/home/chris/Documents/NLPCube/My_Model/devSet.conllu --raw-train-file=/home/chris/Documents/NLPCube/My_Model/trainSet_raw.txt --raw-dev-file=/home/chris/Documents/NLPCube/My_Model/devSet_raw.txt --embeddings /home/chris/Documents/NLPCube/wiki.ug.vec --store /home/chris/Documents/NLPCube/My_Model/tokenizer --batch-size 1000 --set-mem 8000 --autobatch --patience 20 &> /home/chris/Documents/NLPCube/My_Model/tokenizer.log

Information about my system:

Thanks in advance for your help, Chris.

tiberiu44 commented 3 years ago

Hi @FChrispz ,

Thank you for using NLP-Cube. There seems to be an issue with the trainset file: the error means that NLP-Cube found a sentence entry that does not have 10 columns. You should try validating your input data with the UD tools: https://github.com/UniversalDependencies/tools (specifically validate.py).
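For reference, a typical invocation would look something like the line below (the exact flags can vary between versions of the tools, so check python3 validate.py --help first):

python3 tools/validate.py --lang ug /home/chris/Documents/NLPCube/My_Model/trainSet.conllu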

I would also recommend that, if you are building your own model, you switch to the 3.0 branch of NLP-Cube. It is still under development, but it gives a substantial boost in both runtime performance and accuracy.
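If you decide to switch, getting the development code is the usual git workflow (assuming the branch is literally named 3.0 on GitHub):

git clone https://github.com/adobe/NLP-Cube.git
cd NLP-Cube
git checkout 3.0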

If you want, I can assist you in training the new pipeline.

Best, Tibi

FChrispz commented 3 years ago

Hi @tiberiu44 ,

Many thanks for your help! First of all, I will try to validate the datasets with the UD tools - but yes, it would be great to switch to the 3.0 branch of NLP-Cube!

Best,

Chris

FChrispz commented 3 years ago

Hi @tiberiu44 , sorry for the late reply. We tried to validate our test dataset using the UD validation tool, but there are too many errors. Could you please have a quick look at the logfile? We also tried to validate the UD file used to train NLP-Cube, and there are errors even with this file (I attach that logfile as well). Thanks for your help.

validateTestData_Log.txt

validateUDFile_Log.txt

tiberiu44 commented 3 years ago

No problem.

I don't really know how you can fix the errors in your file. Aside from the many reported UD standard violations (which will not actually affect NLP-Cube training), it is possible that your file contains one or more lines that do not start with # and do not form a valid sentence entry (missing tabs, missing columns or something like that).
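If you want to locate such lines yourself first, a small check along these lines (the file name is just a placeholder - point it at your own data) will print every line that is neither blank, nor a comment, nor a 10-column entry:

# Print lines that are not blank, not comments and not 10 tab-separated columns.
# "testSet.conllu" is only an example path.
with open("testSet.conllu", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.rstrip("\n")
        if line == "" or line.startswith("#"):
            continue
        if len(line.split("\t")) != 10:
            print(lineno, repr(line))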

If this is an option for you, you can post the data here as an attachment and I can look over it. Otherwise, there is nothing else I can do.

FChrispz commented 3 years ago

Hi @tiberiu44 , It would be great if you can have a look at it. Thanks! Here is the file:

testSet.zip

tiberiu44 commented 3 years ago

I took a quick look over your file. I spotted three issues (there could be more).

  1. You need a blank line between every two sentences (you have many concatenated sentences - see lines 609, 618, 698 etc.)
  2. Blank tokens are not permitted (line 1687, line 2047 etc.) - a rough check for this and the previous point is sketched after this list.
  3. Every CONLL-U entry must have 10 columns (separated by TAB).
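A rough, illustrative check for the first two points (again, the file name is only an example) could look like this:

# Flag token IDs that restart without a preceding blank line (concatenated sentences)
# and tokens whose FORM column is empty. Illustrative sketch only.
prev_id = 0
with open("testSet.conllu", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.rstrip("\n")
        if line == "":
            prev_id = 0  # a blank line properly ends the current sentence
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        try:
            cur_id = int(cols[0].split("-")[0].split(".")[0])
        except ValueError:
            continue  # non-numeric ID column - that falls under point 3
        if cur_id < prev_id:
            print(lineno, "token IDs restart here - missing blank line between sentences?")
        if len(cols) > 1 and not cols[1].strip():
            print(lineno, "empty FORM column (blank token)")
        prev_id = cur_id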

Hope this helps.

FChrispz commented 3 years ago

Thanks, we will try to fix the datasets and will let you know.

tiberiu44 commented 3 years ago

Hi @FChrispz ,

Did you manage to fix the issue with the dataset? Also, we just released 3.0 officially.

tiberiu44 commented 3 years ago

I'm going to close this issue for now. If anything comes up, feel free to reopen it.