Closed FChrispz closed 3 years ago
Hi @FChrispz ,
Thank you for using NLP-Cube. There seems to be an issue with the trainset file: the error means that NLP-Cube found a token line inside a sentence that does not have the required 10 columns. You should try validating your input data using the UD tools: https://github.com/UniversalDependencies/tools (specifically validate.py).
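The 10-column requirement can also be checked locally with a few lines of Python before running the full validator. This is only a minimal sketch of the most basic structural check (the helper name `find_bad_lines` is illustrative, not part of NLP-Cube or the UD tools):

```python
# Minimal structural check for a CoNLL-U file: every line that is not
# blank and does not start with '#' must contain exactly 10 tab-separated
# columns. This mirrors only the lowest-level check that validate.py does.
def find_bad_lines(path):
    """Return (line_number, column_count) for every malformed token line."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.rstrip("\n")
            if not stripped or stripped.startswith("#"):
                continue  # blank lines separate sentences; '#' lines are comments
            ncols = len(stripped.split("\t"))
            if ncols != 10:
                bad.append((lineno, ncols))
    return bad
```

Running it on the trainset (e.g. `find_bad_lines("trainSet.conllu")`) will list the offending line numbers, which is usually enough to locate the broken sentence.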
I would also recommend that, if you are trying to build your own model, you switch to the 3.0 branch of NLP-Cube. It is still under development, but it will give you a significant boost in both runtime performance and accuracy.
If you want, I can assist you in training the new pipeline.
Best, Tibi
Hi @tiberiu44 ,
Many thanks for your help! First of all, I will try to validate the datasets with the UD tools - but yes, it would be great to switch to the 3.0 branch of NLP-Cube!
Best,
Chris
Hi @tiberiu44 , sorry for the late reply. We tried to validate our test dataset using the UD validation tool, but there are too many errors. Could you please have a quick look at the logfile? We also tried to validate the UD file used to train NLP-Cube, but there are errors even with this file (I have attached this logfile as well). Thanks for your help.
No problem.
I don't really know how you can fix the errors in your file. Aside from the many reported UD standard violations (which will not actually impact NLP-Cube training), it is possible that you have one or more lines in your file that don't start with # and don't contain a valid sentence entry (missing TABs, missing columns, or something like that).
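Lines like that can be surfaced quickly from the command line. A sketch using awk (assuming the file is tab-separated, as CoNLL-U requires; the filename is the one from the training command above):

```shell
# Print the line number and column count of every non-comment, non-blank
# line that does not have exactly 10 tab-separated fields.
awk -F'\t' '!/^#/ && NF && NF != 10 { print NR ": " NF " columns" }' trainSet.conllu
```

An empty line has `NF == 0`, so sentence separators are skipped automatically; everything the script prints is a candidate for the error NLP-Cube reported.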
If this is an option for you, you can post the data here as an attachment and I can look over it. Otherwise, there is nothing else I can do.
Hi @tiberiu44 , It would be great if you can have a look at it. Thanks! Here is the file:
I took a quick look over your file. I spotted three issues (there could be more). Hope this helps.
Thanks, we will try to fix the datasets and will let you know.
Hi @FChrispz ,
Did you manage to fix the issue with the dataset? Also, we just released 3.0 officially.
I'm going to close this issue for now. If anything comes up, feel free to reopen it.
Hi, I am trying to train my own model of NLP-Cube for the Uyghur language. I carefully followed all your tutorials, and the TrainSet, DevSet and TestSet are all in CoNLL-U format.
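For reference, a well-formed CoNLL-U sentence consists of optional '#' comment lines, one line per token with exactly 10 tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and a terminating blank line. The tokens and tags below are placeholders, not real Uyghur annotation:

```
# sent_id = 1
# text = Salam !
1	Salam	salam	INTJ	_	_	0	root	_	_
2	!	!	PUNCT	_	_	1	punct	_	_

```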
When I run the command to train the default tokenizer, I get an error message, as you can read in the tokenizer.log file (attached below). I ran the following command:
python3 /home/chris/Documents/NLPCube/NLP-Cube/cube/main.py --train=tokenizer --train-file=/home/chris/Documents/NLPCube/My_Model/trainSet.conllu --dev-file=/home/chris/Documents/NLPCube/My_Model/devSet.conllu --raw-train-file=/home/chris/Documents/NLPCube/My_Model/trainSet_raw.txt --raw-dev-file=/home/chris/Documents/NLPCube/My_Model/devSet_raw.txt --embeddings /home/chris/Documents/NLPCube/wiki.ug.vec --store /home/chris/Documents/NLPCube/My_Model/tokenizer --batch-size 1000 --set-mem 8000 --autobatch --patience 20 &> /home/chris/Documents/NLPCube/My_Model/tokenizer.log
Information about my system:
Thanks in advance for your help, Chris.