bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

iterations in parser training do not start #105

Closed toufiglu closed 1 year ago

toufiglu commented 1 year ago

Hi, I was training udpipe on a large treebank and the process halted partway through: the parser iterations never started. The code follows the example at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-train.html#example:

m_nyuad <- udpipe_train(file = "nyuad.udpipe",
                        files_conllu_training = train_nyuad,
                        files_conllu_holdout = dev_nyuad,
                        annotation_tokenizer = "default",
                        annotation_tagger = "default",
                        annotation_parser = "default")

The console showed the following:

Training tokenizer with the following options: tokenize_url=1, allow_spaces=0, dimension=24
  epochs=100, batch_size=50, segment_size=50, learning_rate=0.0050, learning_rate_final=0.0000
  dropout=0.1000, early_stopping=1
Epoch 1, logprob: -5.6978e+04, training acc: 96.00%, heldout tokens: 99.99%P/100.00%R/100.00%, sentences: 72.81%P/74.82%R/73.80%
Epoch 2, logprob: -6.4166e+03, training acc: 99.58%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 75.92%P/73.51%R/74.70%
Epoch 3, logprob: -5.6153e+03, training acc: 99.61%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.66%P/75.73%R/76.19%
Epoch 4, logprob: -5.3968e+03, training acc: 99.64%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.87%P/75.98%R/76.42%
Epoch 5, logprob: -5.1847e+03, training acc: 99.64%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 74.90%P/75.13%R/75.01%
Epoch 6, logprob: -5.1285e+03, training acc: 99.64%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 75.87%P/75.53%R/75.70%
Epoch 7, logprob: -4.9165e+03, training acc: 99.66%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.46%P/74.57%R/75.50%
Epoch 8, logprob: -4.9355e+03, training acc: 99.65%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 75.74%P/75.63%R/75.69%
Epoch 9, logprob: -4.7308e+03, training acc: 99.66%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.36%P/75.63%R/75.99%
Epoch 10, logprob: -4.7043e+03, training acc: 99.66%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.33%P/76.08%R/76.70%
Epoch 11, logprob: -4.4821e+03, training acc: 99.67%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.46%P/75.73%R/76.09%
Epoch 12, logprob: -4.5385e+03, training acc: 99.67%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.45%P/75.83%R/76.14%
Epoch 13, logprob: -4.6376e+03, training acc: 99.67%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.30%P/73.77%R/75.01%
Epoch 14, logprob: -4.3849e+03, training acc: 99.69%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.32%P/76.03%R/76.67%
Epoch 15, logprob: -4.5771e+03, training acc: 99.66%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.32%P/75.88%R/76.59%
Epoch 16, logprob: -4.6663e+03, training acc: 99.66%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.54%P/75.73%R/76.13%
Epoch 17, logprob: -4.2048e+03, training acc: 99.70%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.38%P/75.63%R/76.50%
Epoch 18, logprob: -4.4208e+03, training acc: 99.67%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 75.72%P/72.41%R/74.03%
Epoch 19, logprob: -4.3628e+03, training acc: 99.68%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 71.61%P/73.92%R/72.75%
Epoch 20, logprob: -4.2361e+03, training acc: 99.69%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.47%P/74.62%R/75.54%
Epoch 21, logprob: -4.3489e+03, training acc: 99.68%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.85%P/75.73%R/76.29%
Epoch 22, logprob: -4.3736e+03, training acc: 99.68%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 74.97%P/74.97%R/74.97%
Epoch 23, logprob: -4.2592e+03, training acc: 99.70%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.68%P/75.83%R/76.25%
Epoch 24, logprob: -4.1691e+03, training acc: 99.69%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.30%P/75.63%R/76.46%
Epoch 25, logprob: -4.1167e+03, training acc: 99.70%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.44%P/75.33%R/75.88%
Epoch 26, logprob: -3.9812e+03, training acc: 99.70%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 74.79%P/74.97%R/74.88%
Epoch 27, logprob: -4.2165e+03, training acc: 99.68%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.03%P/74.12%R/75.06%
Epoch 28, logprob: -4.3696e+03, training acc: 99.67%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 75.15%P/75.08%R/75.11%
Epoch 29, logprob: -4.1487e+03, training acc: 99.69%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 75.04%P/75.38%R/75.21%
Epoch 30, logprob: -4.0148e+03, training acc: 99.71%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.60%P/74.67%R/76.11%
Epoch 31, logprob: -3.9268e+03, training acc: 99.71%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.10%P/75.28%R/76.18%
Epoch 32, logprob: -4.2483e+03, training acc: 99.70%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.66%P/73.41%R/75.00%
Epoch 33, logprob: -3.9703e+03, training acc: 99.71%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.87%P/75.13%R/75.99%
Epoch 34, logprob: -4.1486e+03, training acc: 99.69%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.62%P/75.58%R/76.10%
Epoch 35, logprob: -3.8205e+03, training acc: 99.71%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.14%P/76.18%R/76.16%
Epoch 36, logprob: -4.0144e+03, training acc: 99.71%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.34%P/72.51%R/74.84%
Epoch 37, logprob: -3.8877e+03, training acc: 99.72%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.58%P/76.08%R/76.33%
Epoch 38, logprob: -4.2159e+03, training acc: 99.69%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.83%P/74.07%R/75.90%
Epoch 39, logprob: -3.9850e+03, training acc: 99.71%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.79%P/75.63%R/76.20%
Epoch 40, logprob: -4.0668e+03, training acc: 99.70%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 77.02%P/74.77%R/75.88%
Epoch 41, logprob: -4.0416e+03, training acc: 99.70%, heldout tokens: 100.00%P/100.00%R/100.00%, sentences: 76.55%P/75.63%R/76.09%
Stopping after 30 iterations of not improving sum of sentence and token f1.
Choosing parameters from epoch 10.
Tagger model 1 columns: lemma use=1/provide=1, xpostag use=1/provide=1, feats use=1/provide=1
Creating morphological dictionary for tagger model 1.
Tagger model 1 dictionary options: max_form_analyses=0, custom dictionary_file=none
Tagger model 1 guesser options: suffix_rules=8, prefixes_max=4, prefix_min_count=10, enrich_dictionary=6

I did not get any error message, so I don't know what to fix (sorry, I'm a beginner in R and UD). I checked the vignette linked above, but this kind of situation is not mentioned there. Thanks to anyone who can help!

jwijffels commented 1 year ago

If there is an error, you get the message in the resulting object, which in your case is m_nyuad. Can you print it out?

annotation_tokenizer = "default",
annotation_tagger = "default",
annotation_parser = "default"

Additionally, annotation_tokenizer, annotation_tagger and annotation_parser all need to be lists with training arguments, as shown in the example at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-train.html#example. It could also be that your training data is not well formatted, e.g. it has too many word-form variants for the same lemma, or it does not contain upos/xpos/lemma annotations.
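
For reference, a minimal sketch of such a call in the spirit of the vignette example (the option values below are only illustrative, not tuned for the NYUAD treebank):

library(udpipe)
m_nyuad <- udpipe_train(file = "nyuad.udpipe",
                        files_conllu_training = train_nyuad,
                        files_conllu_holdout = dev_nyuad,
                        # tokenizer options passed as a list; values are illustrative only
                        annotation_tokenizer = list(dimension = 16, epochs = 1,
                                                    batch_size = 100, dropout = 0.1,
                                                    early_stopping = 1),
                        # setting iterations to 1 makes a failing step show up quickly
                        annotation_tagger = list(iterations = 1, early_stopping = 1),
                        annotation_parser = list(iterations = 1))
# printing the result shows any error message returned by the training routine
m_nyuad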

toufiglu commented 1 year ago

Hi! Thanks so much for answering! I will try to tinker with the training arguments. Regarding m_nyuad, the resulting object has the following contents:

m_nyuad[["file"]]
[1] "nyuad.udpipe"
> m_nyuad[["model"]]
<pointer: 0x0>

It does not indicate any errors. The training data actually worked with other parsers, such as spaCy and Stanza, but I will validate it further. Thank you!

jwijffels commented 1 year ago

The quickest way to see possible errors is to reduce the iterations parameter to make sure your code runs. Can you share the data you are using to train the model?

toufiglu commented 1 year ago

Apologies for the delay; it was the start of the term. I set the iteration parameters for the tagger and the parser, as is done in https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-train.html#example, and ran it again. Both taggers can be run now, but the parser still does not execute. The console shows this:

Training tokenizer with the following options: tokenize_url=1, allow_spaces=0, dimension=24
  epochs=1, batch_size=100, segment_size=50, learning_rate=0.0050, learning_rate_final=0.0000
  dropout=0.1000, early_stopping=1
Epoch 1, logprob: -1.6863e+05, training acc: 88.39%, heldout tokens: 99.92%P/99.96%R/99.94%, sentences: 70.74%P/73.41%R/72.05%
Choosing parameters from epoch 1.
Tagger model 1 columns: lemma use=0/provide=0, xpostag use=1/provide=1, feats use=1/provide=1
Creating morphological dictionary for tagger model 1.
Tagger model 1 dictionary options: max_form_analyses=0, custom dictionary_file=none
Tagger model 1 guesser options: suffix_rules=8, prefixes_max=0, prefix_min_count=10, enrich_dictionary=6
Tagger model 1 options: iterations=1, early_stopping=1, templates=tagger
Training tagger model 1.
Iteration 1: done, accuracy 89.60%, heldout accuracy 51.60%t/100.00%l/51.60%b
Chosen tagger model from iteration 1
Tagger model 2 columns: lemma use=1/provide=1, xpostag use=0/provide=0, feats use=0/provide=0
Creating morphological dictionary for tagger model 2.
Tagger model 2 dictionary options: max_form_analyses=0, custom dictionary_file=none
Tagger model 2 guesser options: suffix_rules=6, prefixes_max=4, prefix_min_count=10, enrich_dictionary=4

And in evaluation, I got:

> # evaluate nyuad 
> m_nyuad <- udpipe_load_model ("nyuad.udpipe") 
> goodness_of_fit <- udpipe_accuracy(m_nyuad, test_nyuad, tokenizer = "default", tagger = "default", parser = "default")
Error in udp_evaluate(object$model, file_conllu, f, tokenizer, tagger,  : 
  external pointer is not valid

The udpipe file for this model is still empty. I understand that only the parser did not work. I am really sorry for the long delay. If it is still possible, may I send you an email with the dataset? The NYUAD treebank is listed on the UD website, but it requires purchasing several treebanks from the Linguistic Data Consortium.

Thanks!

jwijffels commented 1 year ago

Yes, go ahead. You can find my email address in the DESCRIPTION file.

toufiglu commented 1 year ago

Hi, I just sent it! Thanks!

jwijffels commented 1 year ago

Thanks for the data. The training stops at the lemmatizer. This is because your training data contains too many word-form variants for the same lemma, namely for the lemmata DEFAULT, None and TBupdate, which have respectively 1238, 953 and 849 unique word forms.
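
You can see this yourself with a quick check on the training file (a rough sketch, assuming train_nyuad is the path to your CoNLL-U training data):

library(udpipe)
x <- udpipe_read_conllu(train_nyuad)
# count the number of distinct word forms observed for each lemma
variants <- tapply(x$token, x$lemma, FUN = function(forms) length(unique(forms)))
head(sort(variants, decreasing = TRUE), 10)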

Either drop the lemmatizer from the training or fix the lemma annotations in your training data so that the number of word forms for each lemma stays small. If you print out the result of the call to udpipe_train, you will see that it returns an error message which looks like 'Should encode value 505 in one byte!'. This is the same error as here: https://github.com/ufal/udpipe/issues/130
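
Dropping the lemmatizer could look roughly like the following sketch (untested on your data; it assumes the UDPipe tagger training options use_lemma_1 and provide_lemma_1 to switch the lemma column off for a single tagger model):

m_nyuad <- udpipe_train(file = "nyuad.udpipe",
                        files_conllu_training = train_nyuad,
                        files_conllu_holdout = dev_nyuad,
                        annotation_tokenizer = "default",
                        # train a single tagger model that neither uses nor provides lemmas
                        annotation_tagger = list(models = 1, templates_1 = "tagger",
                                                 use_lemma_1 = 0, provide_lemma_1 = 0),
                        annotation_parser = "default")
# print the result to check for error messages from the training routine
m_nyuad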

toufiglu commented 1 year ago

Thank you so much, Jan!