Closed interlark closed 4 years ago
No idea. We used large BERT for en and multilingual BERT for ru ourselves and were able to train the models.
The assertion shows that the 44th sentence of the loaded CoNLL-U file has 5 words, but the contextualized embeddings have 6 words. You can look at the sentence in question, find out which number is correct, and work out why one of the scripts generated the wrong number.
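One way to look at the sentence in question is to pull it out of the CoNLL-U file by index. The sketch below is only an illustration and is not part of UDPipe-Future. It assumes that the index printed by the assertion is 0-based and that sentences are separated by blank lines, as the CoNLL-U format specifies. The helper name `nth_sentence` is hypothetical.

```python
def nth_sentence(conllu_text, index):
    """Return the lines of the index-th sentence block (0-based, an assumption)."""
    blocks = []
    current = []
    for line in conllu_text.splitlines():
        if line.strip():
            # Comment lines and token lines both belong to the current sentence.
            current.append(line)
        elif current:
            # A blank line closes the current sentence.
            blocks.append(current)
            current = []
    if current:
        # The file may not end with a blank line.
        blocks.append(current)
    return blocks[index]
```

For example, reading the text of the ru_syntagrus training file and printing `"\n".join(nth_sentence(text, 44))` should show the block to compare against the two reported lengths. If the index turns out to be 1-based, pass 43 instead.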
To reproduce the issue:
mkdir -p bert/models/en-base-uncased
cd bert/models/en-base-uncased
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip
mv wwm_uncased_L-24_H-1024_A-16/* ./
cd ../../../
mkdir elmo
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-dev.conllu elmo/en_ewt-ud-dev.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-test.conllu elmo/en_ewt-ud-test.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-train.conllu elmo/en_ewt-ud-train.npz
python3 ud_parser.py ud-2.2/en_ewt/en_ewt --elmo elmo/en_ewt-ud
Traceback (most recent call last):
  File "ud_parser.py", line 439, in <module>
    elmo=re.sub("(?=,|$)", "-train.npz", args.elmo) if args.elmo else None)
  File "/content/UDPipe-Future/ud_dataset.py", line 274, in __init__
    assert self._sentence_lens[i] == len(self._elmo[i]), "{} {} {}".format(i, self._sentence_lens[i], len(self._elmo[i]))
AssertionError: 61 13 14
Attempt with original BERT
mkdir -p bert/models/en-base-uncased
cd bert/models/en-base-uncased
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
mv uncased_L-12_H-768_A-12/* ./
cd ../../../
mkdir elmo
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-dev.conllu elmo/en_ewt-ud-dev.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-test.conllu elmo/en_ewt-ud-test.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-train.conllu elmo/en_ewt-ud-train.npz
python3 ud_parser.py ud-2.2/en_ewt/en_ewt --elmo elmo/en_ewt-ud
Traceback (most recent call last):
  File "ud_parser.py", line 439, in <module>
    elmo=re.sub("(?=,|$)", "-train.npz", args.elmo) if args.elmo else None)
  File "/content/UDPipe-Future/ud_dataset.py", line 274, in __init__
    assert self._sentence_lens[i] == len(self._elmo[i]), "{} {} {}".format(i, self._sentence_lens[i], len(self._elmo[i]))
AssertionError: 61 13 14
Maybe I'm doing something wrong?
It may be of interest to you that the same operations on the UniMorph dataset in sigmorphon2019 work fine, without any assertion failures.
mkdir -p bert/models/en-base-uncased
cd bert/models/en-base-uncased
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
mv uncased_L-12_H-768_A-12/* ./
cd ../../../
mkdir elmo
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base data/en_ewt/en_ewt-um-dev.conllu elmo/en_ewt-um-dev.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base data/en_ewt/en_ewt-um-covered-test.conllu elmo/en_ewt-um-covered-test.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base data/en_ewt/en_ewt-um-train.conllu elmo/en_ewt-um-train.npz
python3 um_tagger.py ud-2.2/en_ewt/en_ewt --elmo data/en_ewt-um
However, sigmorphon2019's um_dataset.py is much simpler than UDPipe-Future's ud_dataset.py.
Sentence 61 of en_ewt v2.2 contains an x.y ID:
# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1 Over over ADV RB _ 2 advmod 2:advmod _
2 300 300 NUM CD NumType=Card 3 nummod 3:nummod _
3 Iraqis Iraqis PROPN NNPS Number=Plur 5 nsubj:pass 5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4 are be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 5 aux:pass 5:aux:pass _
5 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
6 dead dead ADJ JJ Degree=Pos 5 xcomp 5:xcomp _
7 and and CCONJ CC _ 8 cc 8:cc|8.1:cc _
8 500 500 NUM CD NumType=Card 5 conj 5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass _ _ 5:conj:and CopyOf=5
9 wounded wounded ADJ JJ Degree=Pos 8 orphan 8.1:xcomp _
10 in in ADP IN _ 11 case 11:case _
11 Fallujah Fallujah PROPN NNP Number=Sing 5 obl 5:obl:in _
12 alone alone ADV RB _ 11 advmod 11:advmod SpaceAfter=No
13 . . PUNCT . _ 5 punct 5:punct _
Note that token 8.1 is not part of the parse tree (head _ and label _). UDPipe-Future probably skips it. Check that conllu_bert_embeddings.py handles x.y IDs the same way.
Nice catch. For some reason, the conllu_bert_embeddings.py in the repository was an older version that did not ignore enhanced nodes or multi-word tokens. It should be fixed in 70a881785.
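To make the 13 versus 14 mismatch concrete, here is a minimal sketch of the kind of ID filtering being discussed. It is not the code from 70a881785, and the function names are made up for illustration. It counts only lines whose ID is a plain integer, which skips x.y nodes and x-y multi-word token ranges, and compares that with a naive count of every token line.

```python
def count_syntactic_words(sentence_lines):
    """Count CoNLL-U lines whose ID is a plain integer."""
    count = 0
    for line in sentence_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        token_id = line.split()[0]
        if token_id.isdigit():  # "8.1" and "1-2" are not all digits
            count += 1
    return count


def count_all_token_lines(sentence_lines):
    """Naive count: every non-comment line, including x.y and x-y IDs."""
    return sum(1 for line in sentence_lines
               if line.strip() and not line.startswith("#"))


# Only the ID column of sentence 61 above; the other columns are replaced by "_".
ids = ["1", "2", "3", "4", "5", "6", "7", "8", "8.1",
       "9", "10", "11", "12", "13"]
sentence = ["# sent_id = example"] + ["{}\t_".format(i) for i in ids]

print(count_syntactic_words(sentence), count_all_token_lines(sentence))
```

On these IDs the two counts are 13 and 14, which matches the numbers in the assertion message if the embedding script counted the 8.1 line and the dataset loader did not.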
I tried to train en_ewt with BERT-Base-Large-Uncased and ru_syntagrus with BERT-Base-Multilingual-Uncased. I used this script to obtain npz embeddings for the train, dev and test datasets. Then, during training with --elmo, loading the datasets fails an assertion on the 44th (ru_syntagrus) and 60th (en_ewt) sentence here: https://github.com/CoNLL-UD-2018/UDPipe-Future/blob/574f06acfdf090d66334d0d41a4a5315aa8933d3/ud_dataset.py#L273-L274
Any ideas why that could happen?