CoNLL-UD-2018 / UDPipe-Future

CoNLL 2018 Shared Task Team UDPipe-Future
Mozilla Public License 2.0

Assertions using BERT embeddings #16

Closed: interlark closed this issue 4 years ago

interlark commented 4 years ago

I tried to train en_ewt with BERT-Large-Uncased and ru_syntagrus with BERT-Base-Multilingual-Uncased. I used this script to obtain npz embeddings for the train, dev, and test datasets. Then, during training with --elmo, loading the datasets triggers assertions on the 44th (ru_syntagrus) and 60th (en_ewt) sentences here: https://github.com/CoNLL-UD-2018/UDPipe-Future/blob/574f06acfdf090d66334d0d41a4a5315aa8933d3/ud_dataset.py#L273-L274 like

44 5 6

Any ideas why that could happen?

foxik commented 4 years ago

No idea. We used large BERT for en and multilingual BERT for ru ourselves and were able to train the models.

The assertion shows that the 44th sentence of the loaded CoNLL-U file has 5 words, but the contextualized embeddings have 6 -- you can look at the sentence in question, find out which number is correct, and see why one of the scripts produced the wrong one.
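
A quick way to locate such a mismatch is sketched below. This is a diagnostic sketch only, not part of UDPipe-Future; it assumes the .npz stores one array per sentence in file order (numpy.savez's default keys are arr_0, arr_1, ...) and that the parser counts only plain integer word IDs.

import sys
import numpy as np

# Usage: python3 check_lengths.py FILE.conllu FILE.npz
conllu_path, npz_path = sys.argv[1], sys.argv[2]

# Count word lines per sentence, skipping comments, multi-word
# token ranges (IDs like "3-4") and empty nodes (IDs like "8.1").
sentence_lens, words = [], 0
with open(conllu_path, encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            if words:
                sentence_lens.append(words)
                words = 0
        elif not line.startswith("#"):
            token_id = line.split("\t", 1)[0]
            if token_id.isdigit():
                words += 1
if words:
    sentence_lens.append(words)

# Assumption: the .npz holds one 2-D array per sentence, stored in file order.
with np.load(npz_path) as data:
    embedding_lens = [len(data[key]) for key in data.files]

for i, (n, m) in enumerate(zip(sentence_lens, embedding_lens)):
    if n != m:
        print("sentence", i, "has", n, "words but", m, "embedding rows")

The first line it prints identifies the sentence to inspect by hand, in the same "index, word count, embedding count" order as the assertion message.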

interlark commented 4 years ago

To reproduce the issue:

mkdir -p bert/models/en-base-uncased
cd bert/models/en-base-uncased
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip
mv wwm_uncased_L-24_H-1024_A-16/* ./
cd ../../../
mkdir elmo
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-dev.conllu elmo/en_ewt-ud-dev.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-test.conllu elmo/en_ewt-ud-test.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-train.conllu elmo/en_ewt-ud-train.npz

python3 ud_parser.py ud-2.2/en_ewt/en_ewt --elmo elmo/en_ewt-ud

Traceback (most recent call last):
  File "ud_parser.py", line 439, in <module>
    elmo=re.sub("(?=,|$)", "-train.npz", args.elmo) if args.elmo else None)
  File "/content/UDPipe-Future/ud_dataset.py", line 274, in __init__
    assert self._sentence_lens[i] == len(self._elmo[i]), "{} {} {}".format(i, self._sentence_lens[i], len(self._elmo[i]))
AssertionError: 61 13 14

An attempt with the original BERT (Base, Uncased):

mkdir -p bert/models/en-base-uncased
cd bert/models/en-base-uncased
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
mv uncased_L-12_H-768_A-12/* ./
cd ../../../
mkdir elmo
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-dev.conllu elmo/en_ewt-ud-dev.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-test.conllu elmo/en_ewt-ud-test.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base ud-2.2/en_ewt/en_ewt-ud-train.conllu elmo/en_ewt-ud-train.npz

python3 ud_parser.py ud-2.2/en_ewt/en_ewt --elmo elmo/en_ewt-ud

Traceback (most recent call last):
  File "ud_parser.py", line 439, in <module>
    elmo=re.sub("(?=,|$)", "-train.npz", args.elmo) if args.elmo else None)
  File "/content/UDPipe-Future/ud_dataset.py", line 274, in __init__
    assert self._sentence_lens[i] == len(self._elmo[i]), "{} {} {}".format(i, self._sentence_lens[i], len(self._elmo[i]))
AssertionError: 61 13 14

Maybe I'm doing something wrong?

interlark commented 4 years ago

In case it is of interest: the same operations for the UniMorph dataset in sigmorphon2019 work fine, without any assertion failures.

mkdir -p bert/models/en-base-uncased
cd bert/models/en-base-uncased
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
mv uncased_L-12_H-768_A-12/* ./
cd ../../../
mkdir elmo
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base data/en_ewt/en_ewt-um-dev.conllu elmo/en_ewt-um-dev.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base data/en_ewt/en_ewt-um-covered-test.conllu elmo/en_ewt-um-covered-test.npz
python3 embeddings/bert/conllu_bert_embeddings.py --language en --size base data/en_ewt/en_ewt-um-train.conllu elmo/en_ewt-um-train.npz

python3 um_tagger.py ud-2.2/en_ewt/en_ewt --elmo data/en_ewt-um

Then again, sigmorphon2019's um_dataset.py is much simpler than UDPipe-Future's ud_dataset.py.

jowagner commented 4 years ago

Sentence 61 of en_ewt v2.2 contains an x.y ID:

# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1       Over    over    ADV     RB      _       2       advmod  2:advmod        _
2       300     300     NUM     CD      NumType=Card    3       nummod  3:nummod        _
3       Iraqis  Iraqis  PROPN   NNPS    Number=Plur     5       nsubj:pass      5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4       are     be      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        5       aux:pass        5:aux:pass      _
5       reported        report  VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     0       root    0:root  _
6       dead    dead    ADJ     JJ      Degree=Pos      5       xcomp   5:xcomp _
7       and     and     CCONJ   CC      _       8       cc      8:cc|8.1:cc     _
8       500     500     NUM     CD      NumType=Card    5       conj    5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1     reported        report  VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     _       _       5:conj:and      CopyOf=5
9       wounded wounded ADJ     JJ      Degree=Pos      8       orphan  8.1:xcomp       _
10      in      in      ADP     IN      _       11      case    11:case _
11      Fallujah        Fallujah        PROPN   NNP     Number=Sing     5       obl     5:obl:in        _
12      alone   alone   ADV     RB      _       11      advmod  11:advmod       SpaceAfter=No
13      .       .       PUNCT   .       _       5       punct   5:punct _

Note that token 8.1 is not part of the basic parse tree (its HEAD and DEPREL columns are both "_"). UDPipe-Future probably skips it. Check that conllu_bert_embeddings.py handles x.y IDs the same way.
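
To illustrate the counting rule at issue (a hypothetical helper, not code from either repository): if only plain integer word IDs are kept, the sentence above has 13 words; keeping the 8.1 empty node as well gives 14, which matches the "61 13 14" assertion message.

def parser_word_count(sentence_lines):
    # Count words the way the parser does: keep plain integer IDs,
    # skip comments, empty nodes ("8.1") and multi-word ranges ("3-4").
    count = 0
    for line in sentence_lines:
        if line and not line.startswith("#"):
            token_id = line.split("\t", 1)[0]
            if token_id.isdigit():
                count += 1
    return count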

foxik commented 4 years ago

Nice catch. For some reason, the conllu_bert_embeddings.py in the repository was an older version that did not ignore empty (enhanced-dependency) nodes and multi-word tokens. Should be fixed in 70a881785.
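
For intuition only, the behavior the fix describes amounts to embedding exactly the syntactic words the parser sees. The sketch below is an illustration of that rule, not the actual change in 70a881785.

def words_to_embed(conllu_lines):
    # Yield the FORM of each syntactic word, skipping comments,
    # empty nodes ("8.1") and multi-word token ranges ("3-4"),
    # so the number of embedded words matches what ud_dataset.py expects.
    for line in conllu_lines:
        if line and not line.startswith("#"):
            columns = line.split("\t")
            if columns[0].isdigit():
                yield columns[1]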