Hi, any updates on this issue?
@leezu was there any known inconsistency issue with the refactoring PR?
Dependencies:
ubuntu@ip-172-31-57-164:~$ pip3 list | grep mxnet
mxnet-cu100 1.6.0
ubuntu@ip-172-31-57-164:~$ pip3 list | grep torch
torch 1.5.0
ubuntu@ip-172-31-57-164:~$ pip3 list | grep fairseq
fairseq 0.9.0
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/src/xlmr.base/ --verbose
With commit 4e555394baf557e5e55e1ae24a2147b03dce2213:
torch.Size([514, 768])
stdev = 6.2218635e-07
stdev = 2.409921e-07
stdev = 2.4556107e-07
stdev = 2.0841426e-07
stdev = 2.5782302e-07
With commit 184a000:
torch.Size([514, 768])
stdev = 0.00013029973
stdev = 0.00016538663
*** Maximum errors for vector of size 4608: rtol=0.001, atol=0.001
1: Error 4.499456 Location of error: (0, 1, 152), a=-0.09637016, b=-0.10132553
2: Error 1.222937 Location of error: (0, 4, 152), a=0.10587245, b=0.10452169
stdev = 8.325626e-05
stdev = 0.00012943218
*** Maximum errors for vector of size 5376: rtol=0.001, atol=0.001
1: Error 1.608742 Location of error: (0, 4, 145), a=-0.08318494, b=-0.08493032
2: Error 1.293719 Location of error: (0, 5, 152), a=0.03806081, b=0.03671959
3: Error 1.101096 Location of error: (0, 4, 459), a=-0.71366501, b=-0.71555400
stdev = 8.076113e-05
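For reference, the Error values above appear consistent with the relative-violation metric used by MXNet's test utilities, err = |a - b| / (atol + rtol * |b|); a value above 1 means the tolerance check fails (e.g. the first entry above: 0.00495537 / 0.00110133 ≈ 4.4995). A minimal sketch of how such a report can be produced (the helper below is illustrative, not the conversion script's actual code):

```python
import numpy as np

def report_max_violations(a, b, rtol=1e-3, atol=1e-3, top_k=3):
    # err > 1 exactly when np.isclose(a, b, rtol=rtol, atol=atol) is False.
    err = np.abs(a - b) / (atol + rtol * np.abs(b))
    order = np.argsort(err, axis=None)[::-1][:top_k]
    for rank, flat_idx in enumerate(order, 1):
        loc = np.unravel_index(flat_idx, err.shape)
        print('%d: Error %f Location of error: %s, a=%.8f, b=%.8f'
              % (rank, err[loc], loc, a[loc], b[loc]))
```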
We can fix commit 184a000 as follows, but more bugs were introduced by later commits:
commit 2a964133c43516db5dc97ae320e80c337b0317e8 (HEAD -> fixfairseq, ec2/fixfairseq)
Author: Leonard Lausen <lausen@amazon.com>
Date: Thu Apr 30 07:02:55 2020 +0000
Fix layer_norm_eps in BERTEncoder
diff --git a/src/gluonnlp/model/bert.py b/src/gluonnlp/model/bert.py
index 27bea07d0..ec234c5b3 100644
--- a/src/gluonnlp/model/bert.py
+++ b/src/gluonnlp/model/bert.py
@@ -112,11 +112,12 @@ class BERTEncoder(HybridBlock, Seq2SeqEncoder):
self._output_attention = output_attention
self._output_all_encodings = output_all_encodings
self._dropout = dropout
+ self._layer_norm_eps = layer_norm_eps
with self.name_scope():
if dropout:
self.dropout_layer = nn.Dropout(rate=dropout)
- self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=1e-12)
+ self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=self._layer_norm_eps)
self.position_weight = self.params.get('position_weight', shape=(max_length, units),
init=weight_initializer)
self.transformer_cells = nn.HybridSequential()
@@ -344,7 +345,7 @@ class BERTModel(HybridBlock):
decoder = nn.HybridSequential(prefix=prefix)
decoder.add(nn.Dense(units, flatten=False))
decoder.add(GELU())
- decoder.add(nn.LayerNorm(in_channels=units, epsilon=1e-12))
+ decoder.add(nn.LayerNorm(in_channels=units, epsilon=self.encoder._layer_norm_eps))
decoder.add(nn.Dense(vocab_size, flatten=False, params=embed.collect_params()))
assert decoder[3].weight == list(embed.collect_params().values())[0], \
'The weights of word embedding are not tied with those of decoder'
With that fix, we get
stdev = 3.8439632e-07
stdev = 2.3018535e-07
stdev = 2.380526e-07
stdev = 2.1728458e-07
stdev = 5.5794425e-07
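For context on why the epsilon matters: RoBERTa/XLM-R checkpoints are trained with layer-norm eps=1e-5 (the PyTorch default), so hard-coding 1e-12 slightly perturbs every normalization, and the error compounds across layers. A small numpy illustration (not gluon-nlp code):

```python
import numpy as np

def layer_norm(x, eps):
    # Layer norm without learned scale/shift, for illustration only.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.RandomState(0).normal(0, 1, (4, 768)).astype('float32')
delta = layer_norm(x, eps=1e-5) - layer_norm(x, eps=1e-12)
print(np.abs(delta).max())  # small for a single layer, but applied 12+ times
```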
@eric-haibin-lin @kaonashi-tyc: @acphile helped investigate which commit introduced the regression (with the layer_norm_eps fix applied). He found that 75c29a3518ee42b98cc651b6922cbae85d2e961e is the first bad commit: the regression still occurs there even after the fix to src/gluonnlp/model/bert.py is applied.
Maybe because we switched to softmax with length, which assigns exactly 0 to the masked positions, whereas the older version assigned a number very close to 0...
Just checking back: is this issue resolved now?
We found a similar issue in the GluonNLP numpy version and arrived at this fix: https://github.com/apache/incubator-mxnet/pull/18827. Let's verify whether this patch fixes this problem too.
@barry-jin could you help test it with the latest MXNet and see if the issue persists?
@eric-haibin-lin Sure.
Dependencies:
ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102 1.6.0
ubuntu@ip-172-31-20-128:~$ pip3 list | grep torch
torch 1.6.0
ubuntu@ip-172-31-20-128:~$ pip3 list | grep fairseq
fairseq 0.9.0
Steps:
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
tar -xzvf xlmr.base.tar.gz
python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/tmp/xlmr.base/ --verbose
Output:
torch.Size([514, 768])
stdev = 0.19624774
*** Maximum errors for vector of size 3840: rtol=0.001, atol=0.001
1: Error 1004.471985 Location of error: (0, 3, 152), a=1.01276755, b=-1.85499024
2: Error 704.276367 Location of error: (0, 4, 588), a=1.44983697, b=7.28421259
3: Error 647.054993 Location of error: (0, 3, 465), a=-0.56997108, b=0.21840222
4: Error 600.453857 Location of error: (0, 0, 588), a=1.60323966, b=5.51549196
5: Error 538.982544 Location of error: (0, 2, 749), a=-0.70366699, b=-0.10700862
6: Error 522.929871 Location of error: (0, 2, 465), a=-0.22267099, b=0.62938124
7: Error 508.801910 Location of error: (0, 3, 467), a=0.50823474, b=-0.00115474
8: Error 508.074127 Location of error: (0, 2, 459), a=-0.04906178, b=-1.13256073
9: Error 489.671600 Location of error: (0, 2, 152), a=0.31114697, b=-0.34982315
10: Error 481.318054 Location of error: (0, 2, 145), a=1.19826555, b=0.48399293
stdev = 0.20739098
*** Maximum errors for vector of size 4608: rtol=0.001, atol=0.001
1: Error 1183.268433 Location of error: (0, 2, 588), a=2.24659586, b=-5.80201578
2: Error 1161.049805 Location of error: (0, 4, 588), a=2.24659729, b=-6.74044371
3: Error 928.264038 Location of error: (0, 1, 152), a=-1.12364161, b=-0.10132299
4: Error 601.523010 Location of error: (0, 2, 259), a=-0.30029136, b=0.75595784
5: Error 580.307129 Location of error: (0, 2, 145), a=-0.06280632, b=1.23304677
6: Error 566.850525 Location of error: (0, 3, 145), a=0.53026021, b=-0.08447501
7: Error 517.987793 Location of error: (0, 4, 145), a=-0.06280673, b=0.94433534
8: Error 485.135864 Location of error: (0, 2, 459), a=0.22280854, b=-0.50950789
9: Error 450.751129 Location of error: (0, 2, 12), a=0.06082506, b=0.93141055
10: Error 446.660492 Location of error: (0, 3, 465), a=-0.33150461, b=0.20811081
stdev = 0.18995003
*** Maximum errors for vector of size 3840: rtol=0.001, atol=0.001
1: Error 1085.388550 Location of error: (0, 2, 588), a=1.46543539, b=-4.45078850
2: Error 1066.772461 Location of error: (0, 1, 588), a=1.46543479, b=-5.97044420
3: Error 787.812927 Location of error: (0, 3, 152), a=0.77056408, b=-0.08129084
4: Error 655.578003 Location of error: (0, 2, 459), a=0.15703337, b=-1.44748211
5: Error 601.337708 Location of error: (0, 3, 459), a=0.07274741, b=-1.32591033
6: Error 526.234802 Location of error: (0, 1, 145), a=-0.01829262, b=1.07213926
7: Error 472.145905 Location of error: (0, 1, 459), a=0.15703396, b=-0.59696794
8: Error 416.252808 Location of error: (0, 3, 626), a=-0.33784640, b=0.13431576
9: Error 395.673187 Location of error: (0, 2, 221), a=0.00158666, b=-0.65210831
10: Error 373.908081 Location of error: (0, 2, 612), a=-0.22256255, b=0.24173050
stdev = 0.15333436
*** Maximum errors for vector of size 5376: rtol=0.001, atol=0.001
1: Error 1053.334717 Location of error: (0, 3, 588), a=1.30375504, b=-4.69525385
2: Error 891.753967 Location of error: (0, 2, 152), a=0.84285378, b=-0.45175028
3: Error 760.975037 Location of error: (0, 1, 145), a=0.68549037, b=-0.31580281
4: Error 739.728027 Location of error: (0, 4, 459), a=0.55348921, b=-0.71555465
5: Error 735.922119 Location of error: (0, 2, 459), a=0.12832984, b=-2.30080748
6: Error 682.559387 Location of error: (0, 4, 152), a=0.38436258, b=-0.93937868
7: Error 634.900940 Location of error: (0, 5, 152), a=0.69493502, b=0.03672028
8: Error 586.230469 Location of error: (0, 2, 465), a=-0.43865040, b=0.35667226
9: Error 583.885071 Location of error: (0, 3, 259), a=-0.20851725, b=0.90207750
10: Error 562.726746 Location of error: (0, 4, 145), a=0.52558923, b=-0.08492993
stdev = 0.18738568
*** Maximum errors for vector of size 3840: rtol=0.001, atol=0.001
1: Error 1129.763306 Location of error: (0, 2, 588), a=1.67255735, b=-4.18295527
2: Error 1094.432373 Location of error: (0, 3, 588), a=1.67261219, b=-6.12268019
3: Error 638.781616 Location of error: (0, 1, 459), a=0.08820118, b=-1.52423167
4: Error 575.069885 Location of error: (0, 1, 152), a=0.27381456, b=-0.70895278
5: Error 545.580383 Location of error: (0, 2, 259), a=-0.20532960, b=0.74875915
6: Error 454.918396 Location of error: (0, 1, 626), a=-0.42154688, b=0.06122302
7: Error 442.742065 Location of error: (0, 3, 259), a=-0.20532867, b=0.42603865
8: Error 399.461243 Location of error: (0, 1, 391), a=-0.13588569, b=0.43889853
9: Error 396.245575 Location of error: (0, 1, 221), a=0.37708509, b=-0.03173557
10: Error 390.489594 Location of error: (0, 2, 465), a=-0.22364777, b=0.27373093
@barry-jin thanks for sharing. What's the GluonNLP version?
@szha GluonNLP version is 0.9.1
ubuntu@ip-172-31-20-128:~$ pip3 list | grep gluonnlp
gluonnlp 0.9.1 /home/ubuntu/.local/lib/python3.6/site-packages
Thanks. Could you try the nightly builds on https://dist.mxnet.io/python ? Let's check whether the master and v1.x branches are fixed. You can install the nightly builds for 2.0 with
pip3 install --pre --upgrade mxnet-cu102 -f https://dist.mxnet.io/python
and 1.x with
pip3 install --pre --upgrade 'mxnet-cu102<2' -f https://dist.mxnet.io/python
I tried 2.0 as follows:
Dependency change: mxnet
ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102 2.0.0b20200806
Stack trace for error message:
ubuntu@ip-172-31-20-128:~$ python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/tmp/xlmr.base/ --verbose
Traceback (most recent call last):
File "convert_xlmr_model_to_gluon.py", line 13, in <module>
import gluonnlp as nlp
File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/__init__.py", line 24, in <module>
from . import loss
File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/loss/__init__.py", line 21, in <module>
from . import activation_regularizer, loss, label_smoothing
File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/loss/activation_regularizer.py", line 23, in <module>
from mxnet.gluon.loss import Loss
ModuleNotFoundError: No module named 'mxnet.gluon.loss'
This looks very weird. I can confirm that the loss module is there...
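One generic way to check which MXNet installation is actually being imported (a debugging suggestion, not something run in this thread):

```python
import mxnet
print(mxnet.__version__, mxnet.__file__)  # is a stale install shadowing the wheel?
import mxnet.gluon.loss                   # the failing import, in isolation
print(mxnet.gluon.loss.__file__)
```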
Output for v1.x:
Dependency:
ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102 1.7.0b20200807
Output:
torch.Size([514, 768])
stdev = 0.19625182
stdev = 0.20733927
stdev = 0.18992558
stdev = 0.15340993
stdev = 0.1872536
No error message, but stdev is still around 0.2
@kaonashi-tyc could you take a look at the result above on 1.7?
The stdev is still too large
Yeah, the stdev is still too large IMO. The previous deviations were all below 1e-5.
In my opinion, there may be something wrong with the versions prior to 4e55539.
In versions of bert.py prior to 4e55539, DotProductAttentionCell is used; in its attention-weight computation, the softmax is applied first and the mask afterwards. This differs from the attention mechanism described in d2l, where masked_softmax applies the mask first and the softmax second.
The current version of bert.py uses softmax with length; in its definition, the mask is applied first and the softmax second, which is consistent with d2l. So the current masked_softmax computation in bert.py is correct.
I'm not sure whether this is right, but while debugging I found that this affects the attention weights a lot, and it may have some impact on the final stdev.
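To make the difference concrete, here is a small numpy sketch of the two orderings (simplified stand-ins for the actual attention-cell implementations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.5, -3.0]])
mask = np.array([[1, 1, 1, 0]])  # last position is padding

# Old ordering (simplified): softmax first, then zero out masked weights;
# the surviving weights no longer sum to 1.
w_old = softmax(scores) * mask

# New ordering (softmax with length, simplified): mask the logits first,
# then softmax; masked positions get exactly 0 and the rest renormalize.
w_new = softmax(np.where(mask.astype(bool), scores, -1e30))

print(w_old.sum(axis=-1))  # < 1
print(w_new.sum(axis=-1))  # == 1
```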
I have a patch that fixes the incorrect bias term: https://github.com/dmlc/gluon-nlp/compare/master...eric-haibin-lin:fix-attn
Description
I am working on porting the XLM-R model to gluonnlp (see the conversion script convert_xlmr_model_to_gluon.py).
The XLM-R model reuses the same RoBERTa model architecture from fairseq.
When using gluonnlp 0.8.x, the converted model matches the fairseq reference implementation within a small error range (under 5e-6), and the std-dev is small (< 8e-7).
After updating gluonnlp to 0.9.x, the results no longer match: the std-dev increases to the 0.2+ range, which indicates drastically different output.
Error Message
To Reproduce
Steps to reproduce
Download the XLM-R model artifacts from the fairseq repo: https://github.com/pytorch/fairseq/tree/master/examples/xlmr
Install fairseq, gluonnlp, and sentencepiece via pip
Run the following command:
python convert_xlmr_model_to_gluon.py --ckpt_dir xlmr_ckpt/xlmr.base/ --verbose
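The verification side of the script boils down to extracting reference features from fairseq and comparing them with the converted model's output. A rough sketch of the fairseq half (the sentence and path are placeholders, and the actual script's details may differ):

```python
import torch
from fairseq.models.roberta import XLMRModel

# Load the reference model that the conversion is checked against.
xlmr = XLMRModel.from_pretrained('xlmr_ckpt/xlmr.base/', checkpoint_file='model.pt')
xlmr.eval()

tokens = xlmr.encode('Hello world!')      # sentencepiece tokenization to ids
with torch.no_grad():
    ref = xlmr.extract_features(tokens)   # (1, seq_len, 768) hidden states
print(ref.shape)
# The converted Gluon model is then run on the same token ids and the
# outputs are compared elementwise (see the stdev / max-error reports above).
```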
What have you tried to solve it?
I have notified @eric-haibin-lin, and we identified that the regression happens after this commit:
184a0007bc4165d5fe080a58dd3ff9bb413203a6
Further investigation will be needed to pinpoint the exact change causing this bug.
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: