GluonNLP 0.9.x RoBERTa model return inconsistent encoded result comparing to Fairseq

kaonashi-tyc commented 4 years ago

Description

I am working on porting XLM-R model to gluonnlp (view this script here convert_xlmr_model_to_gluon.py).

The XLM-R model reuses the same RoBERTa model architecture from fairseq.

When using gluonnlp 0.8.x, the converted model can match the fairseq reference implementation with a small error range within 5e-6, and the std-dev is small (< 8e-7)

Once updated gluonnlp to 0.9.x, the result no longer matches, and std-dev increases to 0.2+ ranges, which indicates drastically different output.

Error Message

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1004.472290  Location of error: (0, 3, 152), a=1.01276851, b=-1.85498929
  2: Error 704.275513  Location of error: (0, 4, 588), a=1.44983745, b=7.28419113
  3: Error 647.055481  Location of error: (0, 3, 465), a=-0.56997168, b=0.21840204
  4: Error 600.452820  Location of error: (0, 0, 588), a=1.60323822, b=5.51547194
  5: Error 538.981934  Location of error: (0, 2, 749), a=-0.70366722, b=-0.10700922
  6: Error 522.929871  Location of error: (0, 2, 465), a=-0.22267081, b=0.62938148
  7: Error 508.802521  Location of error: (0, 3, 467), a=0.50823522, b=-0.00115502
  8: Error 508.074951  Location of error: (0, 2, 459), a=-0.04906127, b=-1.13256359
  9: Error 489.669586  Location of error: (0, 2, 152), a=0.31114388, b=-0.34982380
 10: Error 481.314819  Location of error: (0, 2, 145), a=1.19826353, b=0.48399478
Traceback (most recent call last):
  File "xlmr_conversion/convert_vocab.py", line 198, in <module>
    check_output('Hello world!')
  File "xlmr_conversion/convert_vocab.py", line 192, in check_output
    mx.test_utils.assert_almost_equal(mx_output.asnumpy(), torch_output, atol=1e-3, rtol=1e-3)
  File "/efs/users/tiayuche/mxnet16/lib/python3.6/site-packages/mxnet/test_utils.py", line 627, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1004.472290 exceeds tolerance rtol=1.000000e-03, atol=1.000000e-03 (mismatch at least 0.286458%).
Location of maximum error: (0, 3, 152), a=1.01276851, b=-1.85498929
 ACTUAL: array([[[ 0.09790944,  0.08255608,  0.01538194, ..., -0.20328012,
          0.02295702,  0.06142968],
        [-0.13235903,  0.02904704, -0.09925546, ..., -0.03289408,...
 DESIRED: array([[[ 0.28058335,  0.18484427,  0.09018373, ..., -0.2651089 ,
          0.10111443, -0.05324648],
        [-0.05585899,  0.07528387, -0.04672416, ...,  0.12345381,...

To Reproduce

Steps to reproduce

(Paste the commands you ran that produced the error.)

download the xlm-r model artifacts from fairseq repo: https://github.com/pytorch/fairseq/tree/master/examples/xlmr
Install both fairseq and gluonnlp and sentencepieces via pip
run the following command

python convert_xlmr_model_to_gluon.py --ckpt_dir xlmr_ckpt/xlmr.base/ --verbose

What have you tried to solve it?

I have notified @eric-haibin-lin and identified the regression happens after this commit:

184a0007bc4165d5fe080a58dd3ff9bb413203a6

Further investigation will be needed to pinpoint the exact change causing this bug.

Environment

We recommend using our script for collecting the diagnositc information. Run the following command and paste the outputs below:


curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 20.0.2
Directory    : XXXXX
----------MXNet Info-----------
Version      : 1.6.0
Directory    : XXXX
Num GPUs     : 4
Commit Hash   : 6eec9da55c5096079355d1f1a5fa58dcf35d6752
----------System Info----------
Platform     : Linux-4.4.0-1102-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-21-222
release      : 4.4.0-1102-aws
version      : #113-Ubuntu SMP Wed Jan 29 14:54:54 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0011 sec, LOAD: 0.0248 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0003 sec, LOAD: 0.0223 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0004 sec, LOAD: 0.0200 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0003 sec, LOAD: 0.0035 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0003 sec, LOAD: 0.0081 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0005 sec, LOAD: 0.3333 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0004 sec, LOAD: 0.0586 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0003 sec, LOAD: 0.0333 sec.```

kaonashi-tyc commented 4 years ago

Hi, any updates on this issue?

eric-haibin-lin commented 4 years ago

@leezu was there any known inconsistency issue with the refactoring PR?

eric-haibin-lin commented 4 years ago

dependencies:

ubuntu@ip-172-31-57-164:~$ pip3 list | grep mxnet
mxnet-cu100         1.6.0
ubuntu@ip-172-31-57-164:~$ pip3 list | grep torch
torch               1.5.0
ubuntu@ip-172-31-57-164:~$ pip3 list | grep fairseq
fairseq             0.9.0

wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/src/xlmr.base/ --verbose

with commit 4e555394baf557e5e55e1ae24a2147b03dce2213

torch.Size([514, 768])
stdev =  6.2218635e-07
stdev =  2.409921e-07
stdev =  2.4556107e-07
stdev =  2.0841426e-07
stdev =  2.5782302e-07

with commit 184a000


torch.Size([514, 768])
stdev =  0.00013029973
stdev =  0.00016538663

*** Maximum errors for vector of size 4608:  rtol=0.001, atol=0.001

  1: Error 4.499456  Location of error: (0, 1, 152), a=-0.09637016, b=-0.10132553
  2: Error 1.222937  Location of error: (0, 4, 152), a=0.10587245, b=0.10452169
stdev =  8.325626e-05
stdev =  0.00012943218

*** Maximum errors for vector of size 5376:  rtol=0.001, atol=0.001

  1: Error 1.608742  Location of error: (0, 4, 145), a=-0.08318494, b=-0.08493032
  2: Error 1.293719  Location of error: (0, 5, 152), a=0.03806081, b=0.03671959
  3: Error 1.101096  Location of error: (0, 4, 459), a=-0.71366501, b=-0.71555400
stdev =  8.076113e-05

leezu commented 4 years ago

We can fix commit 184a000 as follows

But there are more bugs introduced later

commit 2a964133c43516db5dc97ae320e80c337b0317e8 (HEAD -> fixfairseq, ec2/fixfairseq)
Author: Leonard Lausen <lausen@amazon.com>
Date:   Thu Apr 30 07:02:55 2020 +0000

    Fix layer_norm_eps in BERTEncoder

diff --git a/src/gluonnlp/model/bert.py b/src/gluonnlp/model/bert.py
index 27bea07d0..ec234c5b3 100644
--- a/src/gluonnlp/model/bert.py
+++ b/src/gluonnlp/model/bert.py
@@ -112,11 +112,12 @@ class BERTEncoder(HybridBlock, Seq2SeqEncoder):
         self._output_attention = output_attention
         self._output_all_encodings = output_all_encodings
         self._dropout = dropout
+        self._layer_norm_eps = layer_norm_eps

         with self.name_scope():
             if dropout:
                 self.dropout_layer = nn.Dropout(rate=dropout)
-            self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=1e-12)
+            self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=self._layer_norm_eps)
             self.position_weight = self.params.get('position_weight', shape=(max_length, units),
                                                    init=weight_initializer)
             self.transformer_cells = nn.HybridSequential()
@@ -344,7 +345,7 @@ class BERTModel(HybridBlock):
             decoder = nn.HybridSequential(prefix=prefix)
             decoder.add(nn.Dense(units, flatten=False))
             decoder.add(GELU())
-            decoder.add(nn.LayerNorm(in_channels=units, epsilon=1e-12))
+            decoder.add(nn.LayerNorm(in_channels=units, epsilon=self.encoder._layer_norm_eps))
             decoder.add(nn.Dense(vocab_size, flatten=False, params=embed.collect_params()))
         assert decoder[3].weight == list(embed.collect_params().values())[0], \
             'The weights of word embedding are not tied with those of decoder'

With that fix, we get

stdev =  3.8439632e-07
stdev =  2.3018535e-07
stdev =  2.380526e-07
stdev =  2.1728458e-07
stdev =  5.5794425e-07

leezu commented 4 years ago

@eric-haibin-lin @kaonashi-tyc: @acphile helped to investigate which commit introduces the regression (after the layer_norm eps fix is applied). He finds that 75c29a3518ee42b98cc651b6922cbae85d2e961e is the first bad commit where regression still occurs after modifying src/gluonnlp/model/bert.py

eric-haibin-lin commented 4 years ago

Maybe because we switched to softmax with length, which assigns exact 0 to the masked position, where the older version assigns a number very close to 0...

kaonashi-tyc commented 4 years ago

Just check-back, is this issue resolved as of now?

szha commented 4 years ago

We found some similar issue in GluonNLP numpy version and came to this fix: https://github.com/apache/incubator-mxnet/pull/18827. Let's verify if this patch fixes this problem too.

eric-haibin-lin commented 4 years ago

@barry-jin could you help test it with latest mxnet and see if the issue persist?

barry-jin commented 4 years ago

@eric-haibin-lin Sure.

barry-jin commented 4 years ago

Dependencies:

ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102         1.6.0
ubuntu@ip-172-31-20-128:~$ pip3 list | grep torch
torch               1.6.0
ubuntu@ip-172-31-20-128:~$ pip3 list | grep fairseq
fairseq             0.9.0

Steps:

wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
tar -xzvf xmlr.base.tar.gz
python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/tmp/xlmr.base/ --verbose

Output:

torch.Size([514, 768])
stdev =  0.19624774

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1004.471985  Location of error: (0, 3, 152), a=1.01276755, b=-1.85499024
  2: Error 704.276367  Location of error: (0, 4, 588), a=1.44983697, b=7.28421259
  3: Error 647.054993  Location of error: (0, 3, 465), a=-0.56997108, b=0.21840222
  4: Error 600.453857  Location of error: (0, 0, 588), a=1.60323966, b=5.51549196
  5: Error 538.982544  Location of error: (0, 2, 749), a=-0.70366699, b=-0.10700862
  6: Error 522.929871  Location of error: (0, 2, 465), a=-0.22267099, b=0.62938124
  7: Error 508.801910  Location of error: (0, 3, 467), a=0.50823474, b=-0.00115474
  8: Error 508.074127  Location of error: (0, 2, 459), a=-0.04906178, b=-1.13256073
  9: Error 489.671600  Location of error: (0, 2, 152), a=0.31114697, b=-0.34982315
 10: Error 481.318054  Location of error: (0, 2, 145), a=1.19826555, b=0.48399293
stdev =  0.20739098

*** Maximum errors for vector of size 4608:  rtol=0.001, atol=0.001

  1: Error 1183.268433  Location of error: (0, 2, 588), a=2.24659586, b=-5.80201578
  2: Error 1161.049805  Location of error: (0, 4, 588), a=2.24659729, b=-6.74044371
  3: Error 928.264038  Location of error: (0, 1, 152), a=-1.12364161, b=-0.10132299
  4: Error 601.523010  Location of error: (0, 2, 259), a=-0.30029136, b=0.75595784
  5: Error 580.307129  Location of error: (0, 2, 145), a=-0.06280632, b=1.23304677
  6: Error 566.850525  Location of error: (0, 3, 145), a=0.53026021, b=-0.08447501
  7: Error 517.987793  Location of error: (0, 4, 145), a=-0.06280673, b=0.94433534
  8: Error 485.135864  Location of error: (0, 2, 459), a=0.22280854, b=-0.50950789
  9: Error 450.751129  Location of error: (0, 2, 12), a=0.06082506, b=0.93141055
 10: Error 446.660492  Location of error: (0, 3, 465), a=-0.33150461, b=0.20811081
stdev =  0.18995003

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1085.388550  Location of error: (0, 2, 588), a=1.46543539, b=-4.45078850
  2: Error 1066.772461  Location of error: (0, 1, 588), a=1.46543479, b=-5.97044420
  3: Error 787.812927  Location of error: (0, 3, 152), a=0.77056408, b=-0.08129084
  4: Error 655.578003  Location of error: (0, 2, 459), a=0.15703337, b=-1.44748211
  5: Error 601.337708  Location of error: (0, 3, 459), a=0.07274741, b=-1.32591033
  6: Error 526.234802  Location of error: (0, 1, 145), a=-0.01829262, b=1.07213926
  7: Error 472.145905  Location of error: (0, 1, 459), a=0.15703396, b=-0.59696794
  8: Error 416.252808  Location of error: (0, 3, 626), a=-0.33784640, b=0.13431576
  9: Error 395.673187  Location of error: (0, 2, 221), a=0.00158666, b=-0.65210831
 10: Error 373.908081  Location of error: (0, 2, 612), a=-0.22256255, b=0.24173050
stdev =  0.15333436

*** Maximum errors for vector of size 5376:  rtol=0.001, atol=0.001

  1: Error 1053.334717  Location of error: (0, 3, 588), a=1.30375504, b=-4.69525385
  2: Error 891.753967  Location of error: (0, 2, 152), a=0.84285378, b=-0.45175028
  3: Error 760.975037  Location of error: (0, 1, 145), a=0.68549037, b=-0.31580281
  4: Error 739.728027  Location of error: (0, 4, 459), a=0.55348921, b=-0.71555465
  5: Error 735.922119  Location of error: (0, 2, 459), a=0.12832984, b=-2.30080748
  6: Error 682.559387  Location of error: (0, 4, 152), a=0.38436258, b=-0.93937868
  7: Error 634.900940  Location of error: (0, 5, 152), a=0.69493502, b=0.03672028
  8: Error 586.230469  Location of error: (0, 2, 465), a=-0.43865040, b=0.35667226
  9: Error 583.885071  Location of error: (0, 3, 259), a=-0.20851725, b=0.90207750
 10: Error 562.726746  Location of error: (0, 4, 145), a=0.52558923, b=-0.08492993
stdev =  0.18738568

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1129.763306  Location of error: (0, 2, 588), a=1.67255735, b=-4.18295527
  2: Error 1094.432373  Location of error: (0, 3, 588), a=1.67261219, b=-6.12268019
  3: Error 638.781616  Location of error: (0, 1, 459), a=0.08820118, b=-1.52423167
  4: Error 575.069885  Location of error: (0, 1, 152), a=0.27381456, b=-0.70895278
  5: Error 545.580383  Location of error: (0, 2, 259), a=-0.20532960, b=0.74875915
  6: Error 454.918396  Location of error: (0, 1, 626), a=-0.42154688, b=0.06122302
  7: Error 442.742065  Location of error: (0, 3, 259), a=-0.20532867, b=0.42603865
  8: Error 399.461243  Location of error: (0, 1, 391), a=-0.13588569, b=0.43889853
  9: Error 396.245575  Location of error: (0, 1, 221), a=0.37708509, b=-0.03173557
 10: Error 390.489594  Location of error: (0, 2, 465), a=-0.22364777, b=0.27373093

szha commented 4 years ago

@barry-jin thanks for sharing. What's the GluonNLP version?

barry-jin commented 4 years ago

@szha GluonNLP version is 0.9.1

ubuntu@ip-172-31-20-128:~$ pip3 list | grep gluonnlp
gluonnlp                 0.9.1               /home/ubuntu/.local/lib/python3.6/site-packages

szha commented 4 years ago

Thanks. Could you try the nightly builds on https://dist.mxnet.io/python ? Let's check whether the master and v1.x branches are fixed. You can install the nightly builds for 2.0 with

pip3 install --pre --upgrade mxnet-cu102 -f https://dist.mxnet.io/python

and 1.x with

pip3 install --pre --upgrade 'mxnet-cu102<2' -f https://dist.mxnet.io/python

barry-jin commented 4 years ago

I tried 2.0 as follows:

Dependency change: mxnet

ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102              2.0.0b20200806

Stack trace for error message:

ubuntu@ip-172-31-20-128:~$ python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/tmp/xlmr.base/ --verbose
Traceback (most recent call last):
  File "convert_xlmr_model_to_gluon.py", line 13, in <module>
    import gluonnlp as nlp
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/__init__.py", line 24, in <module>
    from . import loss
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/loss/__init__.py", line 21, in <module>
    from . import activation_regularizer, loss, label_smoothing
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/loss/activation_regularizer.py", line 23, in <module>
    from mxnet.gluon.loss import Loss
ModuleNotFoundError: No module named 'mxnet.gluon.loss'

szha commented 4 years ago

This looks very weird. I can confirm that the loss module is there...

barry-jin commented 4 years ago

Output for v1.x: Dependency:

ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102              1.7.0b20200807

Output:

torch.Size([514, 768])
stdev =  0.19625182
stdev =  0.20733927
stdev =  0.18992558
stdev =  0.15340993
stdev =  0.1872536

No error message, but stdev is still around 0.2

szha commented 4 years ago

@kaonashi-tyc could you take a look at the result above on 1.7?

eric-haibin-lin commented 4 years ago

The stdev is still too large

kaonashi-tyc commented 4 years ago

Yeah, the stdev is still too large IMO The previous deviation is between 1e-5

barry-jin commented 4 years ago

In my opinion, there may be something wrong with the versions prior to 4e55539.

For bert.py versions prior to 4e55539, Dot Product Attention Cell is used and in its attention weight computing function, softmax goes first and then mask is applied. This is different from the attention mechanisms stated in d2l, where, for masked_softmax, mask should be applied first and then softmax.

For bert.py current versions, softmax with length is used. In its definition file, mask is applied first and then softmax, which is consistent with d2l. So, current version of bert.py to calculate masked_softmax is right.

I'm not sure my opinion is right or not, but I found this had affected the attention weights a lot in my debug process and may have some impact on the final stdev.

eric-haibin-lin commented 4 years ago

I have a patch that fixes the incorrect bias term: https://github.com/dmlc/gluon-nlp/compare/master...eric-haibin-lin:fix-attn

dmlc / gluon-nlp