dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Cannot apply parameter sharing on WeightDropParameter #1060

Closed liuzh47 closed 4 years ago

liuzh47 commented 4 years ago

Description

Parameter sharing of WeightDropParameter is broken. I apply weight sharing between AWDRNN and train.AWDRNN as follows:

model = nlp.model.train.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                               args.tied, args.dropout, args.weight_dropout,
                               args.dropout_h, args.dropout_i, args.dropout_e)
model_eval = nlp.model.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                              args.tied, args.dropout, args.weight_dropout,
                              args.dropout_h, args.dropout_i, args.dropout_e,
                              params=model.collect_params())

model.initialize(mx.init.Xavier(), ctx=context)

model.hybridize(static_alloc=True)

print(model)

def check_initialized(net):
    params = net.collect_params()
    for param in params:
        try:
            params[param].list_ctx()
        except RuntimeError:
            return False
    return True

print(check_initialized(model))
print(check_initialized(model_eval))

Log Message

After I ran the above code, I got the following message:

True
False

It appears that model_eval is not properly initialized. After an inspection, we found that it is the WeightDropParameter that is not initialized.

(Pdb) model_eval.collect_params()["awdrnn0_hybridsequential0_embedding0_weight"].data()
*** RuntimeError: Parameter 'awdrnn0_hybridsequential0_embedding0_weight' has not been initialized. Note that you should initialize parameters and create Trainer with Block.collect_params() instead of Block.params because the later does not include Parameters of nested child Blocks
(Pdb) model_eval.collect_params()["awdrnn0_hybridsequential0_embedding0_weight"]
WeightDropParameter awdrnn0_hybridsequential0_embedding0_weight (shape=(33278, 400), dtype=float32, rate=0.1, mode=training)

Environment

My environment specs:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

---------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.2.3
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet
Num GPUs     : 1
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.15.0-1056-aws-x86_64-with-debian-buster-sid
system       : Linux
node         : ip-172-31-23-26
release      : 4.15.0-1056-aws
version      : #58-Ubuntu SMP Tue Nov 26 15:14:34 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2704.026
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.12
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
liuzh47 commented 4 years ago

One workaround for the above problem is to initialize the model parameters before applying weight sharing:

model = nlp.model.train.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                               args.tied, args.dropout, args.weight_dropout,
                               args.dropout_h, args.dropout_i, args.dropout_e)
model.initialize(mx.init.Xavier(), ctx=context)
model_eval = nlp.model.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                              args.tied, args.dropout, args.weight_dropout,
                              args.dropout_h, args.dropout_i, args.dropout_e,
                              params=model.collect_params())
leezu commented 4 years ago

I suggest we drop the WeightDropParameter. Instead users can manually call Dropout in forward.

Understanding WeightDropParameter requires intricate familiarity with the way Gluon handles Parameters and Blocks, and most users will have a hard time understanding it. However, our goal is to have easy-to-understand code. Thus, in its current implementation, WeightDropParameter doesn't align with our goals.

Further, as pointed out by @liuzh91, the implementation is currently broken.

What do others think?
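
For concreteness, a minimal sketch of the "manually call Dropout in forward" idea (illustrative only; ManualWeightDropDense is not an existing gluon-nlp or MXNet layer):

import mxnet as mx
from mxnet import gluon, nd

class ManualWeightDropDense(gluon.HybridBlock):
    """Owns a plain Parameter and applies Dropout to the weight itself in
    hybrid_forward, instead of wrapping it in WeightDropParameter."""

    def __init__(self, in_units, units, weight_drop=0.5, **kwargs):
        super(ManualWeightDropDense, self).__init__(**kwargs)
        self._units = units
        self._weight_drop = weight_drop
        with self.name_scope():
            self.weight = self.params.get('weight', shape=(units, in_units))
            self.bias = self.params.get('bias', shape=(units,))

    def hybrid_forward(self, F, x, weight, bias):
        if self._weight_drop:
            # Dropout on the weight matrix itself; only active in training mode.
            weight = F.Dropout(weight, p=self._weight_drop)
        return F.FullyConnected(x, weight, bias, num_hidden=self._units, flatten=False)

net = ManualWeightDropDense(4, 3)
net.initialize()
with mx.autograd.record():
    out = net(nd.random.uniform(shape=(2, 4)))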

liuzh47 commented 4 years ago

The above workaround does not work if we have args.tied turned on. In this case, after running check_initialized(model_eval), I got the following error message:

*** RuntimeError: Parameter 'awdrnn0_hybridsequential0_embedding0_bias' has not been initialized

It is exactly the output layer of model_eval that has not been initialized. model_eval contains the following parameters:

model_eval.collect_params()
awdrnn0_ (
  WeightDropParameter awdrnn0_hybridsequential0_embedding0_weight (shape=(33278, 400), dtype=float32, rate=0.1, mode=training)
  Parameter awdrnn0_hybridsequential1_lstm0_l0_i2h_weight (shape=(4600, 400), dtype=float32)
  WeightDropParameter awdrnn0_hybridsequential1_lstm0_l0_h2h_weight (shape=(4600, 1150), dtype=float32, rate=0.5, mode=training)
  Parameter awdrnn0_hybridsequential1_lstm0_l0_i2h_bias (shape=(4600,), dtype=float32)
  Parameter awdrnn0_hybridsequential1_lstm0_l0_h2h_bias (shape=(4600,), dtype=float32)
  Parameter awdrnn0_hybridsequential1_lstm1_l0_i2h_weight (shape=(4600, 1150), dtype=float32)
  WeightDropParameter awdrnn0_hybridsequential1_lstm1_l0_h2h_weight (shape=(4600, 1150), dtype=float32, rate=0.5, mode=training)
  Parameter awdrnn0_hybridsequential1_lstm1_l0_i2h_bias (shape=(4600,), dtype=float32)
  Parameter awdrnn0_hybridsequential1_lstm1_l0_h2h_bias (shape=(4600,), dtype=float32)
  Parameter awdrnn0_hybridsequential1_lstm2_l0_i2h_weight (shape=(1600, 1150), dtype=float32)
  WeightDropParameter awdrnn0_hybridsequential1_lstm2_l0_h2h_weight (shape=(1600, 400), dtype=float32, rate=0.5, mode=training)
  Parameter awdrnn0_hybridsequential1_lstm2_l0_i2h_bias (shape=(1600,), dtype=float32)
  Parameter awdrnn0_hybridsequential1_lstm2_l0_h2h_bias (shape=(1600,), dtype=float32)
  Parameter awdrnn0_hybridsequential0_embedding0_bias (shape=(33278,), dtype=float32)
)

WeightDropParameter awdrnn0_hybridsequential0_embedding0_weight and Parameter awdrnn0_hybridsequential0_embedding0_bias are tied together. The WeightDropParameter awdrnn0_hybridsequential0_embedding0_weight is properly initialized, whereas Parameter awdrnn0_hybridsequential0_embedding0_bias is not. I guess it is the WeightDropParameter that causes the initialization error.

liuzh47 commented 4 years ago

If I apply weight dropout during the forward pass:

embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight']
embedding_param.set_data(mx.nd.Dropout(embedding_param.data(), 0.3))

I get the following error message:

*** mxnet.base.MXNetError: [10:20:43] src/imperative/imperative.cc:203: Check failed: AGInfo::IsNone(*output): Assigning to NDArrays that are already in a computational graph will cause undefined behavior when evaluating gradients. Please call backward first to clear the graph or do this out side of a record section. Also note that you cannot use inplace operations like +=, *=, relu(x, out=x), y[idx]=x, etc inside a record section.

Note that my code is wrapped in autograd.record(). The error suggests applying the weight dropout outside the record section, but that is impractical in this case. Is there any workaround for this error?

leezu commented 4 years ago

Yes, this will not work indeed.

You need to do something like

def forward(self, ...):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    embedding_param = mx.nd.Dropout(embedding_param, 0.3)
    ....
liuzh47 commented 4 years ago

Yes, this will not work indeed.

You need to do something like

def forward(self, ...):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    embedding_param = mx.nd.Dropout(embedding_param, 0.3)
    ....

It should use slice assignment embedding_param[:] instead of embedding_param. Otherwise, the weight self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'] does not change at all. Concretely, I implemented the following code:

    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    embedding_param[:] = mx.nd.Dropout(embedding_param, 0.3)

But if I do it in this way, I still get the same MXNetError:

*** mxnet.base.MXNetError: [03:55:51] src/imperative/imperative.cc:203: Check failed: AGInfo::IsNone(*output): Assigning to NDArrays that are already in a computational graph will cause undefined behavior when evaluating gradients. Please call backward first to clear the graph or do this out side of a record section. Also note that you cannot use inplace operations like +=, *=, relu(x, out=x), y[idx]=x, etc inside a record section.
liuzh47 commented 4 years ago

I suggest we drop the WeightDropParameter. Instead users can manually call Dropout in forward.

Understanding WeightDropParameter requires intricate familiarity with the way Gluon handles Parameters and Blocks, and most users will have a hard time understanding it. However, our goal is to have easy-to-understand code. Thus, in its current implementation, WeightDropParameter doesn't align with our goals.

Further, as pointed out by @liuzh91, the implementation is currently broken.

What do others think?

It seems we cannot manually assign weights inside the record scope. I found a bunch of similar assignment errors online, e.g., https://discuss.mxnet.io/t/autograd-record-error-about-assigning-to-ndarray/322 and https://github.com/apache/incubator-mxnet/issues/9989.

Alternatively, I have tried the weight assignment outside the record scope, as suggested by the error message. However, mx.nd.Dropout defers the evaluation until entering the with autograd.record() scope.

Perhaps we still need to switch back to the apply_weight_drop() function and fix the shared-parameter problem there. apply_weight_drop() technically replaces every param matching local_param_regex with a new dropped_param, even when the param is already an instance of WeightDropParameter. If my understanding is correct, when the param is already a WeightDropParameter because of weight sharing, we do not need to create a new WeightDropParameter dropped_param and replace every occurrence of the param. The following assignment is therefore unnecessary: https://github.com/dmlc/gluon-nlp/blob/434187eb5e7bb898482021ffbc9c1648d0700608/src/gluonnlp/model/utils.py#L92 It can be replaced by

dropped_param = param

[UPDATE] If I add a check before creating a new dropped_param in L92:

if isinstance(param, WeightDropParameter):
    continue
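
For reference, a fuller sketch of this guard (hedged: it assumes WeightDropParameter can be imported from gluonnlp.model.parameter and keeps the (parameter, rate, mode, axes) constructor that apply_weight_drop calls; both are assumptions, and the helper name is illustrative):

from gluonnlp.model.parameter import WeightDropParameter  # assumed import path

def wrap_for_weight_drop(param, rate, mode='training', axes=()):
    """Wrap `param` for weight drop unless it is already wrapped.

    Re-wrapping a shared WeightDropParameter replaces the shared object with a
    new one and silently breaks parameter sharing between models, which is the
    failure described above.
    """
    if isinstance(param, WeightDropParameter):
        return param
    return WeightDropParameter(param, rate, mode, axes)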

The initialization works if I don't have args.tied turned on. If args.tied is turned on, the tied parameter Parameter awdrnn0_hybridsequential0_embedding0_bias still suffers from the initialization problem. The bug is triggered by the following code:

with output.name_scope():
    if self._tie_weights:
        output.add(nn.Dense(self._vocab_size, flatten=False,
                            params=self.embedding[0].params))
    else:
        output.add(nn.Dense(self._vocab_size, flatten=False))
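
As a side note, here is a small, self-contained illustration of this tying mechanism in plain Gluon (a sketch against MXNet 1.x semantics, not gluon-nlp code): passing params= makes the Dense reuse the Embedding's weight, while the Dense bias remains a separate, new parameter.

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn

vocab_size, embed_size = 10, 4
embedding = nn.Embedding(vocab_size, embed_size)
decoder = nn.Dense(vocab_size, flatten=False, params=embedding.params)

# The weight is the very same Parameter object; only the bias is new.
assert decoder.weight is embedding.weight

embedding.initialize(mx.init.Uniform(0.1))
decoder.initialize()  # MXNet warns that the shared weight is already initialized and skips it

x = nd.array([[1, 2, 3]])
out = decoder(embedding(x))  # shape (1, 3, vocab_size)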

[UPDATE]2: After some investigation, I found the following debugging info:

(Pdb) model_eval.decoder[0]._params._shared.list_ctx()
[cpu(0)]
(Pdb) model_eval.decoder[0]._params.list_ctx()
*** RuntimeError: Parameter 'awdrnn0_hybridsequential0_embedding0_bias' has not been initialized

I believe params should always have the same context as params._shared. Something is messed up here.

[UPDATE]3: I am confused by the weight tying here. The two tied layers have different initialized values:

(Pdb) model.collect_params()['awdrnn0_hybridsequential0_embedding0_weight']._data
[
[[ 0.0097627   0.01856892  0.04303787 ... -0.08871634 -0.01311667
  -0.00243246]
 [-0.03764082  0.07620091  0.03926869 ...  0.00636984 -0.06295353
   0.06907154]
 [-0.0197481   0.00725491  0.08585829 ... -0.09896208  0.09590539
   0.09404436]
 ...
 [-0.05431841  0.02288689  0.05555419 ...  0.05949707  0.00848633
  -0.0083269 ]
 [ 0.04172196  0.00805981 -0.03484908 ... -0.08370983  0.04527131
   0.05345441]
 [ 0.07405332  0.05066857  0.02247771 ... -0.00969069  0.03976088
   0.00947104]]
<NDArray 33278x400 @cpu(0)>]
(Pdb) model.collect_params()['awdrnn0_hybridsequential0_embedding0_bias']._data
[
[0. 0. 0. ... 0. 0. 0.]
<NDArray 33278 @cpu(0)>]
leezu commented 4 years ago

Why do you want to assign the weight? You can access the non-weight-dropped weight and apply dropout to the sliced part that you are interested in. It's neither necessary nor possible to modify the actual parameter via assignment in the forward pass.

The code I posted above, which doesn't do any in-place modification of the weight, is correct. You just need to keep the variable around so that you use the same dropped-out version at every location in the forward pass.

liuzh47 commented 4 years ago

Why do you want to assign the weight? You can access the non-weight-dropped weight and apply dropout to the sliced part that you are interested in. It's neither necessary nor possible to modify the actual parameter via assignment in the forward pass.

The code I posted above, which doesn't do any in-place modification of the weight, is correct. You just need to keep the variable around so that you use the same dropped-out version at every location in the forward pass.

The key problem is how to do gradient backpropagation if the weight is not updated. If some weights are dropped in the forward pass, they should never be updated during the current iteration. I really doubt whether the gradient computation will still work correctly with your hack.

Also, keeping network parameters in a temporary variable is not memory efficient.

sxjscience commented 4 years ago

I think we can add the weight drop flag to the Dense/RNN/Embedding layers. In that case, we can apply dropout inside hybrid_forward:

https://github.com/apache/incubator-mxnet/blob/2ad3ce408b08456bff0b92a82aa3c484adde29f2/python/mxnet/gluon/nn/basic_layers.py#L222-L228
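
A rough sketch of that idea (not an existing MXNet layer; the hybrid_forward signature follows the Dense code linked above, and the weight_drop argument is hypothetical):

import mxnet as mx
from mxnet.gluon import nn

class WeightDropDense(nn.Dense):
    """Dense layer with a weight_drop flag applied inside hybrid_forward."""

    def __init__(self, units, weight_drop=0.0, **kwargs):
        super(WeightDropDense, self).__init__(units, **kwargs)
        self._weight_drop = weight_drop

    def hybrid_forward(self, F, x, weight, bias=None):
        if self._weight_drop:
            # Drop entries of the weight matrix before the dense projection.
            weight = F.Dropout(weight, p=self._weight_drop)
        return super(WeightDropDense, self).hybrid_forward(F, x, weight, bias)

layer = WeightDropDense(8, weight_drop=0.5, in_units=4)
layer.initialize()
with mx.autograd.record():
    out = layer(mx.nd.random.uniform(shape=(2, 4)))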

liuzh47 commented 4 years ago

I finally figured out that there is an MXNet weight sharing bug. When weight sharing is used together with tied weights, something unexpected occurs. For this piece of code:

model = nlp.model.train.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                               args.tied, args.dropout, args.weight_dropout,
                               args.dropout_h, args.dropout_i, args.dropout_e)
model_eval = nlp.model.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                              args.tied, args.dropout, args.weight_dropout,
                              args.dropout_h, args.dropout_i, args.dropout_e,
                              params=model.collect_params())

model.initialize(mx.init.Xavier(), ctx=context)

model.hybridize(static_alloc=True)

Even if I comment out all the apply_weight_drop calls in the code, the initialization error still occurs when args.tied is turned on. I print out some pdb info here:

(Pdb) model_eval.decoder[0]._params
awdrnn0_hybridsequential0_embedding0_ (
  Parameter awdrnn0_hybridsequential0_embedding0_weight (shape=(33278, 400), dtype=float32)
  Parameter awdrnn0_hybridsequential0_embedding0_bias (shape=(33278,), dtype=float32)
)
(Pdb) model_eval.decoder[0]._params.list_ctx()
*** RuntimeError: Parameter 'awdrnn0_hybridsequential0_embedding0_bias' has not been initialized
(Pdb) model_eval.decoder[0]._params._shared
awdrnn0_hybridsequential0_embedding0_ (
  Parameter awdrnn0_hybridsequential0_embedding0_weight (shape=(33278, 400), dtype=float32)
)
(Pdb) model_eval.decoder[0]._params._shared.list_ctx()
[cpu(0)]

The conclusion is: no matter whether we use weight drop or not, weight tying and weight sharing cannot be used simultaneously. Otherwise, there will be an initialization error. I will submit a separate issue for this in MXNet.

For the current weight drop bug, I think my fix is simpler and more elegant. I will submit my PR very soon.

liuzh47 commented 4 years ago

I think we can add the weight drop flag to the Dense/RNN/Embedding layers. In that case, we can apply dropout inside hybrid_forward:

https://github.com/apache/incubator-mxnet/blob/2ad3ce408b08456bff0b92a82aa3c484adde29f2/python/mxnet/gluon/nn/basic_layers.py#L222-L228

It is similar to @leezu's proposal. My point is that using weight drop in forward/hybrid_forward will lead to unexpected gradient computation errors. So I suggest we stick to the existing implementation of apply_weight_drop.

sxjscience commented 4 years ago

Let me try that. It should be fine (otherwise we will need to fix it on the MXNet side). The autograd error you've met is due to the fact that the autograd engine in MXNet does not support assignment operations.

leezu commented 4 years ago

The key problem is how to do gradient backpropagation if the weight is not updated. If some weights are dropped in the forward pass, they should never be updated during the current iteration. I really doubt whether the gradient computation will still work correctly with your hack.

It's the same as Dropout. The gradient will be 0 for the dropped out parts. The proposal above is correct.
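
A quick way to check this (a standalone, hedged snippet, not gluon-nlp code): apply Dropout to a parameter's data inside autograd.record() and inspect the gradient; the entries dropped in the forward pass receive zero gradient.

import mxnet as mx
from mxnet import autograd, gluon, nd

weight = gluon.Parameter('weight', shape=(5, 3))
weight.initialize(mx.init.Uniform(0.1))

x = nd.random.uniform(shape=(2, 3))
with autograd.record():
    w_dropped = nd.Dropout(weight.data(), p=0.5)  # dropout on the weight itself
    loss = nd.dot(x, w_dropped.T).sum()
loss.backward()

print(w_dropped)      # zeros mark the dropped entries
print(weight.grad())  # dropped entries have zero gradient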

liuzh47 commented 4 years ago

It's the same as Dropout. The gradient will be 0 for the dropped out parts. The proposal above is correct.

I will check if gradient computation works correctly or not.

leezu commented 4 years ago

Ok, thank you. For your convenience, let me give a more complete code example for the case above. If you had before

def forward(self, x):
    ....
    embeddings = self.embedding(x)  # self.embedding is a gluon.nn.Embedding
    ....

then for weight dropout you need to

def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    dropped_embedding = mx.nd.Dropout(embedding_param, 0.3)
    word_embedding = mx.nd.Embedding(x, dropped_embedding)
    ....

Further, you can optimize the computation by only performing dropout on the part of the weight that you are interested in:

def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    word_embedding = mx.nd.Embedding(x, embedding_param)
    dropped_embedding = mx.nd.Dropout(word_embedding, 0.3)
    ....
liuzh47 commented 4 years ago

@leezu Thanks for the example. I have a few concerns about your code.

def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    word_embedding = mx.nd.Embedding(x, embedding_param)
    dropped_embedding = mx.nd.Dropout(word_embedding, 0.3)
    ....

I believe this piece of code applies dropout to the word embeddings instead of to the embedding weights.

The original embedding architecture looks like:

embedding = nn.HybridSequential()
with embedding.name_scope():
    embedding_block = nn.Embedding(self._vocab_size, self._embed_size,
                                   weight_initializer=init.Uniform(0.1))
    if self._drop_e:
        apply_weight_drop(embedding_block, 'weight', self._drop_e, axes=(1,))
    embedding.add(embedding_block)
    if self._drop_i:
        embedding.add(nn.Dropout(self._drop_i, axes=(0,)))
return embedding

So I think your implementation should look like

def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    if self._drop_e:
        embedding_param = mx.nd.Dropout(embedding_param, self._drop_e)
    word_embedding = mx.nd.Embedding(x, embedding_param)
    if self._drop_i:
        word_embedding = mx.nd.Dropout(word_embedding, self._drop_i)
    ....

Each time there is a dropout operation, do you have to reimplement the block structure in the forward function? That looks really impractical. What about weight drop on RNN cells? https://github.com/dmlc/gluon-nlp/blob/434187eb5e7bb898482021ffbc9c1648d0700608/src/gluonnlp/model/utils.py#L240

leezu commented 4 years ago

Yes, the above example for applying dropout sparsely was not correct. You would need np.unique with return_inverse=True to make sure the same dropout mask is applied to repeated words. But in fact, this kind of optimization is not crucial (it is currently not done either).
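
A rough sketch of that trick (illustrative, standalone; mode='always' is only used so the snippet shows the effect outside a training loop):

import numpy as np
from mxnet import nd

vocab_size, embed_size, p = 10, 4, 0.5
weight = nd.random.uniform(shape=(vocab_size, embed_size))
x = nd.array([[1, 2, 1, 3]])  # word id 1 repeats

# Drop rows only for the unique ids, then map back so repeated ids share one mask.
unique_ids, inverse = np.unique(x.asnumpy().astype('int64'), return_inverse=True)
unique_rows = nd.Embedding(nd.array(unique_ids), weight,
                           input_dim=vocab_size, output_dim=embed_size)
dropped_rows = nd.Dropout(unique_rows, p=p, mode='always')
word_embedding = nd.take(dropped_rows, nd.array(inverse)).reshape(x.shape + (embed_size,))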

You're right about RNN cells. One way to avoid re-implementing the RNN cell is to add weight_drop support to the cells upstream in MXNet, as @sxjscience suggested (https://github.com/dmlc/gluon-nlp/issues/1060#issuecomment-569169629).

So for now it may be simplest to keep apply_weight_drop around for the RNN use-case? When redesigning parameter handling in MXNet 2 we can revisit the design of apply_weight_drop and simplify the weight drop API / implementation.