Closed liuzh47 closed 4 years ago
One workaround to the above problem is to initialize the model
parameters before applying weight sharing:
```python
model = nlp.model.train.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                               args.tied, args.dropout, args.weight_dropout,
                               args.dropout_h, args.dropout_i, args.dropout_e)
model.initialize(mx.init.Xavier(), ctx=context)
model_eval = nlp.model.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                              args.tied, args.dropout, args.weight_dropout,
                              args.dropout_h, args.dropout_i, args.dropout_e,
                              params=model.collect_params())
```
I suggest we drop `WeightDropParameter`. Instead, users can manually call `Dropout` in `forward`.

Understanding `WeightDropParameter` requires intricate familiarity with the way Gluon handles Parameters and Blocks, and most users will have a hard time understanding it. However, our goal is to have easy-to-understand code. Thus, in its current implementation, `WeightDropParameter` doesn't align with our goals.

Further, as pointed out by @liuzh91, the implementation is currently broken.

What do others think?
The above workaround does not work if we have `args.tied` turned on. In this case, after I ran `check_initialized(model_eval)`, I got the following error message:

```
*** RuntimeError: Parameter 'awdrnn0_hybridsequential0_embedding0_bias' has not been initialized
```
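(`check_initialized` is a local helper; a minimal sketch of such a helper, which surfaces the same error simply by touching every parameter's data, could look like this:)

```python
import mxnet as mx

def check_initialized(block):
    # Hypothetical helper: Parameter.data() raises
    # RuntimeError("Parameter '...' has not been initialized")
    # for any parameter that has not been initialized yet.
    for name, param in block.collect_params().items():
        param.data()
        print(name, 'ok')
```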
It is exactly the output layer of `model_eval` that has not been initialized. `model_eval` contains the following parameters:

```
model_eval.collect_params()
awdrnn0_ (
WeightDropParameter awdrnn0_hybridsequential0_embedding0_weight (shape=(33278, 400), dtype=float32, rate=0.1, mode=training)
Parameter awdrnn0_hybridsequential1_lstm0_l0_i2h_weight (shape=(4600, 400), dtype=float32)
WeightDropParameter awdrnn0_hybridsequential1_lstm0_l0_h2h_weight (shape=(4600, 1150), dtype=float32, rate=0.5, mode=training)
Parameter awdrnn0_hybridsequential1_lstm0_l0_i2h_bias (shape=(4600,), dtype=float32)
Parameter awdrnn0_hybridsequential1_lstm0_l0_h2h_bias (shape=(4600,), dtype=float32)
Parameter awdrnn0_hybridsequential1_lstm1_l0_i2h_weight (shape=(4600, 1150), dtype=float32)
WeightDropParameter awdrnn0_hybridsequential1_lstm1_l0_h2h_weight (shape=(4600, 1150), dtype=float32, rate=0.5, mode=training)
Parameter awdrnn0_hybridsequential1_lstm1_l0_i2h_bias (shape=(4600,), dtype=float32)
Parameter awdrnn0_hybridsequential1_lstm1_l0_h2h_bias (shape=(4600,), dtype=float32)
Parameter awdrnn0_hybridsequential1_lstm2_l0_i2h_weight (shape=(1600, 1150), dtype=float32)
WeightDropParameter awdrnn0_hybridsequential1_lstm2_l0_h2h_weight (shape=(1600, 400), dtype=float32, rate=0.5, mode=training)
Parameter awdrnn0_hybridsequential1_lstm2_l0_i2h_bias (shape=(1600,), dtype=float32)
Parameter awdrnn0_hybridsequential1_lstm2_l0_h2h_bias (shape=(1600,), dtype=float32)
Parameter awdrnn0_hybridsequential0_embedding0_bias (shape=(33278,), dtype=float32)
)
```
`WeightDropParameter awdrnn0_hybridsequential0_embedding0_weight` and `Parameter awdrnn0_hybridsequential0_embedding0_bias` have their weights tied. `WeightDropParameter awdrnn0_hybridsequential0_embedding0_weight` is properly initialized, whereas `Parameter awdrnn0_hybridsequential0_embedding0_bias` is not. I guess it is `WeightDropParameter` that causes the initialization error.
If I apply weight dropout during the forward pass:

```python
embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight']
embedding_param.set_data(mx.nd.Dropout(embedding_param.data(), 0.3))
```

I get the following error message:
```
*** mxnet.base.MXNetError: [10:20:43] src/imperative/imperative.cc:203: Check failed: AGInfo::IsNone(*output): Assigning to NDArrays that are already in a computational graph will cause undefined behavior when evaluating gradients. Please call backward first to clear the graph or do this out side of a record section. Also note that you cannot use inplace operations like +=, *=, relu(x, out=x), y[idx]=x, etc inside a record section.
```
Note that my code is wrapped in `autograd.record()`. The error suggests applying the weight dropout outside the `record` section, but that is impractical in this case. Is there any workaround for this error?
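For reference, the same error can be reproduced outside of AWDRNN with any parameter that already participates in the recorded graph; a minimal sketch using a plain `Dense` block (names assumed):

```python
import mxnet as mx
from mxnet import autograd, gluon

layer = gluon.nn.Dense(2, in_units=3)
layer.initialize()

with autograd.record():
    out = layer(mx.nd.ones((1, 3)))   # the weight now participates in the graph
    w = layer.weight.data()
    # Writing into the weight inside the record section triggers the
    # "Assigning to NDArrays that are already in a computational graph" error.
    w[:] = mx.nd.Dropout(w, 0.5)
```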
Yes, this indeed will not work. You need to do something like:

```python
def forward(self, ...):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    embedding_param = mx.nd.Dropout(embedding_param, 0.3)
    ....
```
> Yes, this indeed will not work. You need to do something like:
>
> ```python
> def forward(self, ...):
>     ....
>     embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
>     embedding_param = mx.nd.Dropout(embedding_param, 0.3)
>     ....
> ```
It should use slice assignment, `embedding_param[:]`, instead of `embedding_param`. Otherwise, the weight `self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight']` does not change at all. Concretely, I implemented the following code:

```python
embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
embedding_param[:] = mx.nd.Dropout(embedding_param, 0.3)
```
But if I do it this way, I still get the same `MXNetError`:

```
*** mxnet.base.MXNetError: [03:55:51] src/imperative/imperative.cc:203: Check failed: AGInfo::IsNone(*output): Assigning to NDArrays that are already in a computational graph will cause undefined behavior when evaluating gradients. Please call backward first to clear the graph or do this out side of a record section. Also note that you cannot use inplace operations like +=, *=, relu(x, out=x), y[idx]=x, etc inside a record section.
```
> I suggest we drop `WeightDropParameter`. Instead, users can manually call `Dropout` in `forward`.
>
> Understanding `WeightDropParameter` requires intricate familiarity with the way Gluon handles Parameters and Blocks, and most users will have a hard time understanding it. However, our goal is to have easy-to-understand code. Thus, in its current implementation, `WeightDropParameter` doesn't align with our goals.
>
> Further, as pointed out by @liuzh91, the implementation is currently broken.
>
> What do others think?
It seems we cannot manually assign weights during the record session. I found a bunch of similar assignment errors on the internet, e.g., https://discuss.mxnet.io/t/autograd-record-error-about-assigning-to-ndarray/322 and https://github.com/apache/incubator-mxnet/issues/9989.

Alternatively, I have tried weight assignment outside the record session, as suggested by the error message. However, `mx.nd.Dropout` deferred the evaluation until entering the `with autograd.record()` scope.
Perhaps we still need to switch back to the `apply_weight_drop()` function and fix the shared-parameters problem there. `apply_weight_drop()` technically replaces every `param` matching `local_param_regex` with a new `dropped_param`, even if the `param` is already an instance of `WeightDropParameter`. If my understanding is correct, when the `param` is already a `WeightDropParameter` because of weight sharing, we do not need to create a new `WeightDropParameter` `dropped_param` and replace every occurrence of the `param`. The following assignment is therefore unnecessary:

https://github.com/dmlc/gluon-nlp/blob/434187eb5e7bb898482021ffbc9c1648d0700608/src/gluonnlp/model/utils.py#L92

It can be replaced by

```python
dropped_param = param
```
[UPDATE] If I add a check before creating a new `dropped_param` in L92:

```python
if isinstance(param, WeightDropParameter):
    continue
```
The initialization works if I don't have `args.tied` turned on. If `args.tied` is turned on, the tied parameter `awdrnn0_hybridsequential0_embedding0_bias` still suffers from the initialization problem. The bug is triggered by the following code:

```python
with output.name_scope():
    if self._tie_weights:
        output.add(nn.Dense(self._vocab_size, flatten=False,
                            params=self.embedding[0].params))
    else:
        output.add(nn.Dense(self._vocab_size, flatten=False))
```
[UPDATE 2]: After some investigation, I found the following debugging info:

```
(Pdb) model_eval.decoder[0]._params._shared.list_ctx()
[cpu(0)]
(Pdb) model_eval.decoder[0]._params.list_ctx()
*** RuntimeError: Parameter 'awdrnn0_hybridsequential0_embedding0_bias' has not been initialized
```

I believe `params` should always have the same context as `params._shared`. Something is messed up here.
[UPDATE 3]: I am confused by the weight tying here. The two tied layers have different initialized values:

```
(Pdb) model.collect_params()['awdrnn0_hybridsequential0_embedding0_weight']._data
[
[[ 0.0097627 0.01856892 0.04303787 ... -0.08871634 -0.01311667
-0.00243246]
[-0.03764082 0.07620091 0.03926869 ... 0.00636984 -0.06295353
0.06907154]
[-0.0197481 0.00725491 0.08585829 ... -0.09896208 0.09590539
0.09404436]
...
[-0.05431841 0.02288689 0.05555419 ... 0.05949707 0.00848633
-0.0083269 ]
[ 0.04172196 0.00805981 -0.03484908 ... -0.08370983 0.04527131
0.05345441]
[ 0.07405332 0.05066857 0.02247771 ... -0.00969069 0.03976088
0.00947104]]
<NDArray 33278x400 @cpu(0)>]
(Pdb) model.collect_params()['awdrnn0_hybridsequential0_embedding0_bias']._data
[
[0. 0. 0. ... 0. 0. 0.]
<NDArray 33278 @cpu(0)>]
```
Why do you want to assign the weight? You can access the non-weight-dropped weight and apply dropout to the sliced part that you are interested in. It's neither necessary nor possible to modify the actual parameter via assignment in the forward pass.

The code I posted above, which doesn't do any in-place modification of the weight, is correct. You just need to keep the variable around so that you use the same dropped-out version at any location in the forward pass.
> Why do you want to assign the weight? You can access the non-weight-dropped weight and apply dropout to the sliced part that you are interested in. It's neither necessary nor possible to modify the actual parameter via assignment in the forward pass.
>
> The code I posted above, which doesn't do any in-place modification of the weight, is correct. You just need to keep the variable around so that you use the same dropped-out version at any location in the forward pass.
The key problem is: how does gradient backpropagation work if the weight is not updated? If some weights are dropped in the forward pass, they should never be updated during the current iteration. I really doubt whether the gradient computation will still work correctly with your hack.

Also, keeping network parameters in a temporary variable is not memory efficient.
I think we can add a weight-drop flag to the Dense/RNN/Embedding layers. In that case, we can apply dropout inside `hybrid_forward`, along the lines of the sketch below.
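A minimal sketch of what such a flag could look like, assuming a hypothetical custom Dense-style block (`WeightDropDense` is not an existing Gluon or GluonNLP class):

```python
from mxnet.gluon import HybridBlock

class WeightDropDense(HybridBlock):
    """Hypothetical Dense variant: dropout is applied to the weight inside
    hybrid_forward instead of wrapping the Parameter in WeightDropParameter."""
    def __init__(self, units, in_units, weight_drop=0.0, **kwargs):
        super(WeightDropDense, self).__init__(**kwargs)
        self._units = units
        self._weight_drop = weight_drop
        with self.name_scope():
            self.weight = self.params.get('weight', shape=(units, in_units))
            self.bias = self.params.get('bias', shape=(units,), init='zeros')

    def hybrid_forward(self, F, x, weight, bias):
        if self._weight_drop:
            # Re-sampled on every forward call; a no-op outside training mode.
            weight = F.Dropout(weight, p=self._weight_drop)
        return F.FullyConnected(x, weight, bias, num_hidden=self._units,
                                flatten=False)
```

For example, `WeightDropDense(10, in_units=20, weight_drop=0.5)` would drop a fresh subset of weight entries on each training-time forward pass, which is the behaviour weight drop is meant to have.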
I finally figured out that there is an MXNet weight-sharing bug. When weight sharing is used together with tied weights, something unexpected occurs. For this piece of code:

```python
model = nlp.model.train.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                               args.tied, args.dropout, args.weight_dropout,
                               args.dropout_h, args.dropout_i, args.dropout_e)
model_eval = nlp.model.AWDRNN(args.model, len(vocab), args.emsize, args.nhid, args.nlayers,
                              args.tied, args.dropout, args.weight_dropout,
                              args.dropout_h, args.dropout_i, args.dropout_e,
                              params=model.collect_params())
model.initialize(mx.init.Xavier(), ctx=context)
model.hybridize(static_alloc=True)
```
Even if I comment out all the `apply_weight_drop` calls in the code, the initialization error still occurs if `args.tied` is turned on. I print out some pdb info here:
```
(Pdb) model_eval.decoder[0]._params
awdrnn0_hybridsequential0_embedding0_ (
Parameter awdrnn0_hybridsequential0_embedding0_weight (shape=(33278, 400), dtype=float32)
Parameter awdrnn0_hybridsequential0_embedding0_bias (shape=(33278,), dtype=float32)
)
(Pdb) model_eval.decoder[0]._params.list_ctx()
*** RuntimeError: Parameter 'awdrnn0_hybridsequential0_embedding0_bias' has not been initialized
(Pdb) model_eval.decoder[0]._params._shared
awdrnn0_hybridsequential0_embedding0_ (
Parameter awdrnn0_hybridsequential0_embedding0_weight (shape=(33278, 400), dtype=float32)
)
(Pdb) model_eval.decoder[0]._params._shared.list_ctx()
[cpu(0)]
```
The conclusion is: no matter whether we use weight drop or not, weight tying and weight sharing cannot be used simultaneously; otherwise, there will be an initialization error. I will submit a separate issue for this in MXNet.

For the current weight-drop bug, I think my fix is simpler and more elegant. I will submit my PR very soon.
> I think we can add a weight-drop flag to the Dense/RNN/Embedding layers. In that case, we can apply dropout inside `hybrid_forward`.
It is similar to @leezu's proposal. My point is that using weight drop in `forward`/`hybrid_forward` will lead to some unexpected gradient-computation errors. So I suggest we stick to the existing implementation of `apply_weight_drop`.
Let me try that. It should be fine (otherwise we will need to fix it on the MXNet side). The autograd error you've met is due to the fact that the AG engine in MXNet does not support assignment operations.
> The key problem is: how does gradient backpropagation work if the weight is not updated? If some weights are dropped in the forward pass, they should never be updated during the current iteration. I really doubt whether the gradient computation will still work correctly with your hack.
It's the same as Dropout. The gradient will be 0 for the dropped out parts. The proposal above is correct.
> It's the same as Dropout. The gradient will be 0 for the dropped out parts. The proposal above is correct.
I will check if gradient computation works correctly or not.
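A quick way to sanity-check that claim, using plain `mx.nd.Dropout` on a toy array (a sketch, independent of AWDRNN):

```python
import mxnet as mx
from mxnet import autograd

w = mx.nd.ones((4, 3))
w.attach_grad()
with autograd.record():
    dropped = mx.nd.Dropout(w, p=0.5)   # zeroes roughly half of the entries
    loss = dropped.sum()
loss.backward()
# w.grad is 0 exactly where Dropout zeroed the entry, and 1/(1-p) = 2 elsewhere,
# so dropped weights receive no gradient in that iteration.
print(w.grad)
```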
Ok, thank you. For your convenience, let me give more complete code for the above example. If you had before

```python
def forward(self, x):
    ....
    embeddings = self.embedding(x)  # self.embedding is a gluon.nn.Embedding
    ....
```

then for weight dropout you need to

```python
def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    dropped_embedding = mx.nd.Dropout(embedding_param, 0.3)
    word_embedding = mx.nd.Embedding(x, dropped_embedding)
    ....
```
Further, you can optimize the computation by only performing dropout on the part of the weight that you are interested in:

```python
def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    word_embedding = mx.nd.Embedding(x, embedding_param)
    dropped_embedding = mx.nd.Dropout(word_embedding, 0.3)
    ....
```
@leezu Thanks for the example. I have a few concerns about your code.

```python
def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    word_embedding = mx.nd.Embedding(x, embedding_param)
    dropped_embedding = mx.nd.Dropout(word_embedding, 0.3)
    ....
```
I believe this piece of code applies dropout to the word embeddings instead of the embedding weights. The original embedding architecture looks like:

```python
embedding = nn.HybridSequential()
with embedding.name_scope():
    embedding_block = nn.Embedding(self._vocab_size, self._embed_size,
                                   weight_initializer=init.Uniform(0.1))
    if self._drop_e:
        apply_weight_drop(embedding_block, 'weight', self._drop_e, axes=(1,))
    embedding.add(embedding_block)
    if self._drop_i:
        embedding.add(nn.Dropout(self._drop_i, axes=(0,)))
return embedding
```
So I think your implementation should look like:

```python
def forward(self, x):
    ....
    embedding_param = self.collect_params()['awdrnn0_hybridsequential0_embedding0_weight'].data(CTX)
    if self._drop_e:
        embedding_param = mx.nd.Dropout(embedding_param, self._drop_e)
    word_embedding = mx.nd.Embedding(x, embedding_param)
    if self._drop_i:
        word_embedding = mx.nd.Dropout(word_embedding, self._drop_i)
    ....
```
Each time there is a dropout operation, do we have to reimplement the block structure in the forward function? That looks really impractical. And what about weight drop on RNN cells? https://github.com/dmlc/gluon-nlp/blob/434187eb5e7bb898482021ffbc9c1648d0700608/src/gluonnlp/model/utils.py#L240
Yes, the above example for applying dropout sparsely was not correct. You would need `np.unique` with `return_inverse=True` to make sure the same dropout mask is applied to repeated words. But in fact, this kind of optimization is not crucial (it is currently not done either).
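A rough sketch of that `np.unique` trick, under the assumption that only the embedding rows actually used in the batch are dropped, with one mask per distinct word (helper name and details hypothetical):

```python
import numpy as np
import mxnet as mx

def sparse_weight_drop_embedding(x, weight, p):
    # x: integer token ids (NDArray); weight: full embedding weight (NDArray).
    ids = x.asnumpy().astype('int64').reshape(-1)
    unique_ids, inverse = np.unique(ids, return_inverse=True)
    unique_idx = mx.nd.array(unique_ids, ctx=weight.context)
    inverse_idx = mx.nd.array(inverse, ctx=weight.context)
    rows = mx.nd.take(weight, unique_idx)        # rows actually used in this batch
    dropped_rows = mx.nd.Dropout(rows, p=p)      # one dropout mask per distinct word
    out = mx.nd.take(dropped_rows, inverse_idx)  # gather back to token order
    return out.reshape(x.shape + (weight.shape[1],))
```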
You're right about RNN cells. One way to avoid re-implementing the RNN cell is to add weight-drop support to the cells upstream in MXNet, as @sxjscience suggested (https://github.com/dmlc/gluon-nlp/issues/1060#issuecomment-569169629).

So for now it may be simplest to keep `apply_weight_drop` around for the RNN use-case? When redesigning parameter handling in MXNet 2, we can revisit the design of `apply_weight_drop` and simplify the weight-drop API / implementation.
Description

Parameter sharing of `WeightDropParameter` is broken. I apply weight sharing between `AWDRNN` and `train.AWDRNN` as follows:

Log Message

After I ran the above code, I got the following message:

It appeared that `model_eval` is not properly initialized. After an inspection, we found it is the `WeightDropParameter` that is not initialized.

Environment

My environment specs: