Closed liuzh47 closed 4 years ago
This issue finds a fatal problem that invalidates `valid_candidates`.
For a quick fix, we may change this section
We can change that to
valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)
for ignore_token in ignore_tokens:
    valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)
In addition, I think it will be better to move it to the preprocessing phase.
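One way to read the preprocessing suggestion is to compute the mask once per example, before batching, and feed it to the model as an extra input. The sketch below is only an illustration under assumptions: it uses plain NumPy rather than `F.np`, the token ids are made up, and `add_valid_candidates` is a hypothetical helper, not a function from the repository.

```python
import numpy as np

# Hypothetical special-token ids, chosen only for this example
# (standing in for [CLS], [SEP] and [PAD]).
IGNORE_TOKENS = (101, 102, 0)

def add_valid_candidates(example, ignore_tokens=IGNORE_TOKENS):
    """Attach a boolean mask to one preprocessed example (a dict with 'input_ids')."""
    input_ids = np.asarray(example["input_ids"])
    # np.isin marks every position holding an ignored id; invert it to get candidates.
    example["valid_candidates"] = ~np.isin(input_ids, ignore_tokens)
    return example

example = add_valid_candidates({"input_ids": [101, 7, 8, 102, 0, 0]})
print(example["valid_candidates"].astype(int))  # [0 1 1 0 0 0]
```

Doing this in preprocessing would keep the masking loop out of the model's forward pass, at the cost of carrying one more array per example.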
You cannot use minus here; some of the values may end up being negative after that. Using multiplication will solve the problem.
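The multiplication variant can be sketched as follows. This is a minimal illustration in plain NumPy rather than `F.np`, and the token ids and the helper name `build_valid_candidates` are hypothetical. Multiplying by `not_equal` keeps every entry in {0, 1}. With an integer mask, subtraction leaves that range as soon as one position is subtracted twice, for example when an id is listed twice in `ignore_tokens`. In plain NumPy the `-` operator is not defined for boolean arrays at all.

```python
import numpy as np

# Hypothetical special-token ids, chosen only for this example.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def build_valid_candidates(input_ids, ignore_tokens):
    """Return a boolean mask that is False at every ignored token."""
    valid_candidates = np.ones_like(input_ids, dtype=bool)
    for ignore_token in ignore_tokens:
        # Multiplying boolean arrays acts as a logical AND, so values stay boolean.
        valid_candidates = valid_candidates * np.not_equal(input_ids, ignore_token)
    return valid_candidates

input_ids = np.array([[CLS_ID, 7, 8, SEP_ID, PAD_ID, PAD_ID]])
mask = build_valid_candidates(input_ids, [CLS_ID, SEP_ID, PAD_ID])
print(mask.astype(int))  # [[0 1 1 0 0 0]]

# Subtraction on an integer mask: a duplicated id pushes entries below zero.
int_mask = np.ones_like(input_ids)
for ignore_token in [PAD_ID, PAD_ID]:
    int_mask = int_mask - np.equal(input_ids, ignore_token)
print(int_mask)  # the two PAD positions are now -1
```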
Description
`valid_candidates` is used to mark the non-reserved tokens in the sequence in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L503. For example, for a sequence like

The corresponding `valid_candidates` tokens should be like:

In short, `valid_candidates` masks out tokens like `[CLS]`, `[SEP]` and `[PAD]`. The current implementation of `valid_candidates` is wrong: it always outputs sequences of all `1`s.

The problem is that the initialization of `valid_candidates` is wrong, as in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L497. `valid_candidates` is initialized to all 1s, and the subsequent operations never change its value.

Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: