[Script] Valid sequence length used in Electra dynamic masking

liuzh47 commented 4 years ago

Description

valid_candidates is used to mark the non-reserved tokens in the sequence, see https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L503. For example, for a sequence like

[CLS] Manhattan is the core of New York City.[SEP][PAD][PAD][PAD]

The corresponding valid_candidates mask should look like:

01111111110000
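
For illustration, here is a minimal sketch of how such a mask could be computed. It uses plain NumPy rather than the `F.np` namespace from the script, and the token ids are hypothetical values picked for the example, not the real vocabulary ids.

```python
import numpy as np

# Hypothetical token ids, chosen only for illustration;
# real ids come from the tokenizer vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

# [CLS] Manhattan is the core of New York City . [SEP] [PAD] [PAD] [PAD]
input_ids = np.array(
    [CLS_ID, 11, 12, 13, 14, 15, 16, 17, 18, 19, SEP_ID, PAD_ID, PAD_ID, PAD_ID]
)

ignore_tokens = [CLS_ID, SEP_ID, PAD_ID]

# A position is a valid candidate when its id is not one of the reserved ids.
valid_candidates = ~np.isin(input_ids, ignore_tokens)

mask_string = "".join(str(int(v)) for v in valid_candidates)
print(mask_string)  # 01111111110000
```

`np.isin` is used here only because it is compact. The script itself builds the mask in a loop over `ignore_tokens`.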

In short, valid_candidates masks out tokens like [CLS], [SEP] and [PAD]. The current implementation of valid_candidates is wrong: it always outputs a sequence of all 1s.

The problem is that the initialization of valid_candidates is wrong, as in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L497

 valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)

valid_candidates is initialized to all 1s, and the subsequent operations never change that value.

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

zheyuye commented 4 years ago

This issue identifies a fatal problem that makes valid_candidates invalid.

sxjscience commented 4 years ago

For a quick fix, we may change this section

https://github.com/dmlc/gluon-nlp/blob/970318d5fcafd48843abdbaeb7aab0fc1061c901/scripts/pretraining/pretraining_utils.py#L497-L503

We can change that to

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool) 
for ignore_token in ignore_tokens: 
    valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)

In addition, I think it will be better to move it to the preprocessing phase.

liuzh47 commented 4 years ago

> For a quick fix, we may change this section
>
> https://github.com/dmlc/gluon-nlp/blob/970318d5fcafd48843abdbaeb7aab0fc1061c901/scripts/pretraining/pretraining_utils.py#L497-L503
>
> We can change that to
>
> valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)
> for ignore_token in ignore_tokens:
>     valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)
>
> In addition, I think it will be better to move it to the preprocessing phase.

You cannot use subtraction here: some of the values may end up negative after that. Using multiplication instead will solve the problem.
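
A small sketch of the difference, using plain NumPy with integer masks as a stand-in for `F.np` (an assumption, since the snippet in the thread uses a bool dtype). The token ids are hypothetical, and the duplicated id in `ignore_tokens` is deliberate: it is one way, assumed here for illustration, that a position could be matched more than once.

```python
import numpy as np

# Hypothetical ids; PAD_ID is listed twice on purpose to show the failure mode.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
input_ids = np.array([CLS_ID, 11, 12, SEP_ID, PAD_ID, PAD_ID])
ignore_tokens = [CLS_ID, SEP_ID, PAD_ID, PAD_ID]

# Subtraction: every match lowers the value by one,
# so a position matched twice drops below zero.
by_subtraction = np.ones_like(input_ids, dtype=np.int32)
for ignore_token in ignore_tokens:
    by_subtraction = by_subtraction - np.equal(input_ids, ignore_token).astype(np.int32)

# Multiplication: every match multiplies by zero,
# so values always stay in {0, 1}.
by_multiplication = np.ones_like(input_ids, dtype=np.int32)
for ignore_token in ignore_tokens:
    by_multiplication = by_multiplication * np.not_equal(input_ids, ignore_token).astype(np.int32)

print(by_subtraction.tolist())     # [0, 1, 1, 0, -1, -1]
print(by_multiplication.tolist())  # [0, 1, 1, 0, 0, 0]
```

Once a position has been zeroed, multiplication keeps it at zero no matter how many further matches occur, which is why it is the safer choice for building the mask.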

dmlc / gluon-nlp

[Script] Valid sequence length used in Electra dynamic masking #1321
