liuzh47 closed this issue 4 years ago.
I think the intention is to use topk(logp + Gumbel) to mimic sampling without replacement from the categorical distribution.
log p is right.
Following https://github.com/google-research/electra/issues/41, `F.npx.topk(sample_probs + gumbels)` was used to avoid duplicated samples. During pre-training, all the training corpora were processed into sentences of length `max_seq_length` (usually 512), so in most cases there is no padding at the end of a sentence. It was therefore my negligence that the issue you pointed out above was not covered.
Basically, you should use log p instead of p when applying the Gumbel trick.
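To make the point concrete, here is a minimal NumPy sketch of the Gumbel-top-k trick (the function name `gumbel_topk` is hypothetical, not the GluonNLP code): taking the top-k of log p plus independent Gumbel(0, 1) noise yields k distinct indices distributed as sampling without replacement from Categorical(p), whereas adding the noise to p itself does not.

```python
import numpy as np

def gumbel_topk(logp, k, rng):
    """Sample k distinct indices via the Gumbel-top-k trick.

    logp: 1-D array of log-probabilities.
    """
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbels = -np.log(-np.log(rng.uniform(size=logp.shape)))
    # Descending sort of perturbed log-probs; keep the top k indices.
    return np.argsort(logp + gumbels)[::-1][:k]

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
idx = gumbel_topk(np.log(p), 2, rng)
# The k selected indices are always distinct, unlike k independent draws.
assert len(set(idx.tolist())) == 2
```

Because the Gumbel noise is added to log p, exponentiating back recovers the original categorical probabilities; adding the same noise to raw probabilities p would distort the sampling distribution.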
From: Zheyu Ye (notifications@github.com)
Sent: Friday, August 28, 2020 2:46:03 AM
Subject: Re: [dmlc/gluon-nlp] Dynamic masked position sampling of Electra (#1324)
Description
The implementation of GluonNLP Electra uses a Gumbel variable for dynamic mask sampling, as below.
I am curious why the implementation uses a Gumbel variable here, since it may introduce errors in sampling. For example, in one of my runs of the dynamic masking, I find that
sample_probs[10]
is:

In this case, only positions 1 to 13 should be considered as masked position candidates. However,
masked_positions
is:

Clearly, indexes like 27 or 100 should not be considered masked position candidates.
Also, for each sequence, a separate masked position length should be considered.
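One way to respect each sequence's own length is to push the scores of padded positions to negative infinity before taking the top-k, so they can never be selected. A hypothetical NumPy sketch (the names `sample_masked_positions` and `valid_length` are assumptions for illustration, not the GluonNLP API):

```python
import numpy as np

def sample_masked_positions(logp, valid_length, k, rng):
    """Gumbel-top-k over (batch, seq_len) log-probs, restricted to
    each sequence's valid (non-padded) positions.

    logp: (batch, seq_len) log-probabilities.
    valid_length: (batch,) number of real tokens per sequence.
    """
    batch, seq_len = logp.shape
    gumbels = -np.log(-np.log(rng.uniform(size=logp.shape)))
    scores = logp + gumbels
    # Positions at or beyond valid_length get -inf, so the top-k
    # can only pick candidates inside each sequence.
    pad_mask = np.arange(seq_len)[None, :] >= valid_length[:, None]
    scores = np.where(pad_mask, -np.inf, scores)
    # Descending sort per row; keep the first k indices.
    return np.argsort(scores, axis=1)[:, ::-1][:, :k]
```

A per-sequence number of masks (e.g. proportional to `valid_length`) would additionally require a different k per row, which a fixed top-k cannot express directly; one option is to take the largest k and then discard the tail per sequence.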
For completeness, I also paste the
Gumbel[10]
here:

Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: