liuzh47 closed this issue 4 years ago.
I think the intention is to use topk(logp + Gumbel) to mimic sampling without replacement from the categorical distribution.
log p is right.
Following https://github.com/google-research/electra/issues/41, `F.npx.topk(sample_probs + gumbels)` was used to avoid duplicated samples. During pre-training, all the training corpora were processed into sentences of length `max_seq_length` (usually 512), so in most cases there is no padding at the end of a sentence. It was therefore my negligence that the issue you pointed out above was not covered.
Basically, you should use log p instead of p when applying the Gumbel trick.
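To make the point concrete, here is a minimal NumPy sketch of the Gumbel-top-k trick (the function name `gumbel_topk` is hypothetical, not the GluonNLP code): taking the top-k of log p plus independent Gumbel(0, 1) noise yields k distinct indices distributed as sampling without replacement from Categorical(p), whereas adding the noise to p itself does not.

```python
import numpy as np

def gumbel_topk(logp, k, rng):
    """Sample k distinct indices via the Gumbel-top-k trick.

    logp: 1-D array of log-probabilities.
    """
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbels = -np.log(-np.log(rng.uniform(size=logp.shape)))
    # Descending sort of perturbed log-probs; keep the top k indices.
    return np.argsort(logp + gumbels)[::-1][:k]

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
idx = gumbel_topk(np.log(p), 2, rng)
# The k selected indices are always distinct, unlike k independent draws.
assert len(set(idx.tolist())) == 2
```

Because the Gumbel noise is added to log p, exponentiating back recovers the original categorical probabilities; adding the same noise to raw probabilities p would distort the sampling distribution.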
From: Zheyu Ye (notifications@github.com)
Sent: Friday, August 28, 2020 2:46:03 AM
Subject: Re: [dmlc/gluon-nlp] Dynamic masked position sampling of Electra (#1324)
Description
The implementation of GluonNLP Electra uses a Gumbel variable for dynamic mask sampling, as below.
I am curious why the implementation uses a Gumbel variable here, since it may introduce errors in sampling. For example, in one of my runs of the dynamic masking, I find that
sample_probs[10]
is:

In this case, only positions 1 to 13 should be considered as masked position candidates. However,
masked_positions
is:

Clearly, indexes like 27 or 100 should not be considered masked position candidates.
Also, for each sequence, a separate masked position length should be considered.
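One way to respect each sequence's own length is to push the scores of padded positions to negative infinity before taking the top-k, so they can never be selected. A hypothetical NumPy sketch (the names `sample_masked_positions` and `valid_length` are assumptions for illustration, not the GluonNLP API):

```python
import numpy as np

def sample_masked_positions(logp, valid_length, k, rng):
    """Gumbel-top-k over (batch, seq_len) log-probs, restricted to
    each sequence's valid (non-padded) positions.

    logp: (batch, seq_len) log-probabilities.
    valid_length: (batch,) number of real tokens per sequence.
    """
    batch, seq_len = logp.shape
    gumbels = -np.log(-np.log(rng.uniform(size=logp.shape)))
    scores = logp + gumbels
    # Positions at or beyond valid_length get -inf, so the top-k
    # can only pick candidates inside each sequence.
    pad_mask = np.arange(seq_len)[None, :] >= valid_length[:, None]
    scores = np.where(pad_mask, -np.inf, scores)
    # Descending sort per row; keep the first k indices.
    return np.argsort(scores, axis=1)[:, ::-1][:, :k]
```

A per-sequence number of masks (e.g. proportional to `valid_length`) would additionally require a different k per row, which a fixed top-k cannot express directly; one option is to take the largest k and then discard the tail per sequence.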
For completeness, I also paste the
Gumbel[10]
here:

Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: