Closed AlvL1225 closed 8 months ago
Thanks! The correct approach is to subtract the two distributions rather than adjusting the values of the rejected elements. We have already updated the non-greedy code accordingly. Because sampling without replacement is performed here, a mask is used to adjust the draft distribution. The rest is consistent with the code in the screenshot you provided.
All the experimental results we provided were under the greedy decoding setting and are not affected.
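For readers following along, here is a minimal sketch of what "subtracting the two distributions" usually means in speculative sampling: after a rejection, the next token is drawn from the normalized positive part of the target distribution minus the draft distribution, computed over the whole vocabulary. This is not the repository's code. The function name `residual_distribution` and the toy probabilities are hypothetical, plain Python lists stand in for tensors, and the mask used for sampling without replacement is not shown.

```python
def residual_distribution(target_probs, draft_probs):
    """Return norm(max(0, p - q)), computed elementwise over the vocabulary.

    target_probs (p) and draft_probs (q) are assumed to be same-length
    sequences of probabilities; both names are illustrative only.
    """
    # Subtract at every vocabulary position, not only at the rejected token.
    diff = [max(p - q, 0.0) for p, q in zip(target_probs, draft_probs)]
    total = sum(diff)
    if total == 0.0:
        # p and q coincide, so there is no residual mass; fall back to p.
        return list(target_probs)
    return [d / total for d in diff]


# Toy example with a 3-token vocabulary (values are made up).
p = [0.5, 0.3, 0.2]  # target model
q = [0.2, 0.7, 0.1]  # draft model
print(residual_distribution(p, q))  # roughly [0.75, 0.0, 0.25]
```

Note that token 1, where the draft model is overconfident, ends up with zero probability, while the remaining mass is shared in proportion to how much the target exceeds the draft.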
However, in other implementations, such as GPT-fast or lucidrains' implementation, shouldn't the probability difference (GTP - Q) be subtracted elementwise, rather than only at the rejected element?