facebookresearch / unlikelihood_training

Neural Text Generation with Unlikelihood Training

Possible bug in the token-level unlikelihood training loss #11

Open gmftbyGMFTBY opened 1 year ago

gmftbyGMFTBY commented 1 year ago

Hello, thank you for your wonderful work!

After carefully analyzing the token-level unlikelihood training loss, I think the batched version of the loss in this repo is different from the one defined in the paper.

In your paper, the negative candidates for a token should come only from the preceding context of that token within the same sequence.
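For reference, here is my transcription of the token-level objective from the paper (please correct me if I misread it):

$$
\mathcal{L}^{t}_{\text{UL-token}}\bigl(p_\theta(\cdot \mid x_{<t}),\, \mathcal{C}^t\bigr)
= -\sum_{c \in \mathcal{C}^t} \log\bigl(1 - p_\theta(c \mid x_{<t})\bigr),
\qquad
\mathcal{C}^t = \{x_1, \dots, x_{t-1}\} \setminus \{x_t\}.
$$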

But in your code, I notice that you simply flatten all the tokens in a batch (which may contain N samples): https://github.com/facebookresearch/unlikelihood_training/blob/main/custom/candidate_penalty_ce_loss.py#L55

If the batch size is 1, the code is consistent with the definition in the paper. But if the batch size is larger than 1, then for any sample i > 0 the negative candidates contain not only the previous tokens within sample i but also all the tokens from the samples that precede it in the flattened batch. In this case, the set of negative candidates is much larger than intended.
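To make the difference concrete, here is a rough sketch of the per-sample version I have in mind (this is my own code written for illustration, not a patch against your file; tensor names like `ctx_cands` are just my choices):

```python
import torch


def per_sample_token_unlikelihood(lprobs, target, padding_idx=1):
    """Token-level unlikelihood term where the negative candidates C^t are
    taken only from the previous tokens of the *same* sequence, minus the
    current gold token.

    lprobs: (batch, seq_len, vocab) log-probabilities from the model
    target: (batch, seq_len) gold next tokens (int64)
    """
    batch, seq_len, vocab = lprobs.size()

    # ctx_cands[b, t, k] = target[b, k]: every token of the same sample is a
    # potential candidate for position t.
    ctx_cands = target.unsqueeze(1).expand(batch, seq_len, seq_len)

    # Keep only previous positions k < t; everything else becomes padding.
    pos = torch.arange(seq_len, device=target.device)
    prev_mask = pos.unsqueeze(1) > pos.unsqueeze(0)  # prev_mask[t, k] = (k < t)
    ctx_cands = ctx_cands.masked_fill(~prev_mask, padding_idx)

    # Drop the current gold token from its own candidate set.
    ctx_cands = ctx_cands.masked_fill(ctx_cands == target.unsqueeze(-1), padding_idx)

    # Scatter the candidates into a {0, 1} mask over the vocabulary.
    negative_targets = torch.zeros(batch, seq_len, vocab, device=lprobs.device)
    negative_targets.scatter_(2, ctx_cands, 1.0)
    negative_targets[:, :, padding_idx] = 0.0  # never penalize the pad token

    # -sum_{c in C^t} log(1 - p(c | x_<t)); the clamp avoids log(0).
    one_minus_probs = torch.clamp(1.0 - lprobs.exp(), min=1e-5)
    return -(torch.log(one_minus_probs) * negative_targets).sum()
```

If `target` were first flattened to shape `(1, batch * seq_len)` before building `ctx_cands`, this would reduce to the behaviour I see in the current code, where tokens from earlier samples in the batch also act as negatives.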

Am I right? Looking forward to your response.

Sincerely.

Tian Lan

ashutoshbsathe commented 1 year ago

This observation is correct. The model uses far more negative tokens than it should. Fixing the issue so that negative candidates come only from the i-th sample gives fairly modest improvements in my experience. I trained with the (token-level) unlikelihood objective on WikiText-103 with sequence length 256 and batch size 256 for 40k steps. To study text quality, I sample a prefix of length 32 from the test set and generate a continuation of up to 128 tokens.

| Model | ppl ($\downarrow$) | seq-rep-4 ($\downarrow$) | uniq ($\uparrow$) | mauve ($\uparrow$) |
| --- | --- | --- | --- | --- |
| MLE | 18.87 | 0.554 | 11.5k | 0.956 |
| UL (repo) | 19.76 | 0.216 | 15.4k | 0.988 |
| UL (corrected) | 19.35 | 0.406 | 13.7k | 0.961 |
| Human (from paper) | - | 0.005 | 18.9k | 1.000 |
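For completeness, this is roughly how I compute seq-rep-4 and uniq (my own helper code, so the exact tokenization may differ from the paper's scripts; mauve is computed separately with the metric authors' library):

```python
def seq_rep_n(token_ids, n=4):
    """Fraction of duplicated n-grams in one continuation (seq-rep-4 for n=4)."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def uniq_tokens(continuations):
    """Number of unique tokens used across all generated continuations."""
    return len({tok for cont in continuations for tok in cont})


# Toy usage: average seq-rep-4 over continuations, plus the uniq count.
continuations = [[5, 6, 7, 5, 6, 7, 5, 6], [9, 10, 11, 12]]
avg_seq_rep_4 = sum(seq_rep_n(c, 4) for c in continuations) / len(continuations)
print(avg_seq_rep_4, uniq_tokens(continuations))
```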

Interestingly, using fewer negatives resulted in better perplexity than the repo version. The paper doesn't really go in-depth on perplexity either, so I'm not sure which method (repo or corrected) is better. I think the objective is still worth using, but the gains in generation quality may not be as large as you might expect.