Maybe the BUG of the token-level unlikelyhood training loss

facebookresearch / unlikelihood_training

Neural Text Generation with Unlikelihood Training

Other

310 stars 45 forks source link

Hello, thank you for your wonderful work!

After carefully analyzing the token-level unlikelihood training loss, I think the batch-version unlikelihood training loss is different from the one defined in the paper.

In your paper, the negative candidates should be the context of the current token: unlikelyhood

But in your code, I notice that you simply flat all the tokens in a batch (may consist of N samples): https://github.com/facebookresearch/unlikelihood_training/blob/main/custom/candidate_penalty_ce_loss.py#L55

If the batch size if 1, the code is consistent with the definition in the paper. But if the batch size is larger than 1, the negative candidate of the sample, for example, the sample i>0, its negative candidates not only contains the previous tokens in sample i but also contains all the tokens in previous samples j<=i. Thus, in this case, the negative candidates are much larger.

Am I right? Looking forward to your response.

Sincerely.

Tian Lan

Model	ppl ($\downarrow$)	seq-rep-4 ($\downarrow$)	uniq ($\uparrow$)	mauve ($\uparrow$)
MLE	18.87	0.554	11.5k	0.956
UL (repo)	19.76	0.216	15.4k	0.988
UL (corrected)	19.35	0.406	13.7k	0.961
Human (from paper)	-	0.005	18.9k	1.000

Model

ppl ($\downarrow$)

seq-rep-4 ($\downarrow$)

uniq ($\uparrow$)

mauve ($\uparrow$)

MLE

18.87

0.554

11.5k

0.956

UL (repo)

19.76

0.216

15.4k

0.988

UL (corrected)

19.35

0.406

13.7k

0.961

Human (from paper)

0.005

18.9k

1.000

facebookresearch / unlikelihood_training

Maybe the BUG of the token-level unlikelyhood training loss #11