bloomberg / MixCE-acl2023

Implementation of MixCE method described in ACL 2023 paper by Zhang et al.
Apache License 2.0

Question Regarding the Maximum Length for the MAUVE Evaluation #2

Open chenyangh opened 1 month ago

chenyangh commented 1 month ago

Hello,

I have a few questions that came up while replicating the numbers using the provided checkpoints.

  1. What max_length was used for Tables 2 and 3?

    So far, I have only tested the WikiText checkpoint trained with MLE. The MAUVE scores I get differ noticeably from those in the tables and depend heavily on the max_length used for evaluation. In addition, the generated samples are much shorter than the human references. My settings: the maximum generation length is 512 + prompt_len, and top-p is 0.9.

  2. How do we solve the "ERROR: Can't get enough samples!" error when evaluating c-MAUVE?
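
For concreteness, the sampling settings in item 1 can be sketched as below. This is only an illustration and not the repository's code: `nucleus_filter` and `max_total_length` are hypothetical helper names, and the toy distribution is made up. It shows what top-p = 0.9 does to a next-token distribution, and that p = 1.0 leaves the distribution unchanged.

```python
def max_total_length(prompt_len, max_new_tokens=512):
    """Total sequence budget when generating up to max_new_tokens after the prompt."""
    return prompt_len + max_new_tokens


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p.

    Returns a renormalized distribution of the same length as `probs`;
    tokens outside the nucleus get probability 0.
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")

    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the kept mass so the result sums to 1.
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


# Toy next-token distribution (hypothetical values).
toy_probs = [0.5, 0.3, 0.15, 0.05]
truncated = nucleus_filter(toy_probs, top_p=0.9)  # drops the least likely token
unchanged = nucleus_filter(toy_probs, top_p=1.0)  # keeps every token
```

With p below 1 the tail of the distribution is cut off, so samples are no longer drawn from the model's own distribution.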

Thanks for making the code public.

ZhangShiyue commented 1 week ago

Hi, thanks for reaching out! Sorry for the delay.

  1. The default max length is 512, and we analyze the impact of different max lengths in Figure 4 in the Appendix. Note that for Table 2, our top-p is always 1, i.e., unbiased sampling. For Table 3, the numbers correspond to the best p reported in the table.

  2. As shown in Table 8 and explained in Appendix D.4, for WikiText we can only compute c-mauve_100 because we could not get 10K 200-token fragments from WikiText.
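
To illustrate why a longer fragment length can run out of samples, here is a minimal sketch. It is not the repository's implementation: `collect_fragments` is a hypothetical helper, and it assumes one fragment is taken per text by keeping the first `fragment_len` tokens, which may differ from the actual procedure. Texts shorter than the fragment length contribute nothing, so raising the length from 100 to 200 shrinks the pool.

```python
def collect_fragments(token_lists, fragment_len, num_needed):
    """Collect `num_needed` fixed-length fragments, one per sufficiently long text.

    Assumption (not taken from the repository): a fragment is the first
    `fragment_len` tokens of a text, and shorter texts are skipped.
    """
    fragments = []
    for tokens in token_lists:
        if len(fragments) == num_needed:
            break
        if len(tokens) >= fragment_len:
            fragments.append(tokens[:fragment_len])

    if len(fragments) < num_needed:
        # Analogous in spirit to the "Can't get enough samples!" error.
        raise ValueError(
            f"only {len(fragments)} fragments of length {fragment_len}, "
            f"need {num_needed}"
        )
    return fragments


# Toy corpus: token id lists of lengths 50, 120, 250, 99, 100 (made-up data).
toy_corpus = [list(range(n)) for n in (50, 120, 250, 99, 100)]

# Three texts have at least 100 tokens, so this succeeds.
fragments_100 = collect_fragments(toy_corpus, fragment_len=100, num_needed=3)

# Only one text has at least 200 tokens, so asking for two fails.
try:
    collect_fragments(toy_corpus, fragment_len=200, num_needed=2)
    enough_200 = True
except ValueError:
    enough_200 = False
```

Under this assumption, the only ways around the error are a shorter fragment length or a smaller number of required samples.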

Hope this helps!