Still can't reconcile the data differences. If I train PrefixTuning with the original dataset (from HuggingFace), my results are 3 percentage points lower than stated in the paper. However, if I use the data in the xsum_news directory from CodaLab, then it's 1 percentage point lower. I have added a screenshot of the dataset sizes here:
Hi Jonathan, I think this is not a bug. We had some extrapolation experiments (testing out-of-distribution performance) where we split the xsum dataset differently: xsum_news is the dataset for the extrapolation experiment. Specifically, the training data contains {world, uk, business} news and the test data contains other news categories (e.g., health, tech). But nice catch!
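(For illustration only, here is a minimal sketch of what such a topic-based split could look like. The id_to_topic mapping is hypothetical, since the released xsum dataset carries no topic labels; this is not the authors' actual preprocessing script.)
from datasets import load_dataset

# Hypothetical lookup from article id to its BBC news section,
# e.g. {"29911712": "world"}; not shipped with the HuggingFace dataset.
id_to_topic = {}

TRAIN_TOPICS = {"world", "uk", "business"}  # in-distribution sections

xsum = load_dataset("xsum", split="train")

# Articles from the three training sections form the in-distribution set;
# everything else (health, tech, ...) is held out for the extrapolation test.
in_domain = xsum.filter(lambda ex: id_to_topic.get(ex["id"]) in TRAIN_TOPICS)
held_out = xsum.filter(lambda ex: id_to_topic.get(ex["id"]) not in TRAIN_TOPICS)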
For the original xsum dataset, I also tuned the length penalty --length_pen, and setting it to 0.8 improved the performance in my case.
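(As a side note, length penalty is a standard argument of the HuggingFace generate API; a minimal decoding sketch, with an illustrative beam size and max length rather than the repo's exact config:)
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

inputs = tokenizer("Some news article text...", return_tensors="pt",
                   truncation=True, max_length=512)

# length_penalty < 1.0 nudges beam search toward shorter summaries;
# this corresponds to the --length_pen 0.8 flag discussed above.
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=6,
    length_penalty=0.8,
    max_length=60,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))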
FYI, here is a screenshot of the last couple of epochs' dev scores, and my model name is xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512, from which I believe you can infer all the hyperparameter settings.
python finetune.py \
    --model_name_or_path facebook/bart-large \
    --output_dir xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 \
    --data_dir xsum --tuning_mode prefixtune \
    --optim_prefix yes --preseqlen 200 --prefix_mode activation --format_mode cat \
    --use_deep no --mid_dim 512 --use_dropout no --prefix_dropout 0.0 \
    --do_train --gpus 1 --num_train_epochs 30 --seed 101 \
    --learning_rate 5e-05 --weight_decay 0.0 --label_smoothing 0.0 \
    --train_batch_size 16 --eval_batch_size 16 --gradient_accumulation_steps 1 \
    --max_source_length 512 --max_target_length 60 \
    --val_max_target_length 60 --test_max_target_length 100 \
    --fp16 --fp16_opt_level O1
Yes, that's correct. I can confirm that I get almost the same results with a length_pen of 0.8 and with your altered split of xsum. However, on the original xsum, prefix tuning performs 2-3 ROUGE points lower. Just a note here: in Table 2 of your paper, you compared your model trained on your altered split of xsum against a fully finetuned BART-large on the original xsum...
Emmm, actually the result in Table 2 comes from the prefix model trained on the original xsum dataset (the HuggingFace one; let's denote it "xsum"), and the "within news" result in Table 3 comes from the prefix model trained on my new split, denoted "xsum_news". You said you can replicate the test results (e.g., 20.93) using the xsum_news split? I find this quite surprising, because xsum_news is supposed to be a harder data split than xsum.
Yes, I got really high scores on the OOD split ("xsum_news", i.e., the Table 3 "within-news" results). Note that on that split, a fully fine-tuned BART-large is also much higher for me, and slightly better than prefix tuning.
Hmm, this is interesting. Could you share your scripts and/or hyperparameters? Also, just to confirm: you are evaluating on the test set, not the dev set, right?
I was trying to replicate your validation set results above. I will try on test now.
Ohh, well this makes sense: the validation set is ID (in-distribution).
Oh I see. It's still weird, though: I get the same val results as you above on xsum_news, but I get 2-3 percentage points lower on xsum_original on val. I think that if I just run everything on test, I will get scores similar to those in your paper; on xsum_original, my val set scores were always lower than my test set scores. I think you kind of threw a curveball at me, since the validation scores posted above are meant to be for xsum_news when I was originally asking about xsum_original :)
I can double-check by running my test scripts on both datasets.
Yeah, sorry for the confusion. The CodaLab result was on xsum_news (I didn't actually realize that until you pointed it out, many thanks for that!!! I was running more OOD experiments later, which changed my default...), but the screenshot I pasted above is on xsum_original.
@XiangLi1999 Hi, still the same issue. I used the hyperparameters you mentioned here:
python train_bart.py --mode xsum --preseqlen 200 --do_train yes --fp16 yes --bsz 16 --epoch 30 --gradient_accumulation_step 1 --learning_rate 0.00005 --mid_dim 512 --n_gpu 1
which constructs the output directory xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 and launches:
python finetune.py \
    --model_name_or_path facebook/bart-large \
    --output_dir xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 \
    --data_dir xsum --tuning_mode prefixtune \
    --optim_prefix yes --preseqlen 200 --prefix_mode activation --format_mode cat \
    --use_deep no --mid_dim 512 --use_dropout no --prefix_dropout 0.0 \
    --do_train --gpus 1 --num_train_epochs 30 --seed 101 \
    --learning_rate 5e-05 --weight_decay 0.0 --label_smoothing 0.0 \
    --train_batch_size 16 --eval_batch_size 16 --gradient_accumulation_steps 1 \
    --max_source_length 512 --max_target_length 60 \
    --val_max_target_length 60 --test_max_target_length 100 \
    --fp16 --fp16_opt_level O1
Still, I got 1-2 ROUGE points lower on the xsum_original val set. However, my val loss almost matches the screenshot you posted. Here is my metrics.json:
{
"val_avg_loss": 1.6198463439941406,
"val_avg_rouge1": 42.233881249999996,
"val_avg_rouge2": 19.36146875,
"val_avg_rougeL": 34.15171875,
"val_avg_gen_time": 0.15285185538232327,
"val_avg_gen_len": 35.65625,
"step_count": 27
},
{
"val_avg_loss": 1.6255970001220703,
"val_avg_rouge1": 42.20043125,
"val_avg_rouge2": 19.1729125,
"val_avg_rougeL": 34.023271875,
"val_avg_gen_time": 0.14824258256703615,
"val_avg_gen_len": 35.90625,
"step_count": 28
},
{
"val_avg_loss": 1.6247661113739014,
"val_avg_rouge1": 42.71155,
"val_avg_rouge2": 19.746046874999998,
"val_avg_rougeL": 34.4363625,
"val_avg_gen_time": 0.14882883243262768,
"val_avg_gen_len": 35.6875,
"step_count": 29
},
{
"val_avg_loss": 1.6258924007415771,
"val_avg_rouge1": 42.674909375,
"val_avg_rouge2": 19.538184375,
"val_avg_rougeL": 34.334109375,
"val_avg_gen_time": 0.14808641048148274,
"val_avg_gen_len": 35.28125,
"step_count": 30
},
{
"val_avg_loss": 1.6260249614715576,
"val_avg_rouge1": 42.778821875000006,
"val_avg_rouge2": 19.570328125,
"val_avg_rougeL": 34.363040625000004,
"val_avg_gen_time": 0.1509597236290574,
"val_avg_gen_len": 36.25,
"step_count": 31
}
Any suggestions?
Besides, have you ever tried training this model on multiple GPUs? On multiple GPUs, the performance is 2-3 points lower than what you posted (using lr=0.00014). Can you give a set of hyperparameters suitable for DDP?
"I got 1-2 ROUGE points lower on the xsum_original val set. However, my val loss almost matches the screenshot you posted."
I am a bit confused by the above sentence. The screenshot I posted is on the xsum_original val set, so you are around 0.5 off. (Where does the "1-2 ROUGE points lower" come from?)
I never tried this code in DDP settings (due to resource constraints, sadly). I am not 100% sure, but I guess for DDP you don't need to scale the lr down; I assume DDP would handle this automatically, so the lr should still be 0.00005.
Sorry, "1-2 ROUGE points lower" is compared to Table 2 of the paper, PREFIX(2%); the gaps are 1, 1.3, and 1.6 respectively. I don't know if these two experiments are comparable? Yes, compared to the screenshot you posted, the gap is about 0.5. So I guess the performance you report in Table 2 is on the test set, and prefixlen=200 means PREFIX(2%)?
I notice that my gen_len is longer than the one in your post by 1. Should I add --length_pen 0.8 to improve the performance?
I've tried keeping lr=0.00005 but cannot replicate the performance on multiple GPUs. As far as I can tell, if I use 8 GPUs the effective batch size is 8 times the original, so the lr should be increased? Also, I don't know if DDP does this automatically. Anyway, I'll try some other hyperparameters; if I find a good set, I'll post it. (A quick sketch of the usual scaling heuristics follows below.)
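(For reference, a back-of-the-envelope sketch of two common heuristics, the linear scaling rule of Goyal et al., 2017 and square-root scaling. Whether either applies to prefix tuning here is an open question, so treat the numbers as starting points; notably, square-root scaling of 5e-5 over 8 GPUs gives roughly the 0.00014 mentioned above.)
import math

base_lr = 5e-5        # single-GPU learning rate from the script above
per_gpu_batch = 16
n_gpus = 8
grad_accum = 1

# Effective batch size grows linearly with the number of data-parallel workers.
effective_batch = per_gpu_batch * n_gpus * grad_accum   # 128
scale = n_gpus * grad_accum                             # 8x the single-GPU batch

linear_lr = base_lr * scale            # 4.0e-4 (linear scaling rule)
sqrt_lr = base_lr * math.sqrt(scale)   # ~1.41e-4 (square-root scaling rule)

print(f"effective batch: {effective_batch}")
print(f"linear-scaled lr: {linear_lr:.2e}, sqrt-scaled lr: {sqrt_lr:.2e}")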
I think if you get dev performance that matches the screenshot, you will get test performance matching Table 2. The two should be correlated, but not exactly the same: one is on dev and one is on test. I used a length penalty of 0.8 plus a slightly different ROUGE evaluation to get deterministic scores. The evaluation script just makes the result deterministic; it doesn't change the numbers much:
from collections import defaultdict

import numpy as np
from rouge_score import rouge_scorer, scoring

ROUGE_KEYS = ["rouge1", "rouge2", "rougeL"]

def calculate_rouge(output_lns, reference_lns, use_stemmer=True):
    scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=use_stemmer)
    aggregator = scoring.BootstrapAggregator()
    # Plain per-example average, kept alongside the (sampled) bootstrap
    # aggregate so that the reported score is deterministic.
    my_deterministic_avg = defaultdict(list)
    for reference_ln, output_ln in zip(reference_lns, output_lns):
        scores = scorer.score(reference_ln, output_ln)
        for key, val in scores.items():
            my_deterministic_avg[key].append(val.fmeasure)
        aggregator.add_scores(scores)
    result = aggregator.aggregate()
    print(result)
    print("my deterministic avg:")
    for k, v in my_deterministic_avg.items():
        print("{}={}".format(k, round(np.array(v).mean() * 100, 3)))
    return {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
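(A hypothetical quick check of the function above, not from the repo:)
preds = ["the cat sat on the mat", "heavy rain is expected tomorrow"]
refs = ["a cat was sitting on the mat", "forecasters expect heavy rain tomorrow"]
# Prints both the bootstrap aggregate and the deterministic per-example average,
# then returns the bootstrap mid F1 scores.
print(calculate_rouge(preds, refs))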
Yes, prefixlen=200 means Prefix(2%).
Yes, adding --length_pen 0.8 at decoding time would probably improve the performance.
I think the effective batch size will be 8 times bigger, but I don't have a good intuition for how that relates to the learning rate. Intuitively, I'd guess keeping the old learning rate, or slightly increasing it (e.g., to 0.00008), would still be fine, so I am not sure why it doesn't work for you...
I see. Thanks for your reply!!! I'm trying your suggestion about the learning rate.
Hello, you shared the xsum dataset link here: https://github.com/XiangLi1999/PrefixTuning/issues/2. However, I see from the CodaLab link https://worksheets.codalab.org/bundles/0x58f85171b43f4e61bf411c35faab369d and from the hyperparameters/data directory in https://worksheets.codalab.org/bundles/0xa3f0cd3c10c7490ab508a351968cbdcf that you used the xsum_news data. When I checked xsum_news, I found that the validation file has 7,186 examples, whereas the original dataset has 11,327. The test set is also different, with 11,333 examples in xsum_news vs. 20,418 in the original xsum. I was wondering if you could explain the differences in eval/test dataset sizes compared to the original, and perhaps provide your script for preprocessing the original xsum.
Thanks!