Still can't reconcile the data differences. If I train PrefixTuning with the original dataset (from HuggingFace), my results are 3 percentage points lower than stated in the paper. However, if I use the data in the xsum_news directory from CodaLab, then it's 1 percentage point lower. I have added a screenshot of the dataset sizes here:
Hi Jonathan, I think this is not a bug. We had some extrapolation experiments (testing out-of-distribution performance) where we split the xsum dataset differently: xsum_news is the dataset for the extrapolation experiment. Specifically, the training data contains {world, uk, business} news and the test data contains other news categories (e.g., health, tech). But nice catch!
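(For illustration only, here is a minimal sketch of what such a topic-based split could look like. The id_to_topic mapping is hypothetical, since the released xsum dataset carries no topic labels; this is not the authors' actual preprocessing script.)
from datasets import load_dataset

# Hypothetical lookup from article id to its BBC news section,
# e.g. {"29911712": "world"}; not shipped with the HuggingFace dataset.
id_to_topic = {}

TRAIN_TOPICS = {"world", "uk", "business"}  # in-distribution sections

xsum = load_dataset("xsum", split="train")

# Articles from the three training sections form the in-distribution set;
# everything else (health, tech, ...) is held out for the extrapolation test.
in_domain = xsum.filter(lambda ex: id_to_topic.get(ex["id"]) in TRAIN_TOPICS)
held_out = xsum.filter(lambda ex: id_to_topic.get(ex["id"]) not in TRAIN_TOPICS)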
For the original xsum dataset, I also tuned the length penalty --length_pen, and setting it to 0.8 improved the performance in my case.
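(As a side note, length penalty is a standard argument of the HuggingFace generate API; a minimal decoding sketch, with an illustrative beam size and max length rather than the repo's exact config:)
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

inputs = tokenizer("Some news article text...", return_tensors="pt",
                   truncation=True, max_length=512)

# length_penalty < 1.0 nudges beam search toward shorter summaries;
# this corresponds to the --length_pen 0.8 flag discussed above.
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=6,
    length_penalty=0.8,
    max_length=60,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))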
FYI, here is a screenshot of the last couple of epochs' dev scores, and my model name is xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512, from which I believe you can infer all the hyperparameter settings.
python finetune.py \
    --model_name_or_path facebook/bart-large \
    --output_dir xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 \
    --data_dir xsum --tuning_mode prefixtune \
    --optim_prefix yes --preseqlen 200 --prefix_mode activation --format_mode cat \
    --use_deep no --mid_dim 512 --use_dropout no --prefix_dropout 0.0 \
    --do_train --gpus 1 --num_train_epochs 30 --seed 101 \
    --learning_rate 5e-05 --weight_decay 0.0 --label_smoothing 0.0 \
    --train_batch_size 16 --eval_batch_size 16 --gradient_accumulation_steps 1 \
    --max_source_length 512 --max_target_length 60 \
    --val_max_target_length 60 --test_max_target_length 100 \
    --fp16 --fp16_opt_level O1
Yes, that's correct. I can confirm that I get almost the same results with a length_pen of 0.8 and with your altered split of xsum. However, on the original xsum, prefix tuning performs 2-3 ROUGE points lower. Just a note here: in Table 2 of your paper, you compared your model trained on your altered split of xsum against a fully finetuned BART-large on the original xsum...
Emmm, actually the result in Table 2 comes from the prefix model trained on the original xsum dataset (the HuggingFace one; let's denote it "xsum"), and the "within news" result in Table 3 comes from the prefix model trained on my new split, denoted "xsum_news". You said you can replicate the test results (e.g., 20.93) using the xsum_news split? I find this quite surprising, because xsum_news is supposed to be a harder data split than xsum.
Yes, I got really high scores on the OOD split ("xsum_news", i.e., the Table 3 "within-news" results). Note that on that split, a fully fine-tuned BART-large is also much higher for me, and slightly better than prefix tuning.
Hmm, this is interesting. Could you share your scripts and/or hyperparameters? Also, just to confirm: you are evaluating on the test set, not the dev set, right?
I was trying to replicate your validation set results above. I will try on test now.
Ohh, well this makes sense: the validation set is ID (in-distribution).
Oh I see. It's still weird, though: I get the same val results as you above on xsum_news, but I get 2-3 percentage points lower on xsum_original on val. I think that if I just run everything on test, I will get scores similar to those in your paper; on xsum_original, my val set scores were always lower than my test set scores. I think you kind of threw a curveball at me, since the validation scores posted above are meant to be for xsum_news when I was originally asking about xsum_original :)
I can double-check by running my test scripts on both datasets.
Yeah, sorry for the confusion. The CodaLab result was on xsum_news (I didn't actually realize that until you pointed it out, many thanks for that!!! I was running more OOD experiments later, which changed my default...), but the screenshot I pasted above is on xsum_original.
@XiangLi1999 Hi, still the same issue. I used the hyperparameters you mentioned here:
python train_bart.py --mode xsum --preseqlen 200 --do_train yes --fp16 yes --bsz 16 --epoch 30 --gradient_accumulation_step 1 --learning_rate 0.00005 --mid_dim 512 --n_gpu 1
which constructs the output directory xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 and launches:
python finetune.py \
    --model_name_or_path facebook/bart-large \
    --output_dir xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 \
    --data_dir xsum --tuning_mode prefixtune \
    --optim_prefix yes --preseqlen 200 --prefix_mode activation --format_mode cat \
    --use_deep no --mid_dim 512 --use_dropout no --prefix_dropout 0.0 \
    --do_train --gpus 1 --num_train_epochs 30 --seed 101 \
    --learning_rate 5e-05 --weight_decay 0.0 --label_smoothing 0.0 \
    --train_batch_size 16 --eval_batch_size 16 --gradient_accumulation_steps 1 \
    --max_source_length 512 --max_target_length 60 \
    --val_max_target_length 60 --test_max_target_length 100 \
    --fp16 --fp16_opt_level O1
Still, I got 1-2 ROUGE points lower on the xsum_original val set. However, my val loss almost matches the screenshot you posted. Here is my metrics.json:
{
"val_avg_loss": 1.6198463439941406,
"val_avg_rouge1": 42.233881249999996,
"val_avg_rouge2": 19.36146875,
"val_avg_rougeL": 34.15171875,
"val_avg_gen_time": 0.15285185538232327,
"val_avg_gen_len": 35.65625,
"step_count": 27
},
{
"val_avg_loss": 1.6255970001220703,
"val_avg_rouge1": 42.20043125,
"val_avg_rouge2": 19.1729125,
"val_avg_rougeL": 34.023271875,
"val_avg_gen_time": 0.14824258256703615,
"val_avg_gen_len": 35.90625,
"step_count": 28
},
{
"val_avg_loss": 1.6247661113739014,
"val_avg_rouge1": 42.71155,
"val_avg_rouge2": 19.746046874999998,
"val_avg_rougeL": 34.4363625,
"val_avg_gen_time": 0.14882883243262768,
"val_avg_gen_len": 35.6875,
"step_count": 29
},
{
"val_avg_loss": 1.6258924007415771,
"val_avg_rouge1": 42.674909375,
"val_avg_rouge2": 19.538184375,
"val_avg_rougeL": 34.334109375,
"val_avg_gen_time": 0.14808641048148274,
"val_avg_gen_len": 35.28125,
"step_count": 30
},
{
"val_avg_loss": 1.6260249614715576,
"val_avg_rouge1": 42.778821875000006,
"val_avg_rouge2": 19.570328125,
"val_avg_rougeL": 34.363040625000004,
"val_avg_gen_time": 0.1509597236290574,
"val_avg_gen_len": 36.25,
"step_count": 31
}
Any suggestions?
Besides, have you ever tried training this model on multiple GPUs? On multiple GPUs, the performance is 2-3 points lower than what you posted (using lr=0.00014). Can you give a set of hyperparameters suitable for DDP?
"I got 1-2 ROUGE points lower on the xsum_original val set. However, my val loss almost matches the screenshot you posted."
I am a bit confused by the above sentence. The screenshot I posted is on the xsum_original val set, so you are around 0.5 off. (Where does the "1-2 ROUGE points lower" come from?)
I never tried this code in DDP settings (due to resource constraints, sadly). I am not 100% sure, but I guess for DDP you don't need to scale the lr down; I assume DDP would handle this automatically, so the lr should still be 0.00005.
Sorry, "1-2 ROUGE points lower" is compared to Table 2 of the paper, PREFIX(2%); the gaps are 1, 1.3, and 1.6 respectively. I don't know if these two experiments are comparable? Yes, compared to the screenshot you posted, the gap is about 0.5. So I guess the performance you report in Table 2 is on the test set, and prefixlen=200 means PREFIX(2%)?
I notice that my gen_len is longer than the one in your post by 1. Should I add --length_pen 0.8 to improve the performance?
I've tried keeping lr=0.00005 but cannot replicate the performance on multiple GPUs. As far as I can tell, if I use 8 GPUs the effective batch size is 8 times the original, so the lr should be increased? Also, I don't know if DDP does this automatically. Anyway, I'll try some other hyperparameters; if I find a good set, I'll post it. (A quick sketch of the usual scaling heuristics follows below.)
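(For reference, a back-of-the-envelope sketch of two common heuristics, the linear scaling rule of Goyal et al., 2017 and square-root scaling. Whether either applies to prefix tuning here is an open question, so treat the numbers as starting points; notably, square-root scaling of 5e-5 over 8 GPUs gives roughly the 0.00014 mentioned above.)
import math

base_lr = 5e-5        # single-GPU learning rate from the script above
per_gpu_batch = 16
n_gpus = 8
grad_accum = 1

# Effective batch size grows linearly with the number of data-parallel workers.
effective_batch = per_gpu_batch * n_gpus * grad_accum   # 128
scale = n_gpus * grad_accum                             # 8x the single-GPU batch

linear_lr = base_lr * scale            # 4.0e-4 (linear scaling rule)
sqrt_lr = base_lr * math.sqrt(scale)   # ~1.41e-4 (square-root scaling rule)

print(f"effective batch: {effective_batch}")
print(f"linear-scaled lr: {linear_lr:.2e}, sqrt-scaled lr: {sqrt_lr:.2e}")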
I think if you get dev performance that matches the screenshot, you will get test performance matching Table 2. The two should be correlated, but not exactly the same: one is on dev and one is on test. I used a length penalty of 0.8 plus a slightly different ROUGE evaluation to get deterministic scores. The evaluation script just makes the result deterministic; it doesn't change the numbers much:
from collections import defaultdict

import numpy as np
from rouge_score import rouge_scorer, scoring

ROUGE_KEYS = ["rouge1", "rouge2", "rougeL"]

def calculate_rouge(output_lns, reference_lns, use_stemmer=True):
    scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=use_stemmer)
    aggregator = scoring.BootstrapAggregator()
    # Plain per-example average, kept alongside the (sampled) bootstrap
    # aggregate so that the reported score is deterministic.
    my_deterministic_avg = defaultdict(list)
    for reference_ln, output_ln in zip(reference_lns, output_lns):
        scores = scorer.score(reference_ln, output_ln)
        for key, val in scores.items():
            my_deterministic_avg[key].append(val.fmeasure)
        aggregator.add_scores(scores)
    result = aggregator.aggregate()
    print(result)
    print("my deterministic avg:")
    for k, v in my_deterministic_avg.items():
        print("{}={}".format(k, round(np.array(v).mean() * 100, 3)))
    return {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
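(A hypothetical quick check of the function above, not from the repo:)
preds = ["the cat sat on the mat", "heavy rain is expected tomorrow"]
refs = ["a cat was sitting on the mat", "forecasters expect heavy rain tomorrow"]
# Prints both the bootstrap aggregate and the deterministic per-example average,
# then returns the bootstrap mid F1 scores.
print(calculate_rouge(preds, refs))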
Yes, prefixlen=200 means Prefix(2%).
Yes, adding --length_pen 0.8 at decoding time would probably improve the performance.
I think the effective batch size will be 8 times bigger, but I don't have a good intuition for how that relates to the learning rate. Intuitively, I'd guess keeping the old learning rate, or slightly increasing it (e.g., to 0.00008), would still be fine, so I am not sure why it doesn't work for you...
I see. Thanks for your reply!!! I'm trying your suggestion about the learning rate.
Hello, you shared the xsum dataset link here: https://github.com/XiangLi1999/PrefixTuning/issues/2. However, I see from the CodaLab link https://worksheets.codalab.org/bundles/0x58f85171b43f4e61bf411c35faab369d and from the hyperparameters/data directory in https://worksheets.codalab.org/bundles/0xa3f0cd3c10c7490ab508a351968cbdcf that you used the xsum_news data. When I checked xsum_news, I found that the validation file has 7,186 examples, whereas the original dataset has 11,327. The test set is also different, with 11,333 examples in xsum_news vs. 20,418 in the original xsum. I was wondering if you could explain the differences in eval/test dataset sizes compared to the original, and perhaps provide your script for preprocessing the original xsum.
Thanks!