abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License

BookSum_Full BART Baseline script/code #66

Open saxenarohit opened 1 month ago

saxenarohit commented 1 month ago

Hi,

Great work! Thanks for sharing the code.

I have been trying to replicate the simple BART baseline on BookSum_Full, but I am unable to reproduce the results.

Could you share the code/script you used to train this model? https://huggingface.co/abertsch/bart-base-booksum

I was able to replicate the BART baseline for all the other datasets except this one.

Thanks!
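(For readers following along: below is a minimal sketch of what such a baseline training run might look like with the Hugging Face Seq2SeqTrainer. This is not the authors' script; the dataset id, column names, and all hyperparameters are assumptions, and the kmfoda/booksum mirror used here is the chapter-level variant, not BookSum-full.)

```python
# Minimal baseline sketch (NOT the authors' script): fine-tune BART-base
# on a BookSum-style dataset. Dataset id, columns, and hyperparameters
# below are assumptions -- substitute your own BookSum-full copy.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumption: the chapter-level BookSum mirror with "chapter"/"summary_text" columns.
raw = load_dataset("kmfoda/booksum")

def preprocess(batch):
    # Plain baseline: truncate inputs to BART's 1024-token window.
    model_inputs = tokenizer(batch["chapter"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary_text"], max_length=1024, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-base-booksum-baseline",
    learning_rate=2e-5,              # guess, not the paper's value
    per_device_train_batch_size=4,   # guess
    num_train_epochs=3,              # guess
    predict_with_generate=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```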

saxenarohit commented 1 month ago

Hi @empanada11 ,

I meant that I was able to reproduce the BART baseline (not Unlimiformer) on the gov_report dataset. My understanding is that the BookSum BART baseline (not Unlimiformer) was likewise fine-tuned directly on BookSum.

I saw from other issues (https://github.com/abertsch72/unlimiformer/issues/57) that people were not able to replicate the Unlimiformer results on gov_report. Could you give an update on that issue? Let me try running that as well. Also, the datasets from tau/sled don't have a test set. Does that mean the paper reports development-set scores?
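(One quick way to confirm which splits the hub dataset actually ships; a sketch that assumes the "gov_report" config name under tau/sled:)

```python
from datasets import load_dataset

# Prints the available splits and their sizes for the SLED copy of GovReport.
ds = load_dataset("tau/sled", "gov_report")
print(ds)
```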

abertsch72 commented 1 month ago

Hey @saxenarohit ! That's strange, thanks for flagging. What do you get when you train BART-base? And what library are you using to evaluate ROUGE?

> the datasets from tau/sled don't have a test set. Does that mean the paper reports the development score?

We report test-set scores using the test sets from the original datasets, preprocessed to match the SCROLLS dataset formatting (e.g., for GovReport). We do this instead of submitting to the leaderboard because we didn't run all the SCROLLS tasks. There are also development-set scores in the appendices, though, if you'd like to work off of those!
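(Illustrative sketch only, not the authors' preprocessing code: assuming the ccdv/govreport-summarization mirror with "report"/"summary" columns, reshaping its test split into SCROLLS-style fields might look like this.)

```python
from datasets import load_dataset

# Assumption: this mirror exposes "report" (document) and "summary" (target).
gov_test = load_dataset("ccdv/govreport-summarization", split="test")

def to_scrolls(example, idx):
    # SCROLLS-formatted datasets expose "id", "input", and "output" columns.
    return {"id": str(idx), "input": example["report"], "output": example["summary"]}

gov_scrolls = gov_test.map(to_scrolls, with_indices=True,
                           remove_columns=gov_test.column_names)
```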

@empanada11 sorry to hear that! Can you share what your issue is?

saxenarohit commented 1 month ago

Hi @abertsch72, thanks for your response. I am using the Hugging Face evaluate library and getting Rouge1: 24.42, Rouge2: 5.75, RougeL: 12.98, RougeLsum: 23.00 on the test set. These numbers are quite far off. Could you please share the script/code/hyperparameters to replicate the results?
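(For reference, a sanity-check sketch with the HF `evaluate` ROUGE scorer, which I assume is the library meant here. Note it returns fractions in [0, 1] by default, so scores need to be multiplied by 100 before comparing against the paper's tables.)

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the generated summary for one book ..."]  # placeholder
references = ["the gold reference summary ..."]           # placeholder

# use_stemmer=True matches the common rouge_score setup for summarization.
scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print({k: round(v * 100, 2) for k, v in scores.items()})
```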

abertsch72 commented 1 month ago

Hi @saxenarohit, sorry for the delay. That definitely sounds quite low; is it possible you're generating outputs of fewer than 1024 tokens? I've been traveling, but I will dig up the BookSum code this weekend!
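(A sketch of how to rule out truncated generations; this is not the authors' evaluation code, and the beam size is a guess. The key point is that max_length must be large enough for BookSum-scale summaries.)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "abertsch/bart-base-booksum"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

book_text = "..."  # placeholder for one full-book input

inputs = tokenizer(book_text, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    max_length=1024,  # allow the decoder to emit up to 1024 tokens
    num_beams=4,      # assumption; actual decoding settings may differ
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
print(summary_ids.shape[-1], "tokens generated")  # check outputs aren't cut short
```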