Hi @suchenzang!
We ran a few ablations training with and without citations, and found that citations extracted as-is generally hurt the representations learned by the model. We are considering adding them back in with consistent formatting across papers in the next version.
If you need citations sooner, I would recommend requesting access to S2ORC, which peS2o is derived from, and extracting citations from there.
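For example, something along these lines could pull bibliography entries out of an S2ORC pdf_parse shard. This is a rough sketch: the shard path is hypothetical, and the field names (`paper_id`, `bib_entries`, `title`, `year`) assume the 2020 S2ORC pdf_parse schema; newer Semantic Scholar releases use a different layout, so adjust accordingly.

```python
import gzip
import json

def iter_bibliographies(pdf_parse_path):
    """Yield (paper_id, bib_entries) from one S2ORC pdf_parse shard (.jsonl.gz).

    Field names follow the 2020 S2ORC pdf_parse schema; newer releases
    differ, so treat these as assumptions to verify against your copy.
    """
    with gzip.open(pdf_parse_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["paper_id"], record.get("bib_entries", {})

# Example: print a rough reference list for each paper in one shard
# (the shard filename here is illustrative).
for paper_id, bib in iter_bibliographies("pdf_parses_0.jsonl.gz"):
    for ref_id, entry in bib.items():
        print(paper_id, ref_id, entry.get("year"), entry.get("title"))
```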
Hope this helps!
Best, Luca
Thanks @soldni for the info! Could you clarify what you mean by "hurting representations learned by the model"? I'm currently interpreting that to mean hurting standard NLP benchmarks if citations are improperly included, but perhaps there were other probes into learned representations? What model size was used for these ablations?
Thanks for this work, btw. Open-sourcing processed datasets is sorely needed in the field!
Sure thing!
We ran ablations with 1B autoregressive models trained from scratch on different versions of peS2o, evaluated zero-shot on a suite of downstream tasks (OpenBookQA, COPA, RTE, SST-2, ARC-Easy, HellaSwag, SciQ, PIQA, WinoGrande) as well as perplexity on the held-out set from M2D2. We also monitored train loss. Broadly, we saw metrics decrease when citations were included as-is, with everything else held equal.
We plan to release all models and results in a future manuscript; we just have not had a chance to write it up yet.
Grepped for `s2orc` in `train-00019-of-00020.json`, and noticed that bibliographies are not included in papers (leaving citations hanging). Any chance these could be added back in?
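For reference, a rough sketch of the kind of check involved: scan a shard for citation markers that have no accompanying reference section. The record fields (`text`, `id`), the bracketed-number citation pattern, and the "References"/"Bibliography" heuristic are all assumptions about the dump format, not how peS2o was actually audited.

```python
import json
import re

# Assumed pattern for inline citation markers, e.g. [3] or [3, 17];
# papers using (Author, Year) style would need a different regex.
citation_marker = re.compile(r"\[\d+(?:,\s*\d+)*\]")

with open("train-00019-of-00020.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text = record.get("text", "")
        markers = citation_marker.findall(text)
        has_bibliography = "References" in text or "Bibliography" in text
        if markers and not has_bibliography:
            # Citation markers present but no reference section found.
            print(record.get("id"), f"{len(markers)} markers, no bibliography")
```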