TencentARC / LLaMA-Pro

[ACL 2024] Progressive LLaMA with Block Expansion.
https://tencentarc.github.io/LLaMA-Pro/
Apache License 2.0

Arxiv Data #4


ZhengTang1120 commented 6 months ago

Hi,

You mentioned that the model is trained on scientific papers (29B tokens of arXiv data) as part of the math component. I am wondering whether you included the full articles or just the math content?

Thank you, Zheng Tang

billxbf commented 6 months ago

+1

hills-code commented 6 months ago

I use the arXiv subset of the proof-pile-2 dataset (https://huggingface.co/datasets/EleutherAI/proof-pile-2).
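
For anyone who wants to inspect what that subset contains, here is a minimal sketch using the Hugging Face `datasets` library. The `arxiv` config name and the `text` field are assumptions based on the dataset card, so verify them there before relying on this.

```python
# Minimal sketch: stream the arXiv subset of proof-pile-2 rather than
# downloading the full ~29B-token split up front.
from datasets import load_dataset

# "arxiv" as the config name is an assumption taken from the dataset card.
arxiv = load_dataset(
    "EleutherAI/proof-pile-2", "arxiv", split="train", streaming=True
)

# Peek at one record; the "text" field name is also an assumption.
for example in arxiv:
    print(example["text"][:500])
    break
```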