allenai / scibert

A BERT model for scientific text.
https://arxiv.org/abs/1903.10676
Apache License 2.0

SciBERT Corpus Availability? #87

Open CyndxAI opened 4 years ago

CyndxAI commented 4 years ago

Hi, do you plan to make the pretraining corpus available, or provide a way to reproduce / approximate it using Semantic Scholar?

kyleclo commented 4 years ago

Hey @CyndxAI, unfortunately the SciBERT pretraining corpus is not publicly available. If you're interested in a large pretraining corpus for training these kinds of language models, I can point you to another project from our team: https://github.com/allenai/s2orc, which provides 70M+ paper abstracts and 8M+ full-text papers. That should be plenty of text to train on. If you check out the preprint https://arxiv.org/pdf/1911.02782.pdf, you can see that we reproduced the SciBERT results using this corpus.
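For anyone landing here later: a minimal sketch of pulling abstract text out of S2ORC-style gzipped JSONL shards for pretraining. The field names (`paper_id`, `abstract`) and the one-record-per-line gzip layout are my assumptions about the release format — verify against the schema documented in the allenai/s2orc repo before relying on this.

```python
import gzip
import json
import os
import tempfile

def iter_abstracts(jsonl_gz_path):
    """Yield non-empty abstracts from a gzipped JSONL metadata shard.

    Assumes one JSON record per line with an 'abstract' field, per my
    reading of the S2ORC metadata release; check allenai/s2orc for the
    authoritative schema.
    """
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            text = record.get("abstract")
            if text:  # skip records with missing/empty abstracts
                yield text

# Demonstrate on a tiny synthetic shard (stands in for a real S2ORC file).
sample = [
    {"paper_id": "1", "abstract": "We study scientific language models."},
    {"paper_id": "2", "abstract": None},  # no abstract available
]
path = os.path.join(tempfile.mkdtemp(), "metadata_0.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in sample:
        f.write(json.dumps(rec) + "\n")

abstracts = list(iter_abstracts(path))
print(abstracts)
```

Streaming shard-by-shard like this keeps memory flat, which matters at the 70M+ abstract scale mentioned above.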

CyndxAI commented 4 years ago

Perfect, that works. Thank you!