How to download papers from a specific domain (BIOMED, CS) in S2ORC via the bulk datasets API?

wormyu commented 11 months ago

Hi, I've recently been reading your work, don't-stop-pretraining. In it, you use BIOMED and CS papers from S2ORC as the pretraining corpus. However, the S2ORC README says: "The original S2ORC dataset files are no longer available for download. They were refactored into multiple datasets available through the Semantic Scholar APIs." In the Semantic Scholar APIs bulk datasets, I can only find the whole s2orc dataset, at this URL:

https://api.semanticscholar.org/datasets/v1/release/2023-07-25/dataset/s2orc

Is there any way to download only the papers from a specific domain of the S2ORC dataset through this bulk datasets API?

I have looked at these issues:

https://github.com/allenai/dont-stop-pretraining/issues/4?fbclid=IwAR31jS9-uCjKDUSMnZi07KvuIAQ_xupV-e5luK910GxNdBZmGEV7ArKQty4 : I think this describes the old way to download S2ORC, which is no longer available. Or am I wrong?
https://github.com/allenai/s2-folks/issues/45

maayansharon10 commented 10 months ago

Hi, I'm running into a similar issue. I'm trying to download all full texts and abstracts from the medical domain, i.e. every full text/abstract in s2orc that has a pmid or pmcid, ideally through the API. Is this possible, or do I have to download the whole dataset and then filter it accordingly? Thank you.

cfiorelli commented 9 months ago

@maayansharon10 & @wormyu At this time the only solution is to download and filter.
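
For anyone landing here, a minimal sketch of the "download and filter" step, assuming the shards have already been downloaded as gzipped JSON Lines files. The field names (`s2fieldsofstudy`, `externalids`, `corpusid`), the ID keys (`PubMed`, `PubMedCentral`) and the category strings are assumptions, not confirmed in this thread; check them against the README returned for the dataset, and note that the field-of-study labels may live in a different dataset than the full text, in which case a join on the paper ID would be needed first. All function names are hypothetical.

```python
import gzip
import json
from typing import Callable, Iterator, Set

# Assumed field names; verify them against the dataset README before relying on them.
FIELDS_KEY = "s2fieldsofstudy"    # assumed: list of {"category": ..., "source": ...}
EXTERNAL_IDS_KEY = "externalids"  # assumed: dict such as {"PubMed": "...", "PubMedCentral": "..."}


def iter_records(path: str) -> Iterator[dict]:
    """Yield one JSON record per non-empty line of a gzipped JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def in_domain(record: dict, wanted: Set[str]) -> bool:
    """True if any field-of-study category of the record is in `wanted`."""
    fields = record.get(FIELDS_KEY) or []
    return any(field.get("category") in wanted for field in fields)


def has_pubmed_id(record: dict) -> bool:
    """True if the record carries a non-empty PubMed or PubMed Central ID."""
    ids = record.get(EXTERNAL_IDS_KEY) or {}
    return bool(ids.get("PubMed") or ids.get("PubMedCentral"))


def filter_shard(path: str, keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Stream a shard and yield only the records for which `keep` returns True."""
    return (record for record in iter_records(path) if keep(record))
```

Usage would look like `filter_shard("shard.jsonl.gz", lambda r: in_domain(r, {"Computer Science"}))` for the CS case, or `filter_shard("shard.jsonl.gz", has_pubmed_id)` for the pmid/pmcid case. Streaming one shard at a time keeps memory use flat, so the whole dataset never has to be loaded at once.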

allenai/s2-folks, issue #133