allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

How to download papers from a specific domain (BIOMED, CS) in S2ORC through the bulk dataset API? #133

Closed wormyu closed 9 months ago

wormyu commented 11 months ago

Hi, I have recently been reading your work don't-stop-pretraining. In that work you use BIOMED and CS papers from S2ORC as the pretraining corpus. However, according to the S2ORC README, "The original S2ORC dataset files are no longer available for download. They were refactored into multiple datasets available through the Semantic Scholar APIs." In the Semantic Scholar API bulk datasets, I can only find the whole s2orc dataset, at this URL:

https://api.semanticscholar.org/datasets/v1/release/2023-07-25/dataset/s2orc

Is there any way I can download papers from a specific domain in the S2ORC dataset through this bulk dataset API?

I have referred to these issues:

maayansharon10 commented 10 months ago

Hi, I'm running into a similar issue: I'm trying to download all full texts and abstracts from the medical domain, that is, all full texts/abstracts in s2orc that have a pmid or pmcid, ideally through the API. Is that possible, or do I have to download the whole dataset and then filter it accordingly? Thank you.

cfiorelli commented 9 months ago

@maayansharon10 & @wormyu At this time the only solution is to download and filter.
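
A minimal sketch of what "download and filter" could look like in Python, using only the standard library. It rests on several assumptions that are not confirmed in this thread: that the dataset endpoint returns JSON with a `files` list of download links, that requests need an `x-api-key` header, that each shard is gzipped JSON Lines, and that each record has an `externalids` object with `pubmed` and `pubmedcentral` keys. The helper names (`list_shard_urls`, `has_pubmed_id`, `filter_jsonl_gz`, `filter_shard`) are hypothetical; check the field names against the dataset README before relying on them.

```python
import gzip
import json
import urllib.request
from typing import Callable, Iterator, List

# URL taken from the question above.
DATASET_URL = "https://api.semanticscholar.org/datasets/v1/release/2023-07-25/dataset/s2orc"


def list_shard_urls(api_key: str, dataset_url: str = DATASET_URL) -> List[str]:
    """Return the download links for the dataset shards.

    Assumption: the response is JSON with a "files" list of URLs,
    and the key is sent in an "x-api-key" header.
    """
    request = urllib.request.Request(dataset_url, headers={"x-api-key": api_key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["files"]


def has_pubmed_id(record: dict) -> bool:
    """Keep records that carry a PubMed or PubMed Central identifier.

    Assumption: identifiers live under "externalids" with lowercase keys.
    """
    ids = record.get("externalids") or {}
    return bool(ids.get("pubmed") or ids.get("pubmedcentral"))


def filter_jsonl_gz(stream, keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Stream a gzipped JSON Lines file object and yield matching records."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if keep(record):
                yield record


def filter_shard(url: str, keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Download one shard and filter it on the fly, without saving it first."""
    with urllib.request.urlopen(url) as response:
        yield from filter_jsonl_gz(response, keep)
```

Usage would be along the lines of looping over `list_shard_urls(api_key)` and writing every record yielded by `filter_shard(url, has_pubmed_id)` to a local file. A domain such as CS has no identifier of its own, so it would need a different predicate, probably one based on field-of-study metadata from another dataset joined by corpus ID; that join is not shown here.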