Closed: wormyu closed this issue 1 year ago
Hi, I'm running into a similar issue. I'm trying to download all full texts and abstracts from the medical domain, i.e. every full text/abstract in S2ORC that has a PMID or PMCID, ideally through the API. Is that possible, or do I have to download the whole dataset and then filter it accordingly? Thank you.
@maayansharon10 & @wormyu At this time, the only solution is to download the full dataset and filter it locally.
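For anyone taking the download-and-filter route, here is a minimal sketch of the filtering step in Python. It assumes each downloaded shard is a gzip-compressed JSON Lines file and that each record has an `externalids` object whose PubMed and PMC keys are spelled `pubmed` and `pubmedcentral` (matched case-insensitively here). Those field names are assumptions on my part, so check them against a record from the release you actually download. The helper names `has_medical_id` and `filter_jsonl_gz` are made up for this example.

```python
import gzip
import io
import json
from typing import Iterator

# Assumed key names for PubMed / PMC identifiers inside "externalids".
# Verify them against a real record before relying on this filter.
MEDICAL_ID_KEYS = {"pubmed", "pubmedcentral"}


def has_medical_id(record: dict) -> bool:
    """Return True if the record carries a non-empty PubMed or PMC identifier."""
    external_ids = record.get("externalids") or {}
    return any(
        key.lower() in MEDICAL_ID_KEYS and value
        for key, value in external_ids.items()
    )


def filter_jsonl_gz(stream: io.BufferedIOBase) -> Iterator[dict]:
    """Yield matching records from a gzip-compressed JSON Lines byte stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if has_medical_id(record):
                yield record
```

To use it, open each downloaded shard in binary mode (for example `open(path, "rb")`) and pass the file object to `filter_jsonl_gz`. Records are streamed one at a time, so a shard never has to fit in memory.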
Hi, I've recently been reading your work don't-stop-pretraining. In it you use BIOMED and CS papers from S2ORC as the pretraining corpus. However, the S2ORC README says "The original S2ORC dataset files are no longer available for download. They were refactored into multiple datasets available through the Semantic Scholar APIs." And among the Semantic Scholar API bulk datasets I can only find the whole S2ORC dataset, at this URL:
https://api.semanticscholar.org/datasets/v1/release/2023-07-25/dataset/s2orc
Is there any way I can download papers from a specific domain of the S2ORC dataset through this bulk dataset API?
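For reference, a rough sketch of how the shard list for that URL could be fetched with the Python standard library. It assumes the endpoint expects an API key in an `x-api-key` header and returns JSON with a `files` list of download links. Both points are my understanding of the API rather than something confirmed in this thread. The function names and the `api_key` parameter are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.semanticscholar.org/datasets/v1"


def build_dataset_request(release_id: str, dataset: str, api_key: str) -> urllib.request.Request:
    """Build the request for one dataset in one release (no network access here)."""
    url = f"{BASE_URL}/release/{release_id}/dataset/{dataset}"
    # Assumption: the key is passed in the "x-api-key" header.
    return urllib.request.Request(url, headers={"x-api-key": api_key})


def list_shard_urls(release_id: str, dataset: str, api_key: str) -> list:
    """Fetch the dataset metadata and return its download links."""
    request = build_dataset_request(release_id, dataset, api_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: the download links live under a "files" key.
    return payload.get("files", [])
```

Calling `list_shard_urls("2023-07-25", "s2orc", api_key)` would then give the links for the whole dataset. There is no domain parameter in this sketch, so any selection by field still has to happen after download.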
I have looked at these issues:
https://github.com/allenai/dont-stop-pretraining/issues/4?fbclid=IwAR31jS9-uCjKDUSMnZi07KvuIAQ_xupV-e5luK910GxNdBZmGEV7ArKQty4 : I think this describes the old way to download S2ORC, which is no longer available, or maybe I'm wrong?
https://github.com/allenai/s2-folks/issues/45