allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Q: S2ORC endpoints to get the full text #141

Closed shubhamagarwal92 closed 1 year ago

shubhamagarwal92 commented 1 year ago

Wait! 🤚 Before creating this issue, please read carefully: We want to enable our partners (that's you) to succeed, but our team has limited bandwidth for answering your questions. In order to help us help you more efficiently, kindly spend 2 minutes to check:

... it's OK if you missed something and end up asking something we addressed before, but we might tease you about it a little bit 😉

We're committed to taking action on your issue within 7 days and responding within 10 days. If we don't address your question within that window, please post it on Slack to bring it to our attention.

Hi,

  1. Is there any S2ORC endpoint, similar to the papers batch endpoint (https://api.semanticscholar.org/graph/v1/paper/batch), that we can batch-query using ArXiv/MAG/corpus IDs, along the lines of https://github.com/allenai/s2-folks/blob/main/examples/python/bulk_get_papers_by_pmid/get_papers.py#L33 (see the sketch after this list)? I need the full text as well as the annotations shown here: https://github.com/allenai/s2-folks/blob/main/examples/python/s2ag_datasets/sample-datasets.py#L15
  2. If not, how do I download the full S2ORC corpus, similar to the papers dataset as shown here: https://github.com/allenai/s2-folks/blob/main/examples/python/s2ag_datasets/full-datasets.py#L21
  3. Is there also a way to filter by publicationdate?
  4. How do I find the IDs of the papers cited in the related-work section of the full text? Maybe using the annotations?
  5. How do I filter sections to include only the main text of the paper (usually everything before the references, omitting the appendix and other back matter)? Is there any page-number information?
  6. Do you have any starter code to filter the S2ORC corpus by ArXiv/MAG IDs? I downloaded the papers dataset, which is pretty huge: 174 GB unzipped (30 files of 1.5 GB each). I am assuming S2ORC is also huge, considering it is ~30 files of 4 GB each compressed. Would the 30 splits in paper-ids (30 files of 500 MB compressed) correspond to the S2ORC splits? However, they have different numbers of records according to https://api.semanticscholar.org/datasets/v1/release/latest/:
  * s2orc: 5M records in 30 4 GB files
  * paper-ids: 450M records in 30 500 MB files
  * papers: 200M records in 30 1.5 GB files
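
For concreteness, here is a minimal sketch of the kind of batch call I mean in question 1, adapted from the get_papers.py pattern linked there. The specific IDs and fields are illustrative only; the `ARXIV:`/`MAG:`/`CorpusId:` prefixes are the Graph API's external-ID syntax, and the API key header is optional:

```python
import requests

# Batch-query the Graph API for several papers at once; IDs may mix
# prefixes such as ARXIV:, MAG:, and CorpusId: (example IDs are arbitrary).
ids = ["ARXIV:2106.15928", "MAG:112218234", "CorpusId:215416146"]

resp = requests.post(
    "https://api.semanticscholar.org/graph/v1/paper/batch",
    params={"fields": "title,externalIds,publicationDate"},
    json={"ids": ids},
    # headers={"x-api-key": "YOUR_KEY"},  # optional; raises the rate limit
)
resp.raise_for_status()

# The response is a list aligned with the input IDs; unmatched IDs are null.
for paper in resp.json():
    if paper:
        print(paper["paperId"], paper.get("title"))
```

This returns Graph API fields (metadata such as title and publicationDate), but not the S2ORC full text and annotations; hence the question.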

Your help is highly appreciated! Thanks!

cfiorelli commented 1 year ago

Confirming some progress. My latest inquiries via Slack are seeking confirmation of the rough post-processing procedure below; a code sketch of these steps follows the list. While I plan to test the procedure here as well, I'm holding for the user to update on current status, since this update is ~6 hours stale.

  1. Filter for "related work" by running a regex over the text of each sectionheader annotation (https://github.com/allenai/s2-folks/blob/main/examples/python/s2ag_datasets/sample-datasets.py#L24)
  2. Take the matching sectionheader's span as the "related work" annotation (r_start, r_end)
  3. For each bibref span (b_start, b_end), keep those contained in (r_start, r_end), i.e. r_start < b_start and b_end < r_end, and collect the corresponding ref_id
  4. For these ref_ids, use bibentry to find the matched_paper_id
  5. Use each matched_paper_id as a corpus_id to filter the text again in a second traversal of the S2ORC data
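
Here's a minimal Python sketch of steps 1-4 above, under a few assumptions: each S2ORC record exposes `content.text` and `content.annotations`, annotation values are JSON-encoded lists of `{start, end, attributes}` spans (as in the linked sample-datasets.py), and bibentry attributes carry an `id` that lines up with bibref's `ref_id` alongside `matched_paper_id`. Treat it as a starting point, not confirmed schema:

```python
import json
import re

def load_spans(annotations, key):
    """Decode one annotation type into a list of {start, end, ...} spans."""
    raw = annotations.get(key)
    return json.loads(raw) if raw else []

def related_work_citations(record):
    """Steps 1-4: collect matched_paper_ids cited in the 'related work' section."""
    text = record["content"]["text"]
    annotations = record["content"]["annotations"]

    # 1. Regex over each sectionheader's text span to find "related work".
    headers = load_spans(annotations, "sectionheader")
    r_start = r_end = None
    for i, h in enumerate(headers):
        if re.search(r"related\s+work", text[int(h["start"]):int(h["end"])], re.I):
            # 2. Heuristic: the section runs from this header to the next
            #    header, or to the end of the text if it is the last section.
            r_start = int(h["end"])
            r_end = int(headers[i + 1]["start"]) if i + 1 < len(headers) else len(text)
            break
    if r_start is None:
        return []

    # 3. Keep bibref spans contained in (r_start, r_end); note their ref_id.
    ref_ids = set()
    for b in load_spans(annotations, "bibref"):
        if r_start < int(b["start"]) and int(b["end"]) < r_end:
            ref_id = (b.get("attributes") or {}).get("ref_id")
            if ref_id:
                ref_ids.add(ref_id)

    # 4. Map ref_ids to bibentry matched_paper_ids. ASSUMPTION: bibentry
    #    attributes expose an "id" that lines up with bibref's "ref_id".
    matched = []
    for entry in load_spans(annotations, "bibentry"):
        attrs = entry.get("attributes") or {}
        if attrs.get("id") in ref_ids and attrs.get("matched_paper_id"):
            matched.append(attrs["matched_paper_id"])
    return matched
```

Step 5 would then use the returned matched_paper_ids as corpus ids to pull the full text of the cited papers in a second pass over the S2ORC files.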
cfiorelli commented 1 year ago

Resolved via Slack.