allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Q: S2ORC endpoints to get the full text #141

Closed shubhamagarwal92 closed 1 year ago

shubhamagarwal92 commented 1 year ago

Wait! 🤚 Before creating this issue, please read carefully: We want to enable our partners (that's you) to succeed, but our team has limited bandwidth for answering your questions. In order to help us help you more efficiently, kindly spend 2 minutes to check:

... it's OK if you missed something and end up asking something we addressed before, but we might tease you about it a little bit 😉

We're committed to taking action on your issue within 7 days and responding within 10 days. If we don't address your question within that window, please post it on Slack to bring it to our attention.

Hi,

  1. Is there any S2ORC endpoint, similar to the papers batch endpoint (https://api.semanticscholar.org/graph/v1/paper/batch), that we can batch-query using ArXiv/MAG/corpus IDs, along the lines of https://github.com/allenai/s2-folks/blob/main/examples/python/bulk_get_papers_by_pmid/get_papers.py#L33 (see the sketch after this list)? I need the full text as well as the annotations shown here: https://github.com/allenai/s2-folks/blob/main/examples/python/s2ag_datasets/sample-datasets.py#L15
  2. If not, how do I download the full S2ORC corpus, similar to the papers dataset as shown here: https://github.com/allenai/s2-folks/blob/main/examples/python/s2ag_datasets/full-datasets.py#L21
  3. Is there also a way to filter by publicationdate?
  4. How do I find the IDs of the papers cited in the related-work section of the full text? Maybe using the annotations?
  5. How do I filter sections to include only the main text of the paper (usually everything before the references, omitting the appendix and other back matter)? Is there any page-number information?
  6. Do you have any starter code to filter the S2ORC corpus by ArXiv/MAG IDs? I downloaded the papers dataset, which is pretty huge: 174 GB unzipped (30 files of 1.5 GB each). I am assuming S2ORC is also huge, considering it is ~30 files of 4 GB each compressed. Would the 30 splits in paper-ids (30 files of 500 MB compressed) correspond to the S2ORC splits? However, they have different numbers of records according to https://api.semanticscholar.org/datasets/v1/release/latest/:
  * s2orc: 5M records in 30 4 GB files
  * paper-ids: 450M records in 30 500 MB files
  * papers: 200M records in 30 1.5 GB files
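
For concreteness, here is a minimal sketch of the kind of batch call I mean in question 1, adapted from the get_papers.py pattern linked there. The specific IDs and fields are illustrative only; the `ARXIV:`/`MAG:`/`CorpusId:` prefixes are the Graph API's external-ID syntax, and the API key header is optional:

```python
import requests

# Batch-query the Graph API for several papers at once; IDs may mix
# prefixes such as ARXIV:, MAG:, and CorpusId: (example IDs are arbitrary).
ids = ["ARXIV:2106.15928", "MAG:112218234", "CorpusId:215416146"]

resp = requests.post(
    "https://api.semanticscholar.org/graph/v1/paper/batch",
    params={"fields": "title,externalIds,publicationDate"},
    json={"ids": ids},
    # headers={"x-api-key": "YOUR_KEY"},  # optional; raises the rate limit
)
resp.raise_for_status()

# The response is a list aligned with the input IDs; unmatched IDs are null.
for paper in resp.json():
    if paper:
        print(paper["paperId"], paper.get("title"))
```

This returns Graph API fields (metadata such as title and publicationDate), but not the S2ORC full text and annotations; hence the question.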

Your help is highly appreciated! Thanks!

cfiorelli commented 1 year ago

Confirming some progress. My latest inquiries via Slack are seeking confirmation of the rough post-processing procedure below; a code sketch of these steps follows the list. While I plan to test the procedure here as well, I'm holding for the user to update on current status, since this update is ~6 hours stale.

  1. Filter for "related work" by running a regex over the text of each sectionheader annotation (https://github.com/allenai/s2-folks/blob/main/examples/python/s2ag_datasets/sample-datasets.py#L24)
  2. Take the matching sectionheader's span as the "related work" annotation (r_start, r_end)
  3. For each bibref span (b_start, b_end), keep those contained in (r_start, r_end), i.e. r_start < b_start and b_end < r_end, and collect the corresponding ref_id
  4. For these ref_ids, use bibentry to find the matched_paper_id
  5. Use each matched_paper_id as a corpus_id to filter the text again in a second traversal of the S2ORC data
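
Here's a minimal Python sketch of steps 1-4 above, under a few assumptions: each S2ORC record exposes `content.text` and `content.annotations`, annotation values are JSON-encoded lists of `{start, end, attributes}` spans (as in the linked sample-datasets.py), and bibentry attributes carry an `id` that lines up with bibref's `ref_id` alongside `matched_paper_id`. Treat it as a starting point, not confirmed schema:

```python
import json
import re

def load_spans(annotations, key):
    """Decode one annotation type into a list of {start, end, ...} spans."""
    raw = annotations.get(key)
    return json.loads(raw) if raw else []

def related_work_citations(record):
    """Steps 1-4: collect matched_paper_ids cited in the 'related work' section."""
    text = record["content"]["text"]
    annotations = record["content"]["annotations"]

    # 1. Regex over each sectionheader's text span to find "related work".
    headers = load_spans(annotations, "sectionheader")
    r_start = r_end = None
    for i, h in enumerate(headers):
        if re.search(r"related\s+work", text[int(h["start"]):int(h["end"])], re.I):
            # 2. Heuristic: the section runs from this header to the next
            #    header, or to the end of the text if it is the last section.
            r_start = int(h["end"])
            r_end = int(headers[i + 1]["start"]) if i + 1 < len(headers) else len(text)
            break
    if r_start is None:
        return []

    # 3. Keep bibref spans contained in (r_start, r_end); note their ref_id.
    ref_ids = set()
    for b in load_spans(annotations, "bibref"):
        if r_start < int(b["start"]) and int(b["end"]) < r_end:
            ref_id = (b.get("attributes") or {}).get("ref_id")
            if ref_id:
                ref_ids.add(ref_id)

    # 4. Map ref_ids to bibentry matched_paper_ids. ASSUMPTION: bibentry
    #    attributes expose an "id" that lines up with bibref's "ref_id".
    matched = []
    for entry in load_spans(annotations, "bibentry"):
        attrs = entry.get("attributes") or {}
        if attrs.get("id") in ref_ids and attrs.get("matched_paper_id"):
            matched.append(attrs["matched_paper_id"])
    return matched
```

Step 5 would then use the returned matched_paper_ids as corpus ids to pull the full text of the cited papers in a second pass over the S2ORC files.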
cfiorelli commented 1 year ago

Resolved via Slack.