allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Example of using bulk search #144

Closed by milescrawford 1 year ago

milescrawford commented 1 year ago

Adds an example script that pulls down a small corpus using search bulk.
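
The script itself is not reproduced in this thread. As a rough illustration of what such a bulk-search download loop might look like, here is a minimal sketch. It assumes the bulk search endpoint lives at the URL shown, that each response is a JSON object with `total`, `token`, and `data` keys, and that paging ends when no `token` is returned. The function names (`http_fetch_page`, `iter_papers`, `save_jsonl`) are hypothetical and not taken from the PR.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for bulk search; confirm against the current API docs.
BULK_URL = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"


def http_fetch_page(query, fields, token=None):
    """Fetch one page of results over HTTP and return the decoded JSON."""
    params = {"query": query, "fields": fields}
    if token:
        params["token"] = token  # continuation token from the previous page
    url = f"{BULK_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_papers(fetch_page):
    """Yield papers page by page until a response carries no token.

    fetch_page(token) must return a dict shaped like
    {"total": int, "token": str or None, "data": [paper, ...]}
    (an assumed response shape).
    """
    page = fetch_page(None)
    print(f"Will retrieve an estimated {page.get('total')} documents")
    retrieved = 0
    while True:
        papers = page.get("data") or []
        retrieved += len(papers)
        print(f"Retrieved {retrieved} papers...")
        yield from papers
        token = page.get("token")
        if not token:
            break
        page = fetch_page(token)


def save_jsonl(papers, path):
    """Write one JSON object per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for paper in papers:
            out.write(json.dumps(paper) + "\n")
            count += 1
    return count
```

A call such as `save_jsonl(iter_papers(lambda t: http_fetch_page("sturgeon", "title,year", t)), "sturgeon.jsonl")` would tie the pieces together; the query string and field list here are guesses based on the output below, not values confirmed by the PR. Passing the page fetcher in as a callable keeps the paging logic testable without network access.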

Usage:

milesc@rainier 0 search_bulk ↠ python3 get_sturgeon_dataset.py
Will retrieve 11882 documents
Retrieved 1000 papers...
Retrieved 2000 papers...
Retrieved 3000 papers...
Retrieved 4000 papers...
Retrieved 5000 papers...
Retrieved 6000 papers...
Retrieved 6999 papers...
Retrieved 7999 papers...
Retrieved 8999 papers...
Retrieved 9999 papers...
Retrieved 10999 papers...
Done!

milesc@rainier 0 search_bulk ↠ head sturgeon.jsonl
{"paperId": "00000bb81d515d106dcd455357c1bae69a0eb1ee", "title": "Effect of chemical and physical factors on infectivity of Siberian sturgeon herpesvirus", "year": 2010}
{"paperId": "00045cbc571588950a22774175a7c06abec0be89", "title": "Marine Migration of North American Green Sturgeon", "year": 2008}
{"paperId": "000698bb3e1b9c75e3d78be9ab0d56dbe1e7514e", "title": "Amino Acid Composition in Different Parts of Farmed Sturgeon Acipenser schrenckii", "year": 2006}
...

cfiorelli commented 1 year ago

@milescrawford I see the demo initially says "Will retrieve 11882 documents", then the final retrieval says "10999", and grep -c '"paperId":' sturgeon.jsonl prints 11881. Maybe bug?

The documentation, the examples of search operators, and everything else are fantastic!

milescrawford commented 1 year ago

> @milescrawford I see the demo initially says "Will retrieve 11882 documents", then the final retrieval says "10999", and grep -c '"paperId":' sturgeon.jsonl prints 11881. Maybe bug?

Yeah... This is because one paper found in ES was not present in DynamoDB - something is a bit out of sync. The total is really more of a close estimate.

I'm not sure I can repair that, but I could highlight that the "total" figure is an estimate in the documentation.
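
Because the reported total is only an estimate, a client may prefer to count what was actually written instead of trusting that figure. A minimal sketch, assuming a JSONL file with one paper object per line (as in the `head` output earlier in this thread); the function names are hypothetical:

```python
import json


def count_papers(path):
    """Count JSONL lines that parse as objects carrying a paperId."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if isinstance(record, dict) and "paperId" in record:
                count += 1
    return count


def compare_with_estimate(path, estimated_total):
    """Return (actual, difference), where difference = actual - estimate."""
    actual = count_papers(path)
    return actual, actual - estimated_total
```

For the run discussed above, `compare_with_estimate("sturgeon.jsonl", 11882)` would be expected to report a difference of -1, matching the `grep -c` count of 11881.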

milescrawford commented 1 year ago

Okay! Took all suggestions. The script now prints output like this:

milesc@rainier 1 search_bulk ↠ python3 get_dataset.py
Will retrieve an estimated 14358 documents
Retrieved 1000 papers...
Retrieved 2000 papers...
Retrieved 3000 papers...
Retrieved 4000 papers...
Retrieved 5000 papers...
Retrieved 6000 papers...
Retrieved 7000 papers...
Retrieved 8000 papers...
Retrieved 9000 papers...
Retrieved 10000 papers...
Retrieved 11000 papers...
Retrieved 12000 papers...
Retrieved 13000 papers...
Retrieved 14000 papers...
Retrieved 14359 papers...
Done! Retrieved 14359 papers total