Closed milescrawford closed 1 year ago
@milescrawford I see the demo initially says "Will retrieve 11882 documents", then the final retrieval says "10999", and grep -c '"paperId":' sturgeon.jsonl prints 11881. Maybe bug?
The documentation, examples of search operators, and everything else are fantastic!
Yeah... This is because one paper found in ES was not present in DynamoDB - something is a bit out of sync. The total is really more of a close estimate.
I'm not sure I can repair that, but I could highlight that the "total" figure is an estimate in the documentation.
Okay! Took all suggestions. It now outputs like this:
milesc@rainier 1 search_bulk ↠ python3 get_dataset.py
Will retrieve an estimated 14358 documents
Retrieved 1000 papers...
Retrieved 2000 papers...
Retrieved 3000 papers...
Retrieved 4000 papers...
Retrieved 5000 papers...
Retrieved 6000 papers...
Retrieved 7000 papers...
Retrieved 8000 papers...
Retrieved 9000 papers...
Retrieved 10000 papers...
Retrieved 11000 papers...
Retrieved 12000 papers...
Retrieved 13000 papers...
Retrieved 14000 papers...
Retrieved 14359 papers...
Done! Retrieved 14359 papers total
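For anyone reading along, here is a rough sketch of the kind of loop that could produce output like the above. It is not the actual get_dataset.py. It assumes each page of results is a dict with an estimated "total", a list of records under "data", and a continuation "token" that is absent on the last page. `fetch_page` is a hypothetical callable standing in for whatever makes the HTTP request, and for brevity the sketch prints progress once per page.

```python
import json
from typing import Callable, Optional


def download(fetch_page: Callable[[Optional[str]], dict], out_path: str) -> int:
    """Write every record to a JSONL file and return how many were written.

    fetch_page(token) is a hypothetical helper: it takes the continuation
    token (None for the first page) and returns a dict that is assumed to
    contain "total", "data", and optionally "token".
    """
    retrieved = 0
    token = None
    first_page = True
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            page = fetch_page(token)
            if first_page:
                # The reported total is only an estimate, so label it as one.
                print(f"Will retrieve an estimated {page.get('total', 0)} documents")
                first_page = False
            for paper in page.get("data", []):
                out.write(json.dumps(paper) + "\n")
                retrieved += 1
            print(f"Retrieved {retrieved} papers...")
            token = page.get("token")
            if not token:
                # No continuation token: this was the last page.
                break
    # Report the count of records actually written, not the estimate.
    print(f"Done! Retrieved {retrieved} papers total")
    return retrieved
```

The final count comes from the records actually written rather than from the reported total, so a small mismatch between the two is expected and harmless.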
Adds an example script that pulls down a small corpus using search bulk.
Usage: