cambialens / lens-api-doc


"Size" "From" not working as expected #10

Closed Lvenable closed 5 years ago

Lvenable commented 5 years ago

If I do a web search on lens.org for publisher = "International Monetary Fund", I get over 17,000 scholarly documents. However, if I run the same search through the API and use "size" and "from" to try to get 1000 documents at a time, it fails. When "from" is set to 0 and "size" is 1000, I get 1000 documents returned. When I change "from" to anything over zero, I get a 400 response. Any idea what is going wrong in my code? Again, the code below works OK if I have "from" set to 0.

url = 'https://api.lens.org/scholarly/search'
data = '''{
     "query": {
           "match_phrase":{
                "source.publisher": "International Monetary Fund"
           }
     },
     "size": 1000,
     "from": 10,
     "sort": [
           {
                "year_published": "desc"
           }
     ]
}'''

rosharma9 commented 5 years ago

Hi @Lvenable, the "from" and "size" parameters are for paginating through a small number of records. For larger result sets (more than 1000 records), you should use our cursor-based pagination. This was answered in another ticket, which should be useful to you.

Please let me know if you need further help. Thank you.

Lvenable commented 5 years ago

Thanks! I hate to ask this, but I am very much a novice Pythoner. Do you have a complete sample script that pulls the scroll id from the current call and passes it to the next call? I’m cross-eyed trying to think how that might be done.

Thanks!

Linda

rosharma9 commented 5 years ago

Hi @Lvenable , You can do something like this to loop for all your records:

import requests
import time
url = 'https://api.lens.org/scholarly/search'
data = '''{
     "query": {
           "match_phrase":{
                "author.affiliation.name": "Harvard University"
           }
     },
     "size": 1,
     "sort": [
           {
                "year_published": "desc"
           }
     ],
     "scroll":"1m"
}'''

headers = {'Authorization': 'Bearer your_token', 'Content-Type': 'application/json'}

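def scroll(scroll_id):
  request_body = data  # the first request sends the full query
  while True:
    if scroll_id is not None:
      # later requests send only the scroll_id to fetch the next page
      request_body = '''{"scroll_id": "%s"}''' % scroll_id
    response = requests.post(url, data=request_body, headers=headers)
    if response.status_code == requests.codes.too_many_requests:
      # rate limited: wait, then retry the same request
      time.sleep(8)
      continue
    if response.status_code != requests.codes.ok:
      print(response)
      break
    result = response.json()
    if not result['data']:
      break  # assumption: an empty page means there are no more records
    scroll_id = result['scroll_id']
    print(result['data'])  # Do something with your data

scroll(scroll_id=None)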
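
If you would rather collect the records than print them, the same idea can be wrapped in a generator. This is only a rough sketch, not an official client: it assumes the response carries "data" and "scroll_id" keys as in the script above, and that an empty "data" list marks the end of the scroll. The names iter_records, requests_sender and send are made up for this example; send stands in for the requests.post call so the loop can be tried without hitting the API.

```python
import json
import time


def iter_records(send, query, scroll="1m", retry_wait=8, sleep=time.sleep):
    """Yield records one at a time from a scroll-paginated search.

    `send` is a hypothetical callable that takes a JSON string and
    returns (status_code, parsed_body_or_None).
    """
    # First request: the full query plus the scroll keep-alive window.
    body = json.dumps(dict(query, scroll=scroll))
    while True:
        status, payload = send(body)
        if status == 429:
            # Too many requests: wait, then retry the same body.
            sleep(retry_wait)
            continue
        if status != 200:
            raise RuntimeError("request failed with status %s" % status)
        records = payload.get("data") or []
        if not records:
            # Assumption: an empty page means the scroll is finished.
            return
        for record in records:
            yield record
        # Later requests carry only the scroll_id from the last response.
        body = json.dumps({"scroll_id": payload["scroll_id"]})


def requests_sender(url, headers):
    """Build a `send` callable backed by requests.post (imported lazily)."""
    import requests

    def send(body):
        response = requests.post(url, data=body, headers=headers)
        return response.status_code, response.json() if response.ok else None

    return send
```

With the url and headers from the script above, something like `for record in iter_records(requests_sender(url, headers), {"query": {...}, "size": 100}): ...` would walk through every record. Building the body with json.dumps instead of string formatting avoids quoting problems if a scroll id ever contains special characters.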