adsabs / adsabs-dev-api

Developer API service description and example client code

Number of rows returned from query limited to 2000? #39

Open jmangum opened 6 years ago

jmangum commented 6 years ago

Hello,

I am trying to extract citation statistics for various journals by running two queries looped over a range in years:

    for yr in yearlist:
        articles = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum", fl=fllist, rows=3000))
        zeroarticles = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum", fl=fllist, rows=3000))
        ...

I have found that:

(1) If I do not set the rows parameter, I get a maximum of 50 results.
(2) If I set rows to 2000, I get at most 2000 results.
(3) If I set rows to a number larger than 2000, I still get a maximum of 2000 results.
(4) It does not seem to matter whether I set rows to an int or a string in the SearchQuery call.

I need to be able to return more than 2000 results, or hack around this limit by running more queries over smaller time ranges (which might cause me to approach my query limit). Is there a reason for the rows=2000 upper limit? If not, can it be increased? Thanks.

-- Jeff

ghost commented 6 years ago

Hi Jeff, you're correct: the maximum number of rows is hardcoded at 2000 for this type of search, even if you set the rows parameter higher. To return the next set of results, use the start parameter (more info here: https://github.com/adsabs/adsabs-dev-api/blob/master/search.md#start). For example, to fetch the next set of results, increase start by rows for each request, i.e. start += rows (in your example, start += 2000, so the second request uses start=2000).

Kelly

On Mon, Jun 11, 2018 at 5:28 PM, Jeff Mangum notifications@github.com wrote:

--

Dr. Kelly Lockhart
Back-End Developer, NASA Astrophysics Data System
Harvard-Smithsonian Center for Astrophysics
60 Garden Street, Cambridge, MA 02138
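The start/rows paging described in this reply can be sketched roughly as follows. This is only an illustration, not code from the ads package or the ADS API: `fetch_all`, `fetch_page` and `make_fake_backend` are hypothetical names, and the fake backend stands in for a real search call that is assumed to return one page of documents together with the total number of matches (numFound). The loop advances start by the number of documents actually received rather than by the requested rows, so it still works when the server silently caps rows at 2000.

```python
def fetch_all(fetch_page, rows=2000):
    """Collect every result by requesting successive pages.

    fetch_page(start, rows) is a hypothetical callable that returns
    (docs, num_found) for one page of results.
    """
    results = []
    start = 0
    while True:
        docs, num_found = fetch_page(start, rows)
        results.extend(docs)
        # Advance by what was actually returned, in case the server
        # caps `rows` below the requested value.
        start += len(docs)
        # Stop on an empty page or once every match has been collected.
        if not docs or start >= num_found:
            break
    return results


def make_fake_backend(total, max_rows=2000):
    """Hypothetical in-memory stand-in for a search endpoint that
    never returns more than max_rows documents per request."""
    data = list(range(total))

    def fetch_page(start, rows):
        rows = min(rows, max_rows)
        return data[start:start + rows], total

    return fetch_page


# Asking for 3000 rows per page still yields all 4500 fake records.
articles = fetch_all(make_fake_backend(4500), rows=3000)
print(len(articles))  # 4500
```

Each page costs one request, so a year with 4500 matching papers would take three requests at the 2000-row cap.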

jmangum commented 6 years ago

Thanks for the response. Tried setting start+=2000, only to get a syntax error:

articles1 = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start+=2000,rows=2000))
                                                                                                                                                          ^

SyntaxError: invalid syntax

In fact, setting start to 2000 for the second pass through the search results in an index out of range error:

---> 37 articles1 = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=2000,rows=2000))

/Users/jmangum/anaconda/lib/python2.7/site-packages/ads/search.pyc in next(self)
    490
    491     def next(self):
--> 492         return self.__next__()
    493
    494     def __next__(self):

/Users/jmangum/anaconda/lib/python2.7/site-packages/ads/search.pyc in __next__(self)
    519             # extended .articles array.
    520             self.execute()
--> 521             cur = self._articles[self.__iter_counter]
    522
    523             self.__iter_counter += 1

IndexError: list index out of range

-- Jeff

romanchyla commented 6 years ago

Hi Jeff, it should be start=2000, but you got that right. That error comes from the ads package (@andycasey), where the iterator is probably not consulting numFound. I'm not familiar with the details of that code, but basically it either needs to fetch new results behind the scenes (start=current+rows&rows=2000) or exit (stop iteration).

I think you should update your ads package; the code at https://github.com/andycasey/ads/blob/master/ads/search.py#L498 seems right to me

If the problem persists, please create an issue with the ads package. Possibly the problem is here: https://github.com/andycasey/ads/blob/master/ads/search.py#L547 (you did specify the start parameter and the package may not be expecting it, but I only looked briefly).
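To make the "fetch more or stop" behaviour concrete, here is a minimal sketch of an iterator that checks numFound before indexing into its buffer. It is not the ads package's implementation: `PagedResults`, `fetch_page` and `make_fake_backend` are hypothetical names, and the in-memory backend only imitates a server that caps each page at 2000 rows. The point it illustrates is that the buffer index is kept separate from the start offset, so a non-zero start does not run past the end of the list.

```python
class PagedResults:
    """Iterator over search results that pages lazily.

    fetch_page(start, rows) is a hypothetical callable returning
    (docs, num_found) for one page.
    """

    def __init__(self, fetch_page, start=0, rows=2000):
        self._fetch_page = fetch_page
        self._next_start = start  # server-side offset of the next page
        self._rows = rows
        self._buffer = []         # documents fetched so far
        self._pos = 0             # index into _buffer, unrelated to start
        self._num_found = None    # unknown until the first response

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._buffer):
            # Buffer exhausted: stop if the server has nothing further.
            if self._num_found is not None and self._next_start >= self._num_found:
                raise StopIteration
            docs, self._num_found = self._fetch_page(self._next_start, self._rows)
            if not docs:
                raise StopIteration
            self._buffer.extend(docs)
            self._next_start += len(docs)
        doc = self._buffer[self._pos]
        self._pos += 1
        return doc


def make_fake_backend(total, max_rows=2000):
    """Hypothetical in-memory stand-in for the search endpoint."""
    data = list(range(total))

    def fetch_page(start, rows):
        rows = min(rows, max_rows)
        return data[start:start + rows], total

    return fetch_page


# Second pass over a 2500-record result set, starting at offset 2000.
second_pass = list(PagedResults(make_fake_backend(2500), start=2000))
print(len(second_pass))  # 500
```

With this structure, a start value beyond numFound simply produces an empty list instead of an IndexError.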

jmangum commented 6 years ago

Thanks, Roman. I believe I have the latest update (it is dated March 27, 2017):

torgo:Stats jmangum$ python -c "import ads; print(ads.__version__)"
0.12.3

I will create an issue with the ads package. Thanks again!

-- Jeff