andycasey / ads

Python tool for ADS

SearchQuery not respecting start parameter #104

Closed jmangum closed 5 years ago

jmangum commented 6 years ago

Hello,

I am trying to extract citation statistics for various journals by running four queries looped over a range of years. I have learned that I need to run each of the "articles" and "zeroarticles" queries in two iterations to get around the (apparently) hardcoded limit of 2000 rows:

for yr in yearlist:
    articles = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=2000))
    articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=2000,rows=2000))
    zeroarticles = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,rows=3000))
    zeroarticles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,start=2000,rows=2000))

Running this results in an index out of range error:

---> 37 articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=2000,rows=2000))

/Users/jmangum/anaconda/lib/python2.7/site-packages/ads/search.pyc in next(self)
    490 
    491     def next(self):
--> 492         return self.__next__()
    493 
    494     def __next__(self):

/Users/jmangum/anaconda/lib/python2.7/site-packages/ads/search.pyc in __next__(self)
    519             # extended .articles array.
    520             self.execute()
--> 521             cur = self._articles[self.__iter_counter]
    522 
    523         self.__iter_counter += 1

IndexError: list index out of range

Is it possible that the package is not expecting the start parameter? Thanks.

-- Jeff

Environment Information:

torgo:Stats jmangum$ python -c "import ads; print(ads.__version__)"
0.12.3
torgo:Stats jmangum$ python -V
Python 2.7.14 :: Anaconda custom (64-bit)

vsudilov commented 6 years ago

I'll look into this, but in the meantime: does using the suggested cursorMark way of paginating solve the issue for you?
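For reference, iterating a SearchQuery without a start parameter lets the library page through results on its own, using cursorMark under the hood. A minimal sketch, with illustrative parameters:

import ads

# No explicit start: follow-up pages are fetched automatically
# (via cursorMark) as the iterator advances.
query = ads.SearchQuery(q="star", fl=["bibcode"], rows=2000, max_pages=2)
papers = list(query)  # up to rows * max_pages records
print(len(papers))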

jmangum commented 6 years ago

Thank you Vladimir. I don't think I am aware of the suggested cursorMark way of paginating.

vsudilov commented 6 years ago

A cursory local test using q = ads.SearchQuery(q="star", start=2000, rows=2000) seems to return the expected data. If you are still having an issue, could you better explain how to reproduce it?

jonnybazookatone commented 6 years ago

Using a query similar to his:

>> q = '(year:"2017" bibstem:"ApJ" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum'
>> query = ads.SearchQuery(q=q, start=2000, rows=2000)
>> p = [i for i in query]

IndexError                                Traceback (most recent call last)
<ipython-input-26-34e674731b4a> in <module>()
----> 1 p = [i for i in query]

/Users/jelliott/anaconda2/lib/python2.7/site-packages/ads/search.pyc in next(self)
    490 
    491     def next(self):
--> 492         return self.__next__()
    493 
    494     def __next__(self):

/Users/jelliott/anaconda2/lib/python2.7/site-packages/ads/search.pyc in __next__(self)
    519             # extended .articles array.
    520             self.execute()
--> 521             cur = self._articles[self.__iter_counter]
    522 
    523         self.__iter_counter += 1

IndexError: list index out of range

As opposed to

>> query = ads.SearchQuery(q=q, start=2000, rows=query.response.numFound-2000)
>> p = [i for i in query]
>> len(p)

1059

I haven't checked extensively, but should this line: https://github.com/andycasey/ads/blob/master/ads/search.py#L509

become something like

if len(self.articles) >= self.response.numFound-query.response.json['responseHeader']['params']['start']
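For illustration, a hedged sketch of how that stopping condition might read inside the iterator; attribute names follow the traceback above, and this is not the shipped library code:

# Hypothetical fix sketch: offset numFound by the initial start so
# iteration stops at the true end of the requested window.
start = int(self.response.json['responseHeader']['params'].get('start', 0))
if len(self.articles) >= self.response.numFound - start:
    raise StopIteration()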
vsudilov commented 6 years ago

Oh ok, so it looks like we aren't taking into account which page of results we're on when doing the iteration. If so, your suggestion seems like it would solve it for the start case, but I'm not sure about the cursorMark case. I'll pick it up sometime in the near- to mid-term future unless someone else gets to it before me.

jonnybazookatone commented 6 years ago

Yeah, I won't be able to do anything relatively quickly.

@jmangum for the time being, you can hack around it by explicitly playing with the numFound parameter. For example:

for yr in yearlist:
    query = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=2000)
    articles = list(query)

    numFound = query.response.numFound
    start = 2000
    articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=start, rows=numFound-start))
jmangum commented 6 years ago

Thank you @jonnybazookatone and @vsudilov for the help and workaround. Works great!

jmangum commented 6 years ago

Sorry to throw this back, but there seems to be something not quite right with the workaround. When there are more than 2000 results and I need to go through a second search, the final list of articles includes some duplicates. Here is the relevant segment of code (full script attached below):

rowlim = 2000
for yr in yearlist:
    query = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=rowlim,sort='pubdate+desc')
    articles = list(query)
    numFound = query.response.numFound
    print('numFound = '+str(numFound))
    pdb.set_trace()
    if numFound > rowlim:
        start = rowlim
        articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=start,rows=numFound-start,sort='pubdate+desc'))

I also attach an ascii file which lists the duplicates found for a particular search. MNRAScitedupes.txt

-- Jeff

#
# Script for extracting article and citation statistics for journals using ADS Bumblebee
# Specifically designed to look at zero-citation articles as a function of year.
#
# Set variables journal, beginyear, endyear to extract statistics for a specific journal
# over a time span of beginyear to endyear
#
# import ads
import ads.sandbox as ads
import string
from astropy.io import ascii
from astropy.table import Table
import pdb
import os
import requests
import json

ads.config.token = 'nwukXXagC7R63FQb0sUOGApJPlplqSXHM6aetOd7'
#
# Initialize variables
#
journal = 'MNRAS'
beginyear = 2008
endyear = 2014
#
r = ads.RateLimits('SearchQuery')
ads.config.token = 'my token'
citesearch = []
zerociteinfo = []
fllist = ['id', 'bibcode', 'title', 'citation_count', 'aff', 'author', 'keyword']
yearlist = [str(yr) for yr in range(beginyear,endyear+1)]
outfilecites = journal+'citesearch'+str(beginyear)+str(endyear)+'.txt'
outfilezeros = journal+'zerocites'+str(beginyear)+str(endyear)+'.txt'
keystrings = ['Star','Stellar','History']
fzero = open(outfilezeros,'w')
citedupes = open(journal+'citedupes.txt','w')
#
# ADS API sets limit to number of rows it can grab to 2000
#
rowlim = 2000
for yr in yearlist:
    query = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=rowlim,sort='pubdate+desc')
    articles = list(query)
    numFound = query.response.numFound
    print('numFound = '+str(numFound))
    pdb.set_trace()
    if numFound > rowlim:
        start = rowlim
        articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=start,rows=numFound-start,sort='pubdate+desc'))
    #
    zeroarticles = [a for a in articles if a.citation_count == 0]
    bibcodes = []
    # Determine uniqueness of articles list
    for j in range(len(articles)):
        bibcodes.append(str(articles[j].bibcode))
    seen = {}
    dupes = []
    for x in bibcodes:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1
    citedupes.write('Year = '+str(yr)+' Number of Articles = '+str(len(articles))+'\n'+'Dupes: '+str([dupes[j] for j in range(len(dupes))])+'\n')
    #zeroquery = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,rows=rowlim)
    #zeroarticles = list(zeroquery)
    #numFound = zeroquery.response.numFound
    #if numFound > rowlim:
    #    start = rowlim
    #    zeroarticles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,start=start,rows=numFound-start))

    print('...Finished queries...now building output payload for '+journal+' for year '+yr+'...')
    totalcites = 0
    cite2011sum = 0
    #for j in range(0,len(articles)):
    #    totalcites+=articles[j].citation_count
    totalcites = sum([a.citation_count for a in articles])
    print('year = '+str(yr)+' totalcites = '+str(totalcites))

    if yr == '2011':
        for j in range(len(articles)):
            cite2011.write(str(articles[j].bibcode)+','+str(articles[j].citation_count)+'\n')
        cite2011sum = sum([a.citation_count for a in articles])

    citesearch.append([string.atoi(yr),len(articles),totalcites,len(zeroarticles)])
    zerociteinfo.append([string.atoi(yr),len(zeroarticles),[zeroarticles[i].bibcode for i in range(0,len(zeroarticles))],[zeroarticles[i].author for i in range(0,len(zeroarticles))],[zeroarticles[i].aff for i in range(0,len(zeroarticles))],[zeroarticles[i].keyword for i in range(0,len(zeroarticles))]])
    #
    fzero.write('\n'+yr+'\n')
    print('Writing zero citation articles for '+yr+'...')
    countkey = [0 for key in keystrings]
    for j in range(0,len(zeroarticles)):
        titlestr = ''
        authstr = ''
        affstr = ''
        keywordstr = ''
        # NOTE: Need to strip unicode characters in order to print later
        for title in zeroarticles[j].title:
            #titlestr+=str(title)+';'
            titlestr+=''.join([x for x in title if ord(x) < 127])
        try:
            for auth in zeroarticles[j].author:
                #authstr+=str(auth)+';'
                authstr+=''.join([x for x in auth if ord(x) < 127])
            for aff in zeroarticles[j].aff:
                #affstr+=str(aff)+';'
                affstr+=''.join([x for x in aff if ord(x) < 127])
            for keyw in zeroarticles[j].keyword:
                #keywordstr+=str(keyw)+';'
                keywordstr+=''.join([x for x in keyw if ord(x) < 127])
        except(TypeError):
            pass
        #pdb.set_trace()
        fzero.write('=====================\n'+zeroarticles[j].bibcode+'\n'+titlestr+'\n'+authstr+'\n'+affstr+'\n'+keywordstr+'\n')
        try:
            for keystr in keystrings:
                if len([s for s in zeroarticles[j].keyword if keystr in s]) != 0:
                    countkey[keystrings.index(keystr)]+=1
        except(TypeError):
            pass
    fzero.write('========= Numbers of Articles with at Least One Keyword Occurrence =========\n')
    for keystr in keystrings:
        fzero.write(keystr+' articles = '+str(countkey[keystrings.index(keystr)])+' of '+str(len(zeroarticles))+' in year '+str(yr)+'\n')
#
# Write citesearch results to ascii table for export using astropy ascii table writer
#
citedupes.close()
fzero.close()
zipcitesearch = zip(*citesearch)
zipzeroarticlesinfo = zip(*zerociteinfo)
citesearchdat = Table([list(zipcitesearch[0]),list(zipcitesearch[1]),list(zipcitesearch[2]),list(zipcitesearch[3])],names=['Year','Total Articles','Total Cites','Zero Cites'])
zeroarticlesdat = Table([list(zipzeroarticlesinfo[0]),list(zipzeroarticlesinfo[1]),list(zipzeroarticlesinfo[2]),list(zipzeroarticlesinfo[3]),list(zipzeroarticlesinfo[4]),list(zipzeroarticlesinfo[5])],names=['Year','Zero Cites','Bibcodes','ZeroAuthors','ZeroAff','ZeroKeywords'])
ascii.write(citesearchdat,outfilecites,format='csv',overwrite=True)
ascii.write(zeroarticlesdat,outfilezeros,format='csv',overwrite=True)

print(r.limits)
print('Reset date: ')
os.system('date -r '+r.limits['reset'])

vsudilov commented 6 years ago

I'm happy to follow up to make sure that this library takes into account the current page of the result set when deciding to continue iteration or not, which seems to be the initial bug you reported. Maybe @aaccomazzi could help with your specific use case at the moment?

jmangum commented 6 years ago

Thanks @vsudilov. I have been working with Edwin Henneken on the ADS side, but since this bug involves the rows limit in the ads package, there really does not seem to be anything that ADS can do.

A comment about the search I included in the previously attached version of the script: I added a sort a while back for debugging, but I have since run a test with the sort removed. With the sort removed I get a different list of duplicates.

jmangum commented 6 years ago

In case it helps, there does seem to be a dependence on whether I sort the search output or not with regard to the number of duplicates I get. For the same search (year and journal) I get a much smaller number of duplicates if I sort by pubdate. Attached are two files which list the duplicates for both tests. MNRAScitedupes-nosort.txt MNRAScitedupes-withsortbypubdate.txt

vsudilov commented 6 years ago

Just to be clear, my understanding of this issue is that the library doesn't fail gracefully when iterating on the last page of results. The data will be unchanged if you simply catch the exception; any change for this issue is not expected to change the results retrieved.
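That is, as a sketch (reusing the q from the reproduction above):

import ads

q = '(year:"2017" bibstem:"ApJ" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum'
articles = []
try:
    for article in ads.SearchQuery(q=q, start=2000, rows=2000):
        articles.append(article)
except IndexError:
    pass  # the iterator overran the last page; results gathered so far are intact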


jmangum commented 6 years ago

I have found a workaround to the duplication issue. My sorting tests were sorting by pubdate, which I realized is not fine-grained enough to assure uniqueness from query to query. By sorting on a unique article identifier, I figured that I would then get consistent query results from one query to the next. Indeed, by using sort='bibcode+asc' I get the right results. Spot-checked against output from the API and found to be correct.
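A sketch of that pattern, assuming the same kind of query as earlier in the thread:

import ads

q = '(year:"2014" bibstem:"MNRAS" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum'
# A unique sort key gives both requests the same global ordering, so the
# start=2000 window no longer overlaps the first page.
query = ads.SearchQuery(q=q, rows=2000, sort='bibcode+asc')
articles = list(query)
numFound = query.response.numFound
if numFound > 2000:
    articles += list(ads.SearchQuery(q=q, start=2000, rows=numFound - 2000,
                                     sort='bibcode+asc'))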

I don't know what this says about the start and rows parameters in SearchQuery, other than that they clearly do not currently allow for indexing through multiple queries.

aaccomazzi commented 6 years ago

First, a bit of clarification about the two ways in which pagination is supported via the ADS API. When start is not supplied, the follow-up queries use SOLR's cursorMark feature, which is supposed to provide efficient pagination of the search results. This is what I would recommend everybody do, but obviously we can't enforce it. Our back-end architecture was built with the intent to route the follow-up queries to the same SOLR instance so that cursorMark can be honored, but it's possible that there may be issues with our implementation (or simply that the SOLR instance which responded to the original query goes out of service and subsequent queries cannot be satisfied). We welcome any feedback on this.

Now as to the current issue(s):

  1. It appears that there is a bug in the library code which does not properly account for the number of remaining records when pagination occurs via start and rows. This should be a relatively easy fix on your end; I believe @vsudilov and @jonnybazookatone have identified the problem.

  2. Even with this bug fixed (or with the suggested workaround applied) there may be an issue with inconsistent results being returned due to randomness in the way the list of results is ordered. SOLR orders its results by score or, when provided, by a specific sort order. I'm not 100% sure about this, but I suspect that records which have the same score may appear in a different order when the same query is re-issued at a later time, unless the sort order forces a unique ordering. This means that a follow-up query with the same q but a different start may have the order of a few records changed.

  3. Related to the above (and to my surprise), I just noticed that the library changes the default sort order between the initial and follow-up queries (see https://github.com/andycasey/ads/blob/master/ads/search.py#L388). This seems wrong to me, as it potentially triggers non-deterministic ranking, which can lead to records being duplicated or missed when the results are retrieved chunk after chunk via the start mechanism. To avoid these problems, I would always suggest adding a secondary sort which makes the ordering deterministic (such as id desc), whether or not a sort order is specified by the user; see the sketch below.
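A sketch of that suggestion; the primary sort key here is illustrative:

import ads

# Tie-break on a unique field so equal-score records always rank the same
# way across repeated or paged queries.
query = ads.SearchQuery(q='bibstem:MNRAS year:2014', rows=2000,
                        sort='citation_count desc,id desc')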

Encouraging additional feedback on this by @romanchyla and @marblestation and FYI @ehenneken

jmangum commented 6 years ago

Thank you for the insight @aaccomazzi. It appears that (completely by accident) I have hit on a reasonable solution to this problem. Just out of curiosity, is it better to use a sort by id rather than bibcode (as I have done)?

aaccomazzi commented 6 years ago

Sorting by id will shave ~3 milliseconds off a query response ;-) (I personally tend to sort by bibcode desc so that I see the most recent articles first.)

jmangum commented 6 years ago

Ah! Got to have the speed! Thanks @aaccomazzi.

vsudilov commented 6 years ago

Re point 3, that's setting defaults during query/object initialization -- it remains unmodified during the lifetime of that object, and thus sorting is expected to remain unchanged. The object is intended to be used to fetch all results corresponding to a single logical search -- it supports fetching all results via start/rows, cursorMark, and max_pages as pagination control mechanisms.

If a user wants to instantiate multiple SearchQuery objects that correspond to one logical search, it is up to the user to ensure those objects are instantiated in the same way. Apologies for not picking up on that earlier, @jmangum.

jonnybazookatone commented 5 years ago

Closing due to staleness.

oneLuckyfish commented 1 year ago

Why are the total number of filtered articles and the cumulative citation_count inconsistent with manual searches? Thank you! [two screenshots attached]

import ads
import os
import datetime as dt
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

token = '*****'

def query_counts(keywords, query, year, acknowledgements=False):
    if acknowledgements:
        query = 'ack:' + query
    modifiers = ' '.join([f'year:{year}'])
    full_query = ' '.join([f"abs:('{query}')", modifiers])
    filter_query = ['database:astronomy', 'property:refereed', 'doctype:article']
    papers = ads.SearchQuery(q=full_query, fq=filter_query, token=token, sort="citation_count")
    papers.execute()
    results_count = int(papers.response.numFound)
    print(results_count)
    citation_count_num = 0
    for n in papers.articles:
        citation_count_num += n.citation_count
    print(modifiers, full_query, results_count, citation_count_num)
    return dict(keywords=keywords, query=query, year=year, count=results_count,
                citation_count_num=citation_count_num)

DATA = {
    'LAMOST': ['LAMOST'],
    'SDSS': ['SDSS'],
    'SDSS_Official': ['"BOSS" OR "APOGEE" OR "eBOSS" OR "MARVELS" OR "MANGA" OR "SDSS" OR ("Sloan" AND "Survey")) OR '
                      'title:("BOSS" OR "APOGEE" OR "eBOSS" OR "MARVELS" OR "MANGA" OR "SDSS" OR ("Sloan" AND '
                      '"Survey")'],
    'SDSS Spectrum': ['SDSS Spectrum'],
}

filename = 'ADS_results1.csv'
years = []
for y in range(2022, 2023):
    years.append(str(y))
years.append('1994-2022')
if not os.path.exists(filename):
    results = pd.DataFrame([query_counts(keywords, query, year)
                            for keywords, queries in DATA.items()
                            for query in queries
                            for year in years])
    results.to_csv(filename, index=False)

aaccomazzi commented 1 year ago

Your UI filters show that you are filtering by requiring a publication type of "Article." Because this is a hierarchical filter, it's a little tricky to deal with. The publication type "Article" corresponds to the union of the following doctypes: article (for journal articles), eprint (for preprints), inbook (for book chapters), and inproceedings (for conference proceedings articles), so the proper query is: doctype:(article OR eprint OR inbook OR inproceedings). The UI's filter query accomplishes this using a slightly different approach which is a bit cryptic, but FYI: doctype_facet_hier:0/Article. Either method will work fine for what you want.
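Applied to the query_counts function above, only the filter list changes; a sketch:

# Union of the four doctypes behind the UI's "Article" facet.
filter_query = ['database:astronomy',
                'property:refereed',
                'doctype:(article OR eprint OR inbook OR inproceedings)']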


oneLuckyfish commented 1 year ago


Thank you very much! Your tips were very helpful to me and now work properly. Also, can you provide a use case (preferably code) for CursorMark?

aaccomazzi commented 1 year ago

You should never need to use CursorMark as it's a server-generated identifier that allows efficient paginating of results. This library (in the SearchQuery class) properly deals with managing CursorMark for you when follow-up queries are requested. If you want all results to be returned, simply set a high max_pages parameter and they will be fetched for you iteratively.
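For example, something like this; the query and page count are illustrative:

import ads

# Choose max_pages so that rows * max_pages covers numFound; each
# follow-up page is fetched as iteration proceeds.
query = ads.SearchQuery(q='abs:"SDSS" year:1994-2022', rows=2000, max_pages=20)
papers = list(query)
print(len(papers))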

oneLuckyfish commented 1 year ago

You should never need to use CursorMark as it's a server-generated identifier that allows efficient paginating of results. This library (in the SearchQuery class) properly deals with managing CursorMark for you when follow-up queries are requested. If you want all results to be returned, simply set a high max_pages parameter and they will be fetched for you iteratively.

I've tried increasing the max_pages parameter, but I still only get 2000 results back, and I can't get correct citation counts for more than 2,000 articles. I still need your guidance, thank you very much! Here's the code I tried using cursorMark. My first modified code has been put into my ads_test repository.

papers = ads.SearchQuery(q=full_query, fq=filter_query, token=token, rows=2000, max_pages=100, sort="citation_count")

aaccomazzi commented 1 year ago

This is what I use and it works fine with pagination: https://gist.github.com/aaccomazzi/b205a41fcee5f31065816eb9f06f748a

For example:

$ python adsquery.py --pages 2000 --format csv SDSS > SDSS.csv
$ wc -l SDSS.csv
   22692 SDSS.csv
oneLuckyfish commented 1 year ago

This is what I use and it works fine with pagination: https://gist.github.com/aaccomazzi/b205a41fcee5f31065816eb9f06f748a

For example:

$ python adsquery.py --pages 2000 --format csv SDSS > SDSS.csv
$ wc -l SDSS.csv
   22692 SDSS.csv

For example, I want to get the number of all SDSS articles from 1994 to 2022, which can be achieved with SearchQuery and gives the correct result. [screenshots attached] This result clearly exceeds the maximum of 2000, yet even with a large max_pages only 2000 articles are visible in PyCharm's debugger. Therefore, I cannot accumulate the citations of the 10,419 articles; I only get the total citations of the first 2,000. I wanted to ask whether I could use cursorMark instead of the while loop in the ads_test in my repository to add up the citations of all 10,419 articles. (ads_test: https://github.com/oneLuckyfish/ads_test/blob/main/ads_test_2.py) Thank you again for your patience!

aaccomazzi commented 1 year ago

The script in question does not set max_pages at all, which is the problem.
Use the logic implemented in my script and it will work; there is no need to use the start parameter because pagination is already taken care of.
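In other words, something along these lines; a sketch mirroring the gist's approach, with illustrative query parameters:

import ads

query = ads.SearchQuery(q="abs:'SDSS' year:1994-2022",
                        fq=['database:astronomy', 'property:refereed'],
                        fl=['bibcode', 'citation_count'],
                        rows=2000, max_pages=100)
# Iteration pages through the full result set, so the sum covers every
# matching article, not just the first 2000.
total_citations = sum(p.citation_count or 0 for p in query)
print(total_citations)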