I'll look into this, but in the meantime, does using the suggested cursorMark way of paginating solve the issue for you?
Thank you Vladimir. I don't think I am aware of the suggested cursorMark way of paginating.
A cursory local test using q = ads.SearchQuery(q="star", start=2000, rows=2000)
seems to return the expected data. If you are still having an issue, could you better explain how to reproduce it?
Using a query similar to his:
>> q = '(year:"2017" bibstem:"ApJ" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum'
>> query = ads.SearchQuery(q=q, start=2000, rows=2000)
>> p = [i for i in query]
IndexError Traceback (most recent call last)
<ipython-input-26-34e674731b4a> in <module>()
----> 1 p = [i for i in query]
/Users/jelliott/anaconda2/lib/python2.7/site-packages/ads/search.pyc in next(self)
490
491 def next(self):
--> 492 return self.__next__()
493
494 def __next__(self):
/Users/jelliott/anaconda2/lib/python2.7/site-packages/ads/search.pyc in __next__(self)
519 # extended .articles array.
520 self.execute()
--> 521 cur = self._articles[self.__iter_counter]
522
523 self.__iter_counter += 1
IndexError: list index out of range
As opposed to
>> query = ads.SearchQuery(q=q, start=2000, rows=query.response.numFound-2000)
>> p = [i for i in query]
>> len(p)
1059
I haven't checked extensively, but should this line: https://github.com/andycasey/ads/blob/master/ads/search.py#L509
become something like:
if len(self.articles) >= self.response.numFound-query.response.json['responseHeader']['params']['start']
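For illustration, a rough sketch (not the actual library code) of what the end-of-results check inside __next__ might look like, assuming the response header echoes back the start parameter:

# Hypothetical: stop iterating once everything from the current offset onward has been consumed.
start = int(self.response.json['responseHeader']['params'].get('start', 0))
if len(self.articles) >= self.response.numFound - start:
    raise StopIteration()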
Oh ok, so it looks like we aren't taking into account which page of results we're on when doing the iteration. If so, your suggestion seems like it would solve it for the start case, but I'm not sure about the cursorMark case. I'll pick it up sometime in the near- to mid-term future unless someone else gets to it before me.
Yeah, I won't be able to do anything relatively quickly.
@jmangum for the time being, you can hack around it by explicitly playing with the numFound parameter. For example:
for yr in yearlist:
    query = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=2000)
    articles = list(query)
    numFound = query.response.numFound
    start = 2000
    articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=start,rows=numFound-start))
Thank you @jonnybazookatone and @vsudilov for the help and workaround. Works great!
Sorry to throw this back, but there seems to be something not quite right with the workaround. When rows > 2000, and I need to go through a second search, the final list of articles includes some duplicates. Here is the relevant segment of code (full script attached below):
rowlim = 2000
for yr in yearlist:
    query = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=rowlim,sort='pubdate+desc')
    articles = list(query)
    numFound = query.response.numFound
    if numFound > rowlim:
        start = rowlim
        articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=start,rows=numFound-start,sort='pubdate+desc'))
I also attach an ascii file which lists the duplicates found for a particular search. MNRAScitedupes.txt
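In the meantime, one way to drop the duplicates on the client side is something like the following (a sketch, assuming bibcodes uniquely identify articles):

# Keep only the first occurrence of each bibcode, preserving query order.
seen = set()
unique_articles = []
for a in articles:
    if a.bibcode not in seen:
        seen.add(a.bibcode)
        unique_articles.append(a)
articles = unique_articles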
-- Jeff
#
#
#
import ads
import string
from astropy.io import ascii
from astropy.table import Table
import pdb
import os
#
#
journal = 'MNRAS'
beginyear = 2008
endyear = 2014
#
r = ads.RateLimits('SearchQuery')
ads.config.token = 'my token'
citesearch = []
zerociteinfo = []
fllist = ['id', 'bibcode', 'title', 'citation_count', 'aff', 'author', 'keyword']
yearlist = [str(yr) for yr in range(beginyear,endyear+1)]
outfilecites = journal+'citesearch'+str(beginyear)+str(endyear)+'.txt'
outfilezeros = journal+'zerocites'+str(beginyear)+str(endyear)+'.txt'
keystrings = ['Star','Stellar','History']
fzero = open(outfilezeros,'w')
citedupes = open(journal+'citedupes.txt','w')
#
rowlim = 2000
for yr in yearlist:
    query = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=rowlim,sort='pubdate+desc')
    articles = list(query)
    numFound = query.response.numFound
    if numFound > rowlim:
        start = rowlim
        articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=start,rows=numFound-start,sort='pubdate+desc'))
    #
    zeroarticles = [a for a in articles if a.citation_count == 0]
    bibcodes = []
    # Determine uniqueness of articles list
    for j in range(len(articles)):
        bibcodes.append(str(articles[j].bibcode))
    seen = {}
    dupes = []
    for x in bibcodes:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1
    citedupes.write('Year = '+str(yr)+' Number of Articles = '+str(len(articles))+'\n'+'Dupes: '+str([dupes[j] for j in range(len(dupes))])+'\n')
    #zeroquery = ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,rows=rowlim)
    #zeroarticles = list(zeroquery)
    #numFound = zeroquery.response.numFound
    #if numFound > rowlim:
    #    start = rowlim
    #    zeroarticles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,start=start,rows=numFound-start))
    print('...Finished queries...now building output payload for '+journal+' for year '+yr+'...')
    totalcites = 0
    cite2011sum = 0
    #for j in range(0,len(articles)):
    #    totalcites+=articles[j].citation_count
    totalcites = sum([a.citation_count for a in articles])
    print('year = '+str(yr)+' totalcites = '+str(totalcites))
    citesearch.append([string.atoi(yr),len(articles),totalcites,len(zeroarticles)])
    zerociteinfo.append([string.atoi(yr),len(zeroarticles),[zeroarticles[i].bibcode for i in range(0,len(zeroarticles))],[zeroarticles[i].author for i in range(0,len(zeroarticles))],[zeroarticles[i].aff for i in range(0,len(zeroarticles))],[zeroarticles[i].keyword for i in range(0,len(zeroarticles))]])
    #
    fzero.write('\n'+yr+'\n')
    print('Writing zero citation articles for '+yr+'...')
    countkey = [0 for key in keystrings]
    for j in range(0,len(zeroarticles)):
        titlestr = ''
        authstr = ''
        affstr = ''
        keywordstr = ''
        # NOTE: Need to strip unicode characters in order to print later
        for title in zeroarticles[j].title:
            #titlestr+=str(title)+';'
            titlestr+=''.join([x for x in title if ord(x) < 127])
        try:
            for auth in zeroarticles[j].author:
                #authstr+=str(auth)+';'
                authstr+=''.join([x for x in auth if ord(x) < 127])
            for aff in zeroarticles[j].aff:
                #affstr+=str(aff)+';'
                affstr+=''.join([x for x in aff if ord(x) < 127])
            for keyw in zeroarticles[j].keyword:
                #keywordstr+=str(keyw)+';'
                keywordstr+=''.join([x for x in keyw if ord(x) < 127])
        except(TypeError):
            pass
        #pdb.set_trace()
        fzero.write('=====================\n'+zeroarticles[j].bibcode+'\n'+titlestr+'\n'+authstr+'\n'+affstr+'\n'+keywordstr+'\n')
        try:
            for keystr in keystrings:
                if len([s for s in zeroarticles[j].keyword if keystr in s]) != 0:
                    countkey[keystrings.index(keystr)]+=1
        except(TypeError):
            pass
    fzero.write('========= Numbers of Articles with at Least One Keyword Occurrence =========\n')
    for keystr in keystrings:
        fzero.write(keystr+' articles = '+str(countkey[keystrings.index(keystr)])+' of '+str(len(zeroarticles))+' in year '+str(yr)+'\n')
#
#
citedupes.close()
fzero.close()
zipcitesearch = zip(*citesearch)
zipzeroarticlesinfo = zip(*zerociteinfo)
citesearchdat = Table([list(zipcitesearch[0]),list(zipcitesearch[1]),list(zipcitesearch[2]),list(zipcitesearch[3])],names=['Year','Total Articles','Total Cites','Zero Cites'])
zeroarticlesdat = Table([list(zipzeroarticlesinfo[0]),list(zipzeroarticlesinfo[1]),list(zipzeroarticlesinfo[2]),list(zipzeroarticlesinfo[3]),list(zipzeroarticlesinfo[4]),list(zipzeroarticlesinfo[5])],names=['Year','Zero Cites','Bibcodes','ZeroAuthors','ZeroAff','ZeroKeywords'])
ascii.write(citesearchdat,outfilecites,format='csv',overwrite=True)
print(r.limits)
print('Reset date: ')
os.system('date -r '+r.limits['reset'])
I'm happy to follow up to make sure that this library takes into account the current page of the result set when deciding to continue iteration or not, which seems to be the initial bug you reported. Maybe @aaccomazzi could help with your specific use case at the moment?
Thanks @vsudilov. I have been working with Edwin Henneken on the ADS side, but since this bug involves the rows limit in the ads package, there really does not seem to be anything that ADS can do.
A comment about the search I included in the previously attached version of the script. I added a sort a while back for debugging, but have done a test where I remove the sort. With the sort removed I get a different list of duplicates.
In case it helps, there does seem to be a dependence of the number of duplicates on whether or not I sort the search output. For the same search (year and journal) I get a much smaller number of duplicates if I sort by pubdate. Attached are two files which list the duplicates for both tests. MNRAScitedupes-nosort.txt MNRAScitedupes-withsortbypubdate.txt
Just to be clear, my understanding of this issue is that the library doesn't fail gracefully when iterating on the last page of results. The data will be unchanged if you simply catch the exception; any change for this issue is not expected to change the results retrieved.
I have found a workaround to the duplication issue. My sorting tests were sorting by pubdate, which I realized is not fine-grained enough to assure uniqueness from query to query. By sorting by a unique article identifier, I figured that I would then get consistent query results from one query to the next. Indeed, by using sort='bibcode+asc' I get the right results. Spot-checked with output from the API and found to be correct.
I don't know what this says about the start and rows parameters in the SearchQuery, other than that it clearly does not currently allow for indexing through multiple queries.
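In code, the workaround looks roughly like this (a sketch of the pattern above with the sort changed to a unique key; the query string is the same as in my script):

q = "(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum"
query = ads.SearchQuery(q=q, fl=fllist, rows=rowlim, sort='bibcode+asc')
articles = list(query)
numFound = query.response.numFound
if numFound > rowlim:
    # Follow-up query with the same deterministic sort, so the second chunk does not overlap the first.
    articles += list(ads.SearchQuery(q=q, fl=fllist, start=rowlim, rows=numFound-rowlim, sort='bibcode+asc'))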
First, a bit of clarification about the two ways in which pagination is supported via the ADS API. When start is not supplied, the follow-up queries use SOLR's cursorMark feature, which is supposed to provide efficient pagination of the search results. This is what I would recommend everybody do, but obviously we can't enforce it. Our back-end architecture was built with the intent to route the follow-up queries to the same SOLR instance so that cursorMark can be honored, but it's possible that there are issues with our implementation (or simply that the SOLR instance which responded to the original query goes out of service and subsequent queries cannot be satisfied). We welcome any feedback on this.
Now as to the current issue(s):
It appears that there is a bug in the library code which does not properly account for the number of remaining records when pagination occurs via start and rows. This should be a relatively easy fix on your end; I believe @vsudilov and @jonnybazookatone have identified the problem.
Even with this bug fixed (or with the suggested workaround applied) there may be an issue with inconsistent results being returned due to randomness in the way the list of results is ordered. SOLR orders its results by score or, when provided, a specific sort order. I'm not 100% sure about this, but I suspect that in a list of results, records which have the same score may appear in a different order when the same query is re-issued at a later time, unless the sort order forces a unique ordering. This means that a follow-up query with the same q but a different start may have the order of a few records changed.
Related to the above (and to my surprise), I just noticed that the default sort order is changed between the initial and follow-up query in the library (see https://github.com/andycasey/ads/blob/master/ads/search.py#L388). This seems wrong to me, as it potentially triggers non-deterministic ranking, which can lead to records being duplicated or missed when the results are retrieved chunk after chunk via the start mechanism. To avoid these problems, I would always suggest adding a secondary sort which makes the list deterministic (such as id desc), whether or not a sort order is specified by the user.
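As an illustration, a caller can already force a deterministic ordering by appending such a tiebreaker to whatever sort they use (a sketch; the comma-separated multi-field sort string follows standard SOLR syntax):

# Sort by pubdate, breaking ties with the unique id field so repeated queries return a stable order.
query = ads.SearchQuery(q=q, fl=fllist, rows=2000, sort="pubdate desc, id desc")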
Encouraging additional feedback on this by @romanchyla and @marblestation and FYI @ehenneken
Thank you for the insight @aaccomazzi. It appears that (completely by accident) I have hit on a reasonable solution to this problem. Just out of curiosity, is it better to use a sort by id rather than bibcode (as I have done)?
Sorting by id will shave ~3 milliseconds on a query response ;-)
(I personally tend to sort by bibcode desc so that I see the most recent articles first.)
Ah! Got to have the speed! Thanks @aaccomazzi.
Re point 3, that's setting defaults during query/object initialization -- it remains unmodified during the lifetime of that object, and thus sorting is expected to remain unchanged. The object is intended to be used to fetch all results corresponding to a single logical search -- it supports fetching all results via start/rows, cursorMark, and max_pages as pagination control mechanisms.
If a user wants to instantiate multiple SearchQuery objects that correspond to one logical search, it is up to the user to ensure those objects are instantiated in the same way. Apologies for not picking up on that earlier, @jmangum.
Closing due to staleness.
Why are the total number of filtered articles and the cumulative citation_count inconsistent with manual searches? Thank you!
import ads
import os
import datetime as dt
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
token = '*****'
def query_counts(keywords, query, year, acknowledgements=False):
    if acknowledgements:
        query = 'ack:' + query
    modifiers = ' '.join([f'year:{year}'])
    full_query = ' '.join([f"abs:('{query}')", modifiers])
    filter_query = ['database:astronomy', 'property:refereed', 'doctype:article']
    papers = ads.SearchQuery(q=full_query, fq=filter_query, token=token, sort="citation_count")
    papers.execute()
    results_count = int(papers.response.numFound)
    citation_count_num = 0
    for n in papers.articles:
        citation_count_num += n.citation_count
    print(modifiers, full_query, results_count, citation_count_num)
    return dict(keywords=keywords, query=query, year=year, count=results_count, citation_count_num=citation_count_num)
DATA = {
    'LAMOST': ['LAMOST'],
    'SDSS': ['SDSS'],
    'SDSS_Official': ['"BOSS" OR "APOGEE" OR "eBOSS" OR "MARVELS" OR "MANGA" OR "SDSS" OR ("Sloan" AND "Survey")) OR '
                      'title:("BOSS" OR "APOGEE" OR "eBOSS" OR "MARVELS" OR "MANGA" OR "SDSS" OR ("Sloan" AND '
                      '"Survey")'],
    'SDSS Spectrum': ['SDSS Spectrum'],
}
filename = 'ADS_results1.csv'
years = []
for y in range(2022, 2023):
    years.append(str(y))
years.append('1994-2022')
if not os.path.exists(filename):
    results = pd.DataFrame([query_counts(keywords, query, year)
                            for keywords, queries in DATA.items()
                            for query in queries
                            for year in years])
    results.to_csv(filename, index=False)
Your UI filters show that you are filtering by requiring a publication type of "Article." Because this is a hierarchical filter, it's a little tricky to deal with. The publication type "Article" corresponds to the union of the following doctypes: article (for journal articles), eprint (for preprints), inbook (for book chapters), and inproceedings (for conference proceedings articles), so the proper query is doctype:(article OR eprint OR inbook OR inproceedings). The filter query accomplishes this by using a slightly different approach which is a bit cryptic, but FYI: doctype_facet_hier:0/Article. Either method will work fine for what you want.
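With this library, that could look something like the following (a sketch mirroring your query_counts call, with the doctype filter swapped in):

# Either doctype expression selects the same set of "Article" publication types.
filter_query = ['database:astronomy', 'property:refereed',
                'doctype:(article OR eprint OR inbook OR inproceedings)']
# or, equivalently, the hierarchical form: 'doctype_facet_hier:"0/Article"'
papers = ads.SearchQuery(q=full_query, fq=filter_query, token=token, sort="citation_count")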
Thank you very much! Your tips were very helpful to me and now work properly. Also, can you provide a use case (preferably code) for CursorMark?
You should never need to use CursorMark as it's a server-generated identifier that allows efficient pagination of results. This library (in the SearchQuery class) properly deals with managing CursorMark for you when follow-up queries are requested. If you want all results to be returned, simply set a high max_pages parameter and they will be fetched for you iteratively.
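For example (a sketch; pick rows and max_pages large enough for your result set):

# Let the client paginate: iterating over the query triggers the follow-up requests automatically.
papers = ads.SearchQuery(q=full_query, fq=filter_query, fl=['bibcode', 'citation_count'],
                         rows=2000, max_pages=100)
all_articles = list(papers)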
I've tried increasing the max_pages parameter, but I still only get 2000 results back, and I can't get correct citations for more than 2,000 articles. I still need your guidance, thank you very much! Here's the code I tried using cursorMark. My first modified code has been put into my ads_test repository.
papers = ads.SearchQuery(q=full_query, fq=filter_query, token=token, rows=2000, max_pages=100, sort="citation_count")
This is what I use and it works fine with pagination: https://gist.github.com/aaccomazzi/b205a41fcee5f31065816eb9f06f748a
For example:
$ python adsquery.py --pages 2000 --format csv SDSS > SDSS.csv
$ wc -l SDSS.csv
22692 SDSS.csv
For example, I want to get the number of all SDSS articles from 1994 to 2022, which can be achieved with SearchQuery and gives the correct result. This result clearly exceeds the maximum of 2000, and even with a large max_pages, only 2000 articles are visible in PyCharm's debugger. Therefore, I cannot accumulate the citations of all 10,419 articles, only the total citations of the first 2,000. I wanted to ask if I could use cursorMark instead of the while loop in the ads_test in my repository to add up the citations of the 10,419 articles. (ads_test: https://github.com/oneLuckyfish/ads_test/blob/main/ads_test_2.py) Thank you again for your patience!
The script in question does not set max_pages at all, which is the problem.
Use the logic implemented in my script and it will work, no need to use the start parameter because pagination is taken care of already.
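For instance, a minimal sketch of that pattern applied to your citation sum (not code from the gist; adjust rows and max_pages to the size of the result set):

# Iterate over every page of results and sum the citation counts.
papers = ads.SearchQuery(q=full_query, fq=filter_query, fl=['citation_count'],
                         rows=2000, max_pages=100)
citation_count_num = sum(p.citation_count or 0 for p in papers)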
Hello,
I am trying to extract citation statistics for various journals by running four queries looped over a range in years. I have learned that I need to run each of the "articles" and "zeroarticles" queries in two iterations to get around the (apparently) hardcoded limit of 2000 rows:
for yr in yearlist:
    articles = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,rows=2000))
    articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=2000,rows=2000))
    zeroarticles = list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,rows=3000))
    zeroarticles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 0]) +property:refereed -title:erratum",fl=fllist,start=2000,rows=2000))
Running this results in an index out of range error:
---> 37 articles+=list(ads.SearchQuery(q="(year:"+yr+" bibstem:"+journal+" AND citation_count:[0 TO 999990]) +property:refereed -title:erratum",fl=fllist,start=2000,rows=2000))
/Users/jmangum/anaconda/lib/python2.7/site-packages/ads/search.pyc in next(self)
    490
    491     def next(self):
--> 492         return self.__next__()
    493
    494     def __next__(self):
/Users/jmangum/anaconda/lib/python2.7/site-packages/ads/search.pyc in __next__(self)
    519         # extended .articles array.
    520         self.execute()
--> 521         cur = self._articles[self.__iter_counter]
    522
    523         self.__iter_counter += 1
IndexError: list index out of range
Is it possible that the package is not expecting the start parameter? Thanks.
-- Jeff
Environment Information:
torgo:Stats jmangum$ python -c "import ads; print(ads.__version__)"
0.12.3
torgo:Stats jmangum$ python -V
Python 2.7.14 :: Anaconda custom (64-bit)