fabiobatalha / crossrefapi

A python library that implements the Crossref API.
BSD 2-Clause "Simplified" License

Unexpected query output #35

Closed: OBrink closed this issue 3 years ago

OBrink commented 3 years ago

When trying to retrieve information via simple queries, I consistently got outputs that I did not expect. Specifically, the publications which are referred to by the keywords are not returned in the result of the query. I do however get a return with the right publication data via a manual HTTP GET request.

Example code:

from crossref.restful import Works 

keyword = 'Albert Einstein Elektrodynamik bewegter Körper'

works = Works()
result = works.query(keyword)
for entry in result:
    print(entry)
    break
>> {'indexed': {'date-parts': [[2019, 11, 19]], 'date-time': '2019-11-19T19:11:52Z', 'timestamp': 1574190712445}, 'reference-count': 0, 'publisher': 'Maney Publishing', 'issue': '1', 'content-domain': {'domain': [], 'crossmark-restriction': False}, 'short-container-title': ['Journal of the American Institute for Conservation'], 'published-print': {'date-parts': [[1980]]}, 'DOI': '10.2307/3179679', 'type': 'journal-article', 'created': {'date-parts': [[2006, 4, 18]], 'date-time': '2006-04-18T05:15:34Z', 'timestamp': 1145337334000}, 'page': '21', 'source': 'Crossref', 'is-referenced-by-count': 0, 'title': ['A Semi-Rigid Transparent Support for Paintings Which Have Both Inscriptions on Their Fabric Reverse and Acute Planar Distortions'], 'prefix': '10.1179', 'volume': '20', 'author': [{'given': 'Albert', 'family': 'Albano', 'sequence': 'first', 'affiliation': []}], 'member': '138', 'container-title': ['Journal of the American Institute for Conservation'], 'deposited': {'date-parts': [[2015, 6, 26]], 'date-time': '2015-06-26T01:05:23Z', 'timestamp': 1435280723000}, 'score': 4.5581737, 'issued': {'date-parts': [[1980]]}, 'references-count': 0, 'journal-issue': {'published-print': {'date-parts': [[1980]]}, 'issue': '1'}, 'URL': 'http://dx.doi.org/10.2307/3179679', 'ISSN': ['0197-1360'], 'issn-type': [{'value': '0197-1360', 'type': 'print'}]}

Other keywords give me the same kind of output, unrelated to what I searched for. I have tried changing the order of the result [result.order('desc')], but that does not seem to change anything.

When I then do the same request via HTTP GET and the normal API URL, I get the expected output as the first result:

import requests

keyword = 'Albert Einstein Elektrodynamik bewegter Körper'

keyword = '+'.join(keyword.split())
url = 'https://api.crossref.org/works?query=' + keyword
result = requests.get(url = url)
# Take first result
result = result.json()['message']['items'][0]
print(result)

>> {'indexed': {'date-parts': [[2020, 5, 25]], 'date-time': '2020-05-25T14:23:45Z', 'timestamp': 1590416625775}, 'publisher-location': 'Wiesbaden', 'reference-count': 0, 'publisher': 'Vieweg+Teubner Verlag', 'isbn-type': [{'value': '9783663193722', 'type': 'print'}, {'value': '9783663195108', 'type': 'electronic'}], 'content-domain': {'domain': [], 'crossmark-restriction': False}, 'published-print': {'date-parts': [[1923]]}, 'DOI': '10.1007/978-3-663-19510-8_3', 'type': 'book-chapter', 'created': {'date-parts': [[2013, 12, 6]], 'date-time': '2013-12-06T02:08:43Z', 'timestamp': 1386295723000}, 'page': '26-50', 'source': 'Crossref', 'is-referenced-by-count': 5, 'title': ['Zur Elektrodynamik bewegter Körper'], 'prefix': '10.1007', 'author': [{'given': 'A.', 'family': 'Einstein', 'sequence': 'first', 'affiliation': []}], 'member': '297', 'container-title': ['Das Relativitätsprinzip'], 'link': [{'URL': 'http://link.springer.com/content/pdf/10.1007/978-3-663-19510-8_3', 'content-type': 'unspecified', 'content-version': 'vor', 'intended-application': 'similarity-checking'}], 'deposited': {'date-parts': [[2013, 12, 6]], 'date-time': '2013-12-06T02:08:45Z', 'timestamp': 1386295725000}, 'score': 53.638336, 'issued': {'date-parts': [[1923]]}, 'ISBN': ['9783663193722', '9783663195108'], 'references-count': 0, 'URL': 'http://dx.doi.org/10.1007/978-3-663-19510-8_3'}

The output that I have retrieved with the tool in this repository has nothing to do with my query keyword. Do you have an idea about how I can fix this? I would be very grateful for every kind of help.

fabiobatalha commented 3 years ago

The difference between your approach and this library's is that the library adds extra parameters to the query so that users can download every document matching it.

Both approaches match the same total of 290890 documents. You can confirm this by requesting both URLs and looking at the total-results attribute.
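A minimal sketch of that check, using only the Python standard library. The helper names build_url, total_results and main are hypothetical and not part of crossrefapi; the extra parameters are the ones visible in the library's URL in this thread. The counts returned by the live API change over time, so no expected numbers are shown.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.crossref.org/works"
QUERY = "Albert Einstein Elektrodynamik bewegter Körper"


def build_url(query, **extra):
    """Return a /works URL for `query` plus any extra parameters (hypothetical helper)."""
    params = {"query": query, **extra}
    return BASE + "?" + urllib.parse.urlencode(params)


def total_results(url):
    """Fetch `url` and return the total-results field of the JSON response."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)["message"]["total-results"]


# URL from the manual request, and the URL with the parameters the library adds.
plain_url = build_url(QUERY)
library_url = build_url(QUERY, cursor="*", rows=100)


def main():
    # Requires network access; both calls should report the same total.
    print(plain_url, total_results(plain_url))
    print(library_url, total_results(library_url))
```

Calling main() prints each URL next to its total-results value, which makes it easy to see that only the ordering of the items differs, not the set of matches.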

Library: https://api.crossref.org/works?query=Albert+Einstein+Elektrodynamik+bewegter+K%C3%B6rper&cursor=%2A&rows=100
Your approach: https://api.crossref.org/works?query=Albert+Einstein+Elektrodynamik+bewegter+K%C3%B6rper

As you can see, the URLs differ only in the parameters rows=100 and cursor=*, where:

fabiobatalha commented 3 years ago

I've opened a question in the Crossref API repository: https://github.com/CrossRef/rest-api-doc/issues/557

OBrink commented 3 years ago

Thank you for the quick reply! For now, I will keep working without the cursor parameter in my requests.

Ankush-Chander commented 3 years ago

Hey @OBrink,

As @fabiobatalha rightly pointed out, the library adds cursor=* to the URL, which changes the order of the results.

You can get the desired result by applying .sort("relevance") as follows:

from crossref.restful import Works 

keyword = 'Albert Einstein Elektrodynamik bewegter Körper'

works = Works()
result = works.query(keyword).sort("relevance")
for entry in result:
    print(entry)
    break
>> {'indexed': {'date-parts': [[2020, 5, 25]], 'date-time': '2020-05-25T14:23:45Z', 'timestamp': 1590416625775}, 'publisher-location': 'Wiesbaden', 'reference-count': 0, 'publisher': 'Vieweg+Teubner Verlag', 'isbn-type': [{'value': '9783663193722', 'type': 'print'}, {'value': '9783663195108', 'type': 'electronic'}], 'content-domain': {'domain': [], 'crossmark-restriction': False}, 'published-print': {'date-parts': [[1923]]}, 'DOI': '10.1007/978-3-663-19510-8_3', 'type': 'book-chapter', 'created': {'date-parts': [[2013, 12, 6]], 'date-time': '2013-12-06T02:08:43Z', 'timestamp': 1386295723000}, 'page': '26-50', 'source': 'Crossref', 'is-referenced-by-count': 5, 'title': ['Zur Elektrodynamik bewegter Körper'], 'prefix': '10.1007', 'author': [{'given': 'A.', 'family': 'Einstein', 'sequence': 'first', 'affiliation': []}], 'member': '297', 'container-title': ['Das Relativitätsprinzip'], 'link': [{'URL': 'http://link.springer.com/content/pdf/10.1007/978-3-663-19510-8_3', 'content-type': 'unspecified', 'content-version': 'vor', 'intended-application': 'similarity-checking'}], 'deposited': {'date-parts': [[2013, 12, 6]], 'date-time': '2013-12-06T02:08:45Z', 'timestamp': 1386295725000}, 'score': 53.646687, 'issued': {'date-parts': [[1923]]}, 'ISBN': ['9783663193722', '9783663195108'], 'references-count': 0, 'URL': 'http://dx.doi.org/10.1007/978-3-663-19510-8_3'}
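For comparison, here is a rough sketch of the equivalent raw request built with the standard library. It assumes that .sort("relevance") simply adds sort=relevance next to the cursor=* parameter mentioned above; the helper names relevance_url and first_title are hypothetical, and the exact URL the library generates may order or set its parameters differently.

```python
import json
import urllib.parse
import urllib.request


def relevance_url(query, rows=1):
    """Build a /works URL that keeps the cursor but sorts by relevance (assumed parameters)."""
    params = {"query": query, "sort": "relevance", "cursor": "*", "rows": rows}
    return "https://api.crossref.org/works?" + urllib.parse.urlencode(params)


def first_title(url):
    """Fetch `url` and return the title list of the first item, or None if there are no items."""
    with urllib.request.urlopen(url) as response:
        items = json.load(response)["message"]["items"]
    return items[0].get("title") if items else None
```

Calling first_title(relevance_url('Albert Einstein Elektrodynamik bewegter Körper')) requires network access and should return the title of the top-scoring match.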

I hope that serves your purpose.

Thanks, Ankush

OBrink commented 3 years ago

@Ankush-Chander Thank you very much! That helps me get exactly what I need.