Closed: ernestogimeno closed this issue 4 years ago
Calling Dimensions full_text_search throws this exception:

Traceback (most recent call last):
  File "C:/Users/Ernesto/PycharmProjects/RCGraph/federated_search.py", line 64, in main
    meta = api.full_text_search(search_term=search_terms, limit=limit)
  File "C:\Users\Ernesto\AppData\Local\Programs\Python\Python38-32\lib\site-packages\richcontext\scholapi\scholapi.py", line 374, in full_text_search
    self.login()
  File "C:\Users\Ernesto\AppData\Local\Programs\Python\Python38-32\lib\site-packages\richcontext\scholapi\scholapi.py", line 310, in login
    username=self.parent.config["DEFAULT"]["email"],
  File "C:\Users\Ernesto\AppData\Local\Programs\Python\Python38-32\lib\configparser.py", line 1254, in __getitem__
    raise KeyError(key)
KeyError: 'email'
It looks like PubMed implements full_text_search without a "limit" parameter
Traceback (most recent call last):
  File "C:/Users/Ernesto/PycharmProjects/RCGraph/federated_search.py", line 64, in main
    meta = api.full_text_search(search_term=search_terms, limit=limit)
TypeError: full_text_search() got an unexpected keyword argument 'limit'
For the PubMed full_text_search, here's the function definition in master:

def full_text_search (self, search_term, limit=None, exact_match=None):

Might be good to check with pip freeze | grep rich and see whether your Python environment is using this library from local repo source code or from a pip install?
Oh, the Dimensions requirement for an email config param -- I'll send you an update in Slack. That API call requires a username and password, where username is your email address.
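In the meantime, here's a quick way to sanity-check the config file. This is just a sketch: the filename and the password key name below are guesses, and only the email key is confirmed by the traceback above.

```python
# Minimal sketch: check that the [DEFAULT] section of the richcontext config
# file carries the credentials the Dimensions login path reads.
# NOTE: the config filename and the password key name are assumptions here;
# only the "email" key is confirmed by the traceback above.
import configparser

CONFIG_PATH = "rc.cfg"                              # assumed filename
REQUIRED_KEYS = ["email", "dimensions_password"]    # password key name assumed

config = configparser.ConfigParser()
config.read(CONFIG_PATH)

missing = [key for key in REQUIRED_KEYS if key not in config["DEFAULT"]]

if missing:
    print("missing keys in [DEFAULT]: {}".format(missing))
else:
    print("Dimensions credentials look present")
```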
I've added one test, to check that meta != None (a None value means "no results").
There's a problem with richcontext.scholapi in the current release:
Traceback (most recent call last):
  File "federated_search.py", line 159, in main
    meta, timing, message = api.full_text_search(search_term=search_terms, limit=limit)
  File "/opt/anaconda3/lib/python3.7/site-packages/richcontext/scholapi/scholapi.py", line 818, in full_text_search
    id_list = self._full_text_get_ids(search_term, limit)
  File "/opt/anaconda3/lib/python3.7/site-packages/richcontext/scholapi/scholapi.py", line 795, in _full_text_get_ids
    if limit != None and limit > 0 and isinstance(limit, int):
TypeError: '>' not supported between instances of 'str' and 'int'
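The comparison blows up because limit arrives as a string from the command line. A minimal sketch of the kind of guard that avoids it (validate/coerce before the > 0 test) -- not necessarily how it should be fixed in the library:

```python
# Minimal sketch: coerce/validate limit before comparing it to an int, so a
# string value coming from sys.argv doesn't hit the TypeError shown above.
def normalize_limit (limit):
    if limit is None:
        return None

    try:
        limit = int(limit)      # accept "150" as well as 150
    except (TypeError, ValueError):
        return None

    return limit if limit > 0 else None
```

Alternatively, federated_search.py could just cast its command-line arg with int() before passing it along.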
PubMed exception calling full_text_search
Didn't show up in the unit tests... I didn't write the PubMed access, but I think I can fix that quickly tomorrow. For now, it's a way to test exception handling :)
Sometimes PubMed returns titles in this form (normally it's a str):
'ArticleTitle': { 'i': 'Oncorhynchus mykiss', '#text': 'Temporal Dynamics of DNA Methylation Patterns in Response to Rearing Juvenile Steelhead () in a Hatchery versus Simulated Stream Environment.' }
In this example, the actual title is "Temporal Dynamics of DNA Methylation Patterns in Response to Rearing Juvenile Steelhead (Oncorhynchus mykiss) in a Hatchery versus Simulated Stream Environment." (DOI: '10.3390/genes10050356')
@ceteri Is it worth looking into this in detail? For now, the parser is extracting only the '#text' field.
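For reference, a rough sketch of how that dict case could be flattened back into a plain title. This assumes xmltodict-style output where an inline italic element got split out of the title text, and it only handles the single-fragment pattern shown above:

```python
# Rough sketch, assuming the dict comes from xmltodict splitting an inline <i>
# element out of the title. Re-inserting the italic fragment into the empty
# "()" recovers the full title for the example above; a general fix would need
# checking against more PubMed payloads.
def flatten_article_title (article_title):
    if isinstance(article_title, str):
        return article_title

    if isinstance(article_title, dict):
        text = article_title.get("#text", "")
        italic = article_title.get("i")

        if italic and "()" in text:
            return text.replace("()", "({})".format(italic), 1)

        return text     # fall back to the plain-text field, as the parser does now

    return str(article_title)
```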
The latest results look really good. Here's one that I ran:
$ python federated_search.py "sea level inundation" 10
terms sea level inundation
limit 10
4400 publications
3129 known DOIs
3647 known titles
PubMed implements full_text_search
OpenAIRE implements full_text_search
Dimensions implements full_text_search
Semantic Scholar implements full_text_search
dissemin implements full_text_search
SSRN implements full_text_search
EuropePMC implements full_text_search
RePEc implements full_text_search
#known_hits 1
#new_overlapped_hits 2
#new_unique_hits 21
Then the three dictionaries were output as a JSON file:
How does that look?
For next steps, in the "overlap" category, we could still probably fold those together more?
I removed the duplicates in the "overlap" category; for now, only when all fields are an exact duplicate.
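Roughly, that dedup step looks like this (sketched against the entry shape shown further down -- api/doi/title/url -- rather than the actual federated_search.py code):

```python
# Minimal sketch: drop entries from the "overlap" list only when every field
# (api, doi, title, url) is an exact duplicate of one already seen.
def dedup_exact (hits):
    seen = set()
    unique = []

    for hit in hits:
        key = tuple(sorted(hit.items()))

        if key not in seen:
            seen.add(key)
            unique.append(hit)

    return unique
```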
I ran this to get more results:
federated_search.py "sea level inundation" 150
terms sea level inundation
limit 150
4400 publications
3129 known DOIs
3647 known titles
PubMed implements full_text_search
OpenAIRE implements full_text_search
Dimensions implements full_text_search
Semantic Scholar implements full_text_search
dissemin implements full_text_search
SSRN implements full_text_search
EuropePMC implements full_text_search
RePEc implements full_text_search
#known_hits 1
#new_overlapped_hits 25
#new_unique_hits 265
Here is the output: federated.txt
How would you prefer to handle these cases?
1)
{
"api": "openaire",
"doi": "10.1007/s11852-018-0605-1",
"title": "Adding to the toolbox for tidal-inundation mapping in estuarine areas",
"url": "https://europepmc.org/articles/PMC6416087/"
},
{
"api": "pubmed",
"doi": "10.1007/s11852-018-0605-1",
"title": "Adding to the toolbox for tidal-inundation mapping in estuarine areas.",
"url": "https://www.ncbi.nlm.nih.gov/pubmed/30881203"
}
2)
{
"api": "openaire",
"doi": "10.1007/s00367-012-0317-8",
"title": "The importance of the vertical accuracy of digital elevation models in gauging inundation by sea level rise along the Valdelagrana beach and marshes (Bay of Cádiz, SW Spain)",
"url": "http://dx.doi.org/10.1007/s00367-012-0317-8"
},
{
"api": "openaire",
"doi": "10.1007/s00367-012-0317-8",
"title": "The importance of the vertical accuracy of digital elevation models in gauging inundation by sea level rise along the Valdegrana beach and marshes (Bay of Cádiz, SW Spain)",
"url": "https://idus.us.es/xmlui/handle/11441/43874"
}
Great, that's looking really good!
Here's a code fragment for comparing a set of titles with minor variations: https://github.com/Coleridge-Initiative/RCApi/blob/master/richcontext/scholapi/scholapi.py#L954

That should probably be close enough for these kinds of cases where there's only a punctuation change or minor misspelling? The threshold value of 0.9 may need to be adjusted. It can probably be set higher?
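Something along these lines would work for that; this is just a difflib sketch standing in for the linked fragment, not a copy of it:

```python
# Sketch of a similarity test for titles that differ only by punctuation or a
# minor misspelling; difflib is a stand-in for whatever the linked fragment
# actually uses, and the 0.9 threshold is the value discussed above.
from difflib import SequenceMatcher

def titles_match (title_a, title_b, threshold=0.9):
    a = title_a.strip().lower()
    b = title_b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

On case 2 above ("Valdelagrana" vs. "Valdegrana") the ratio comes out well above 0.9, so those two entries would fold together.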
At line 65: from there, you can compare with the known_title and known_doi sets to see whether this publication is already in the corpus. If not, then you can begin to group by doi and title.
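A rough sketch of that flow, with names (known_doi, known_title) following the discussion rather than the actual federated_search.py code:

```python
# Rough sketch: skip publications already in the corpus, then bucket the rest
# by DOI, or by normalized title when an entry has no DOI.
from collections import defaultdict

def group_new_hits (hits, known_doi, known_title):
    groups = defaultdict(list)

    for hit in hits:
        doi = hit.get("doi")
        title = hit.get("title", "").strip().lower()

        if doi in known_doi or title in known_title:
            continue    # already in the corpus

        key = doi if doi else title
        groups[key].append(hit)

    return groups
```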