Coleridge-Initiative / RCGraph

Rich Context knowledge graph management
https://rc.coleridgeinitiative.org/?radius=3&entity=NOAA
Creative Commons Zero v1.0 Universal

Create a federated search script #47

Closed. ernestogimeno closed this issue 4 years ago

ceteri commented 4 years ago

At line 65,

if "title" in meta:
    title = meta["title"]

if "doi" in meta:
    doi = graph.publications.verify_doi(meta["doi"])

From there, you can compare with the known_title and known_doi sets to see if this publication is already in the corpus.

If not, then you can begin to group by doi and title.
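A minimal sketch of that comparison and grouping, assuming each search hit has already been reduced to a dict with a "title" and an optional "doi". The names known_doi and known_title follow the comment above; classify_hits and the three returned groups are hypothetical names used only for illustration, not the script's actual code.

```python
from collections import defaultdict


def classify_hits(hits, known_doi, known_title):
    """Split search hits into known, overlapped, and unique groups.

    `hits` is a list of dicts with "title" and optionally "doi".
    All names here are illustrative assumptions.
    """
    known = []
    grouped = defaultdict(list)

    for hit in hits:
        doi = hit.get("doi")
        title = hit.get("title")

        if (doi and doi in known_doi) or (title and title in known_title):
            # already in the corpus
            known.append(hit)
        else:
            # group by DOI when there is one, otherwise by title
            key = ("doi", doi) if doi else ("title", title)
            grouped[key].append(hit)

    # a group with several hits was returned by more than one search
    overlapped = {k: v for k, v in grouped.items() if len(v) > 1}
    unique = {k: v for k, v in grouped.items() if len(v) == 1}
    return known, overlapped, unique
```

Grouping by DOI first and falling back to title is one possible choice; hits without a DOI could also be matched against titles of hits that do have one.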

ernestogimeno commented 4 years ago

Calling Dimensions full_text_search throws this exception:

Traceback (most recent call last):
  File "C:/Users/Ernesto/PycharmProjects/RCGraph/federated_search.py", line 64, in main
    meta = api.full_text_search(search_term=search_terms, limit=limit)
  File "C:\Users\Ernesto\AppData\Local\Programs\Python\Python38-32\lib\site-packages\richcontext\scholapi\scholapi.py", line 374, in full_text_search
    self.login()
  File "C:\Users\Ernesto\AppData\Local\Programs\Python\Python38-32\lib\site-packages\richcontext\scholapi\scholapi.py", line 310, in login
    username=self.parent.config["DEFAULT"]["email"],
  File "C:\Users\Ernesto\AppData\Local\Programs\Python\Python38-32\lib\configparser.py", line 1254, in __getitem__
    raise KeyError(key)
KeyError: 'email'

ernestogimeno commented 4 years ago

It looks like PubMed implements full_text_search without a "limit" parameter

Traceback (most recent call last):
  File "C:/Users/Ernesto/PycharmProjects/RCGraph/federated_search.py", line 64, in main
    meta = api.full_text_search(search_term=search_terms, limit=limit)
TypeError: full_text_search() got an unexpected keyword argument 'limit'

ceteri commented 4 years ago

For the PubMed full_text_search here's the function definition in master:

def full_text_search (self, search_term, limit=None, exact_match=None):

It might be good to run pip freeze | grep rich and check whether your Python environment is using this library from local repo source code or from a pip install.
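As a complementary check from inside Python, here is a small sketch that reports which file a module would be imported from. The helper name module_origin is hypothetical; importlib.util.find_spec is part of the standard library.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # raised when a parent package of a dotted name is missing
        return None
    return spec.origin if spec is not None else None


# Example: module_origin("richcontext.scholapi")
# A path under site-packages suggests a regular pip install;
# a path inside a repo checkout suggests local source code.
```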

ceteri commented 4 years ago

Oh, the Dimensions requirement for an email config param -- I'll send you an update in Slack. That API call requires a username and password, where username is your email address.
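A rough sketch of checking for the required settings before calling the API, so that a missing value is reported up front instead of surfacing as a KeyError during login. The "email" key in the DEFAULT section comes from the traceback above; the "password" key name and the missing_keys helper are assumptions made for illustration.

```python
import configparser


def missing_keys(config, required, section="DEFAULT"):
    """Return the required keys that are absent from a config section."""
    return [key for key in required if key not in config[section]]


config = configparser.ConfigParser()
# placeholder contents; a real config would be loaded with config.read(path)
config.read_string("[DEFAULT]\nemail = user@example.org\n")

# "password" is an assumed key name, used here only as an example
print(missing_keys(config, ["email", "password"]))  # prints ['password']
```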

ceteri commented 4 years ago

I've added one test, to check that meta is not None (a value of None means "no results").

There's a problem with richcontext.scholapi in the current release:

Traceback (most recent call last):
  File "federated_search.py", line 159, in main
    meta, timing, message = api.full_text_search(search_term=search_terms, limit=limit)
  File "/opt/anaconda3/lib/python3.7/site-packages/richcontext/scholapi/scholapi.py", line 818, in full_text_search
    id_list = self._full_text_get_ids(search_term, limit)
  File "/opt/anaconda3/lib/python3.7/site-packages/richcontext/scholapi/scholapi.py", line 795, in _full_text_get_ids
    if limit != None and limit > 0 and isinstance(limit, int):
TypeError: '>' not supported between instances of 'str' and 'int'
PubMed exception calling full_text_search

Didn't show up in the unit tests... I didn't write the PubMed access, but I think I can fix that quickly tomorrow. For now, it's a way to test exception handling :)
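The comparison fails because limit arrives as a str (presumably taken straight from the command line) and limit > 0 is evaluated before the isinstance check. One possible workaround on the caller's side is to convert the value before passing it on. This is a sketch only; parse_limit is a hypothetical helper, not part of richcontext.scholapi.

```python
def parse_limit(raw):
    """Convert a command-line limit to a positive int, or None for no limit."""
    if raw is None:
        return None
    try:
        limit = int(raw)
    except (TypeError, ValueError):
        # not a number at all, so treat it as "no limit"
        return None
    return limit if limit > 0 else None
```

Inside the library, checking isinstance(limit, int) before comparing with 0 would avoid the TypeError as well.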

ernestogimeno commented 4 years ago

Sometimes PubMed returns titles in this form (normally the title is a str):

'ArticleTitle': { 'i': 'Oncorhynchus mykiss', '#text': 'Temporal Dynamics of DNA Methylation Patterns in Response to Rearing Juvenile Steelhead () in a Hatchery versus Simulated Stream Environment.' }

In this example, the actual title is "Temporal Dynamics of DNA Methylation Patterns in Response to Rearing Juvenile Steelhead (Oncorhynchus mykiss) in a Hatchery versus Simulated Stream Environment" (DOI: '10.3390/genes10050356')

@ceteri Is it worth looking into this in detail? For now, the parser extracts only the '#text' field.
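A minimal sketch of the current behavior, handling both the usual str and the dict form shown above. The function name extract_title is hypothetical. In this dict shape the position of the italic fragment within the title is not recorded, so the sketch does not try to reinsert it.

```python
def extract_title(article_title):
    """Return a plain-text title from a str or a dict-shaped ArticleTitle."""
    if isinstance(article_title, dict):
        # markup children such as "i" are dropped; only "#text" is kept
        return article_title.get("#text")
    return article_title
```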

ceteri commented 4 years ago

The latest results look really good. Here's one that I ran:

$ python federated_search.py "sea level inundation" 10 
terms sea level inundation
limit 10
4400 publications
3129 known DOIs
3647 known titles
PubMed implements full_text_search
OpenAIRE implements full_text_search
Dimensions implements full_text_search
Semantic Scholar implements full_text_search
dissemin implements full_text_search
SSRN implements full_text_search
EuropePMC implements full_text_search
RePEc implements full_text_search
#known_hits 1
#new_overlapped_hits 2
#new_unique_hits 21

The three dictionaries were then output as a JSON file:

federated.txt

How does that look?

For next steps, in the "overlap" category, we could still probably fold those together more?
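One way to fold overlapping hits together could look like the sketch below, which merges records that share a DOI into a single entry. It assumes each record is a dict with "api", "doi", "title", and "url" keys; fold_by_doi and the plural field names in the merged entry are hypothetical.

```python
def fold_by_doi(records):
    """Merge records that share a DOI into one entry per DOI."""
    folded = {}
    for rec in records:
        entry = folded.setdefault(
            rec["doi"],
            {"doi": rec["doi"], "titles": [], "apis": [], "urls": []},
        )
        for field, value in (("titles", rec.get("title")),
                             ("apis", rec.get("api")),
                             ("urls", rec.get("url"))):
            # keep each distinct value once, in first-seen order
            if value is not None and value not in entry[field]:
                entry[field].append(value)
    return list(folded.values())
```

Keeping every title variant, rather than picking one, leaves the choice of a canonical title to a later step.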

ernestogimeno commented 4 years ago

I removed the duplicates in the "overlap" category, for now only when all fields are exact duplicates.
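A sketch of how exact duplicates could be removed, assuming each record is a JSON-serializable dict. The helper name drop_exact_duplicates is hypothetical and this is not necessarily how the script does it.

```python
import json


def drop_exact_duplicates(records):
    """Remove records whose fields are all identical, keeping first occurrences."""
    seen = set()
    result = []
    for rec in records:
        # a canonical JSON string makes the dict hashable and key-order independent
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result
```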

I ran this to get more results:

federated_search.py "sea level inundation" 150
terms sea level inundation
limit 150
4400 publications
3129 known DOIs
3647 known titles
PubMed implements full_text_search
OpenAIRE implements full_text_search
Dimensions implements full_text_search
Semantic Scholar implements full_text_search
dissemin implements full_text_search
SSRN implements full_text_search
EuropePMC implements full_text_search
RePEc implements full_text_search
#known_hits 1
#new_overlapped_hits 25
#new_unique_hits 265

Here is the output: federated.txt

How would you prefer to handle these cases?

1)

        {
            "api": "openaire",
            "doi": "10.1007/s11852-018-0605-1",
            "title": "Adding to the toolbox for tidal-inundation mapping in estuarine areas",
            "url": "https://europepmc.org/articles/PMC6416087/"
        },
        {
            "api": "pubmed",
            "doi": "10.1007/s11852-018-0605-1",
            "title": "Adding to the toolbox for tidal-inundation mapping in estuarine areas.",
            "url": "https://www.ncbi.nlm.nih.gov/pubmed/30881203"
        }

2)

        {
            "api": "openaire",
            "doi": "10.1007/s00367-012-0317-8",
            "title": "The importance of the vertical accuracy of digital elevation models in gauging inundation by sea level rise along the Valdelagrana beach and marshes (Bay of Cádiz, SW Spain)",
            "url": "http://dx.doi.org/10.1007/s00367-012-0317-8"
        },
        {
            "api": "openaire",
            "doi": "10.1007/s00367-012-0317-8",
            "title": "The importance of the vertical accuracy of digital elevation models in gauging inundation by sea level rise along the Valdegrana beach and marshes (Bay of Cádiz, SW Spain)",
            "url": "https://idus.us.es/xmlui/handle/11441/43874"
        }

ceteri commented 4 years ago

Great, that's looking really good!

Here's a code fragment for comparing a set of titles with minor variations: https://github.com/Coleridge-Initiative/RCApi/blob/master/richcontext/scholapi/scholapi.py#L954

That should probably be close enough for these kinds of cases where there's only a punctuation change or minor misspelling?

The threshold value of 0.9 may need to be adjusted. It can probably be set higher?
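To illustrate the idea of a similarity threshold, here is a small sketch using difflib from the Python standard library. This is a stand-in and not the code at the link above; titles_match is a hypothetical name, and the default of 0.9 is simply the threshold value mentioned here.

```python
from difflib import SequenceMatcher


def titles_match(title_a, title_b, threshold=0.9):
    """Return True when two titles are similar enough to treat as the same."""
    a = title_a.strip().lower()
    b = title_b.strip().lower()
    # autojunk=False disables the heuristic that skips frequent characters
    # in long strings
    ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
    return ratio >= threshold
```

Titles that differ only by a trailing period or a dropped syllable score well above 0.9 with this measure, so a higher threshold would still catch them while rejecting more near misses.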