borisveytsman / SoftwareImpactHackathon2023_Tracing_dependencies

Tracing the dependencies of open source software mentioned in the biomedical literature

Get Python code for Wikidata query to get source code repos of software used by a paper identified by uppercase-normalized DOI #1

Open Daniel-Mietchen opened 10 months ago

Daniel-Mietchen commented 10 months ago

We would like to provide a DOI (uppercase-normalized, by Wikidata convention) and get back the source code repos of any software that Wikidata records as having been used (via P4510) in the paper identified by that DOI.
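Since the lookup only matches when the DOI is in Wikidata's uppercase form, a small normalization step may help before querying. The following is a minimal sketch; the function name `normalize_doi` and the list of prefixes it strips are illustrative assumptions, not part of the original request.

```python
def normalize_doi(doi):
    """Return a DOI in the uppercase form used for matching against Wikidata.

    Hypothetical helper: the prefixes stripped here are an assumption about
    how DOIs are commonly written, not an exhaustive list.
    """
    doi = doi.strip()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        # Compare case-insensitively so "DOI:" and "doi:" are both handled.
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip().upper()


print(normalize_doi("https://doi.org/10.1371/journal.pone.0134894"))
# prints 10.1371/JOURNAL.PONE.0134894
```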

Daniel-Mietchen commented 10 months ago

Here is a Wikidata query for that:

#title: Source code repos of software used by a paper identified by uppercase-normalized DOI
SELECT ?paper ?repo WHERE {
  VALUES ?doi { "10.1371/JOURNAL.PONE.0134894"}
  ?paper wdt:P356 ?doi ;
         wdt:P4510 ?software .
  ?software wdt:P1324 ?repo .
}

It can be run directly on the Wikidata SPARQL endpoint or via the following Python snippet:

# pip install sparqlwrapper
# https://rdflib.github.io/sparqlwrapper/

import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """#title: Source code repos of software used by a paper identified by uppercase-normalized DOI
SELECT ?paper ?repo WHERE {
  VALUES ?doi { "10.1371/JOURNAL.PONE.0134894"}
  ?paper wdt:P356 ?doi ;
         wdt:P4510 ?software .
  ?software wdt:P1324 ?repo .
}"""

def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    print(result)
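
The snippet above hard-codes one DOI and prints the raw bindings. A possible variant that takes the DOI as a parameter and returns only the repo URLs is sketched below. The names `build_query`, `extract_repos` and `repos_for_doi` are hypothetical, and SPARQLWrapper is imported inside the function only so that the two pure helpers can be used without it.

```python
import sys

ENDPOINT_URL = "https://query.wikidata.org/sparql"


def build_query(doi):
    """Build the same query as above for an arbitrary DOI."""
    # Uppercase the DOI and escape characters that would break the string literal.
    literal = doi.strip().upper().replace("\\", "\\\\").replace('"', '\\"')
    return f"""SELECT ?paper ?repo WHERE {{
  VALUES ?doi {{ "{literal}" }}
  ?paper wdt:P356 ?doi ;
         wdt:P4510 ?software .
  ?software wdt:P1324 ?repo .
}}"""


def extract_repos(results):
    """Return the distinct ?repo values from a SPARQL JSON result, in order."""
    repos = []
    for binding in results["results"]["bindings"]:
        url = binding["repo"]["value"]
        if url not in repos:
            repos.append(url)
    return repos


def repos_for_doi(doi):
    """Run the query against the Wikidata endpoint (needs network access)."""
    from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

    # Placeholder user agent, as in the snippet above; adjust before real use.
    user_agent = "WDQS-example Python/%s.%s" % sys.version_info[:2]
    sparql = SPARQLWrapper(ENDPOINT_URL, agent=user_agent)
    sparql.setQuery(build_query(doi))
    sparql.setReturnFormat(JSON)
    return extract_repos(sparql.query().convert())
```

Calling `repos_for_doi("10.1371/journal.pone.0134894")` should then return a list of repo URLs, or an empty list if Wikidata has no matching statements.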

Daniel-Mietchen commented 10 months ago

Here is a query that gives the list of source code repos for which Wikidata has information about papers using the corresponding software.

Daniel-Mietchen commented 10 months ago

Here is a list of uppercase-normalized DOIs for which Wikidata knows at least some software that (1) has been used in the corresponding paper and (2) has its source code repo indicated in Wikidata.
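
One way such a list could be produced is to drop the `VALUES` clause from the per-DOI query and select the distinct DOIs instead. This is a sketch under that assumption and not necessarily the query behind the list; `build_doi_list_query` and `column` are hypothetical helper names.

```python
def build_doi_list_query(limit=None):
    """Query for all DOIs whose paper uses software with a known source code repo."""
    query = """SELECT DISTINCT ?doi WHERE {
  ?paper wdt:P356 ?doi ;
         wdt:P4510 ?software .
  ?software wdt:P1324 ?repo .
}
ORDER BY ?doi"""
    if limit is not None:
        # Optional cap on the number of rows, useful while testing.
        query += f"\nLIMIT {int(limit)}"
    return query


def column(results, name):
    """Collect one variable's values from a SPARQL JSON result."""
    return [b[name]["value"] for b in results["results"]["bindings"] if name in b]
```

The query string can be passed to the `get_results` function from the earlier snippet, and `column(results, "doi")` then yields the DOIs as plain strings.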

Daniel-Mietchen commented 10 months ago

Here is a list of uppercase-normalized DOIs for which Wikidata knows at least some software that (1) has been used in the corresponding paper and (2) has its CRAN repo indicated in Wikidata.

Daniel-Mietchen commented 10 months ago

Here is a list of CRAN packages for which Wikidata knows at least one paper having used the package, sorted by number of papers.
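
A ranking like this could be expressed as an aggregate query that groups by package and counts distinct papers. The sketch below assumes that the Wikidata property for CRAN packages is P5565; that ID is not stated in this thread and should be checked on Wikidata before use. The function name `build_usage_count_query` is likewise hypothetical.

```python
import re


def build_usage_count_query(package_property="P5565"):
    """Count the papers using each package, most-used first.

    The default P5565 is an ASSUMPTION (believed to be "CRAN project");
    verify the property ID on Wikidata before relying on it.
    """
    # Only accept well-formed property IDs, since the value is spliced into the query.
    if not re.fullmatch(r"P[1-9][0-9]*", package_property):
        raise ValueError(f"not a Wikidata property ID: {package_property!r}")
    return f"""SELECT ?package (COUNT(DISTINCT ?paper) AS ?papers) WHERE {{
  ?paper wdt:P4510 ?software .
  ?software wdt:{package_property} ?package .
}}
GROUP BY ?package
ORDER BY DESC(?papers)"""
```

Every package in the result has at least one paper by construction, because the pattern only matches software that some paper links to via P4510. The same shape should work for other package registries by passing a different property ID.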

Daniel-Mietchen commented 10 months ago

Here is a list of Bioconductor packages for which Wikidata knows at least one paper having used the package, sorted by number of papers.