How these metrics are gathered for the monthly reports
At the beginning of each month @jggautier runs a Python script that:
- uses Dataverse's Native and Search APIs to gather, for each dataset published in the previous month whose metadata includes the names or acronyms of NIH centers and institutes (see Search details for more information), its persistent ID, publication month, storage size of the latest published version, and the metadata fields that matched
- scrapes each dataset's page to get its file download count
- uses the DataCite API to get each published dataset's citation count
- creates a CSV file with the following information for each published dataset: PID URL, publication month, citation count, file download count, storage size, and where NIH institute names and acronyms appear in the metadata
- reports the PIDs of any datasets that were removed (datasets in the previous month's report that aren't in the current month's) and datasets that were added (datasets in the current month's report that aren't in the previous month's)
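The added/removed comparison in the last step can be sketched as a set difference over the PID columns of two monthly CSV files. This is a minimal illustration, not the actual script; the column name `pid_url` and the example DOIs are hypothetical.

```python
import csv
import io

def read_pids(csv_text):
    """Return the set of PID URLs listed in a monthly report CSV."""
    return {row["pid_url"] for row in csv.DictReader(io.StringIO(csv_text))}

# Two small in-memory stand-ins for last month's and this month's reports.
previous_report = "pid_url\ndoi:10.7910/DVN/AAAA\ndoi:10.7910/DVN/BBBB\n"
current_report = "pid_url\ndoi:10.7910/DVN/BBBB\ndoi:10.7910/DVN/CCCC\n"

previous = read_pids(previous_report)
current = read_pids(current_report)

removed = previous - current  # in the previous report but not the current one
added = current - previous    # in the current report but not the previous one
```

With real reports, `read_pids` would read each file from disk instead of a string, but the set arithmetic is the same.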
@jggautier then reviews any datasets that were included in previous months' reports but have been removed, reviews the metadata of newly added datasets to confirm there's actually some indication of NIH funding, removes any datasets that aren't from NIH-funded research, and adjusts the script so that those datasets are ignored in future runs. The script is also adjusted to include datasets that @jggautier and colleagues know were funded by the NIH but whose metadata lacks any such indication.
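These manual adjustments amount to maintaining two curated lists that override the search results. A minimal sketch of that idea, assuming hypothetical list names and DOIs (the actual script may store these differently):

```python
# Datasets that matched the search but were judged not NIH-funded on review.
ALWAYS_EXCLUDE = {"doi:10.7910/DVN/XXXX"}

# Datasets known to be NIH-funded whose metadata lacks any indication of it.
ALWAYS_INCLUDE = {"doi:10.7910/DVN/YYYY"}

def apply_curation(found_pids):
    """Apply the manually curated overrides to a set of search results."""
    return (set(found_pids) - ALWAYS_EXCLUDE) | ALWAYS_INCLUDE

curated = apply_curation({"doi:10.7910/DVN/XXXX", "doi:10.7910/DVN/ZZZZ"})
```

Keeping the overrides as data rather than edits scattered through the code makes each month's adjustments easy to review.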
Search details
The Python script uses the Search API to look across four metadata fields - Funding Information Agency, Contributor Name, Description, and Notes - for the full name of the NIH and its acronym and the full names of all NIH centers and institutes and most of their acronyms.
When looking through metadata in the Description and Notes fields, the script also requires variations of the words "fund", "sponsor", "award", and "support" to appear, which reduces the chance of matching datasets that mention an NIH center or institute without actually acknowledging NIH funding.
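One way such a query could be assembled is shown below. This is only a sketch: the Solr-style field names (`grantNumberAgency`, `contributorName`, `dsDescriptionValue`) and the short institute list are assumptions, not necessarily what the actual script or the Dataverse search index uses.

```python
# Hypothetical fragments of a Search API query. Quoted phrases match the
# full names; bare tokens match the acronyms.
institutes = [
    '"National Institutes of Health"', "NIH",
    '"National Cancer Institute"', "NCI",
]
# Trailing wildcards catch variations: "funded", "funding", "supported", etc.
funding_words = ["fund*", "sponsor*", "award*", "support*"]

names = " OR ".join(institutes)
words = " OR ".join(funding_words)

# The agency and contributor fields match on the name alone; the description
# field additionally requires a funding-related word to appear.
query = (
    f"grantNumberAgency:({names}) "
    f"OR contributorName:({names}) "
    f"OR (dsDescriptionValue:({names}) AND dsDescriptionValue:({words}))"
)
```

The assembled string would then be sent as the `q` parameter of a Search API request.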
Overview
Tracking issue for monthly reports of NIH-funded datasets in Harvard Dataverse.
Resources