How these metrics are gathered for the monthly reports
At the beginning of each month @jggautier runs a Python script that:
- uses Dataverse's Native and Search APIs to gather, for each dataset published in the previous month whose metadata includes the names or acronyms of NIH centers and institutes (see Search details for more information), its persistent ID, publication month, storage size of the latest published version, and the metadata fields that matched
- scrapes each dataset's page to get its file download count
- uses the DataCite API to get each published dataset's citation count
- creates a CSV file with the following information for each published dataset: PID URL, publication month, citation count, file download count, storage size, and where NIH institute names and acronyms appear in the metadata
- reports the PIDs of any datasets that were removed (datasets in the previous month's report that aren't in the current month's) and datasets that were added (datasets in the current month's report that aren't in the previous month's)
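The added/removed comparison in the last step can be sketched as a set difference over the PID columns of two monthly CSV files. This is a minimal illustration, not the actual script; the column name `pid_url` and the example DOIs are hypothetical.

```python
import csv
import io

def read_pids(csv_text):
    """Return the set of PID URLs listed in a monthly report CSV."""
    return {row["pid_url"] for row in csv.DictReader(io.StringIO(csv_text))}

# Two small in-memory stand-ins for last month's and this month's reports.
previous_report = "pid_url\ndoi:10.7910/DVN/AAAA\ndoi:10.7910/DVN/BBBB\n"
current_report = "pid_url\ndoi:10.7910/DVN/BBBB\ndoi:10.7910/DVN/CCCC\n"

previous = read_pids(previous_report)
current = read_pids(current_report)

removed = previous - current  # in the previous report but not the current one
added = current - previous    # in the current report but not the previous one
```

With real reports, `read_pids` would read each file from disk instead of a string, but the set arithmetic is the same.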
@jggautier then reviews any datasets that were included in previous months' reports but have been removed, reviews the metadata of newly added datasets to confirm there's actually some indication of NIH funding, removes any datasets that aren't from NIH-funded research, and adjusts the script so that those datasets are ignored in future runs. The script is also adjusted to include datasets that @jggautier and colleagues know were funded by the NIH but whose metadata lacks any such indication.
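These manual adjustments amount to maintaining two curated lists that override the search results. A minimal sketch of that idea, assuming hypothetical list names and DOIs (the actual script may store these differently):

```python
# Datasets that matched the search but were judged not NIH-funded on review.
ALWAYS_EXCLUDE = {"doi:10.7910/DVN/XXXX"}

# Datasets known to be NIH-funded whose metadata lacks any indication of it.
ALWAYS_INCLUDE = {"doi:10.7910/DVN/YYYY"}

def apply_curation(found_pids):
    """Apply the manually curated overrides to a set of search results."""
    return (set(found_pids) - ALWAYS_EXCLUDE) | ALWAYS_INCLUDE

curated = apply_curation({"doi:10.7910/DVN/XXXX", "doi:10.7910/DVN/ZZZZ"})
```

Keeping the overrides as data rather than edits scattered through the code makes each month's adjustments easy to review.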
Search details
The Python script uses the Search API to look across four metadata fields - Funding Information Agency, Contributor Name, Description, and Notes - for the full name of the NIH and its acronym and the full names of all NIH centers and institutes and most of their acronyms.
When looking through metadata in the Description and Notes fields, the script also requires variations of the words "fund", "sponsor", "award", and "support" to appear, which reduces the chance of matching datasets that mention an NIH center or institute without actually acknowledging NIH funding.
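One way such a query could be assembled is shown below. This is only a sketch: the Solr-style field names (`grantNumberAgency`, `contributorName`, `dsDescriptionValue`) and the short institute list are assumptions, not necessarily what the actual script or the Dataverse search index uses.

```python
# Hypothetical fragments of a Search API query. Quoted phrases match the
# full names; bare tokens match the acronyms.
institutes = [
    '"National Institutes of Health"', "NIH",
    '"National Cancer Institute"', "NCI",
]
# Trailing wildcards catch variations: "funded", "funding", "supported", etc.
funding_words = ["fund*", "sponsor*", "award*", "support*"]

names = " OR ".join(institutes)
words = " OR ".join(funding_words)

# The agency and contributor fields match on the name alone; the description
# field additionally requires a funding-related word to appear.
query = (
    f"grantNumberAgency:({names}) "
    f"OR contributorName:({names}) "
    f"OR (dsDescriptionValue:({names}) AND dsDescriptionValue:({words}))"
)
```

The assembled string would then be sent as the `q` parameter of a Search API request.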
Overview
Tracking issue for monthly reports of NIH-funded datasets in Harvard Dataverse.
Resources