Query the HCA Data Repository to determine if a source dataset has been ingested

NoopDog commented 6 months ago

Need

A key feature of the tracker is checking the ingestion status of datasets in the various systems that make up the wider HCA Data Ecosystem (HCA Data Repository, CELLxGENE, CAP, DUOS, etc.). When the system supports it, source datasets (and Atlases) will be synchronized by publication DOI, that is, matched to a dataset in the external system that has the same DOI.

User Flow

The user enters a source dataset DOI into the system using the "Create Source Dataset" flow.
The system looks up the DOI in CrossRef and, if a matching publication is found, saves portions of the publication metadata in a new source dataset record.
The system then checks if a matching DOI exists in the HCA Data Repository and, if so, adds the HCA Data Repository project ID to the source dataset record.
The source dataset edit form will show the matching HCA Data Repository project, if any.
The entry for the source dataset on the Atlas's source dataset list will show the status of the HCA Repository ingest.

We want to be able to check for matching DOIs in real-time (sub-second) so that this check can be included in the form processing flow.

The HCA Data Repository is updated roughly monthly.

The HCA Data Repository does not have an API for finding projects by DOI.

Approach

At a high level, the approach for this can be:

Query the /index/catalogs API to determine the default catalog.
Create a map of DOI to project by reading all of the datasets in the default catalog.
Determine if the map of DOI to project is stale by calling the /index/catalogs API and seeing if the default catalog has changed.
Refresh the DOI to project map if the default catalog has changed.
Query the map by DOI to determine the project ID, if any, and save it on the source dataset.

Default Catalog

The HCA Data Repository content is organized into "catalogs," and the content within a catalog is stable. There is an /index/catalogs API that returns a list of catalogs and an indicator of the default_catalog. The /index/catalogs response returns fairly quickly (about 300ms).

Here the default catalog is dcp35
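
A minimal sketch of reading the default catalog out of the /index/catalogs response. Only `default_catalog` is named in this issue; the `catalogs` field, the `CatalogsResponse` type and the `getDefaultCatalog` helper are assumptions for illustration, not the confirmed API shape.

```typescript
// Assumed shape of the /index/catalogs response body. Only `default_catalog`
// is mentioned in this issue; `catalogs` is illustrative.
interface CatalogsResponse {
  default_catalog: string;
  catalogs: Record<string, unknown>;
}

// Returns the default catalog name, throwing if the response lacks one.
function getDefaultCatalog(response: CatalogsResponse): string {
  const name = response.default_catalog;
  if (typeof name !== "string" || name.length === 0) {
    throw new Error("catalogs response has no default_catalog");
  }
  return name;
}
```

Keeping the parsing separate from the HTTP call makes it easy to unit test against a saved response.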

Retrieving the list of Datasets in a Catalog

We have done this many times, so we should have some code we can reuse that cycles through the pages in the /projects API response to get all projects.
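
Until the existing code is located, here is a rough sketch of the paging loop. The page shape (`hits` plus a `pagination.next` link) and the names `ProjectsPage`, `fetchAllProjects` and `fetchPage` are assumptions for illustration; the real response should be checked before relying on them.

```typescript
// Hypothetical page shape: a list of hits plus a link to the next page
// (null or absent on the last page).
interface ProjectsPage<T> {
  hits: T[];
  pagination: { next?: string | null };
}

// Follows `pagination.next` links until there are none left and returns
// every hit. `fetchPage` is injected so the loop can be tested offline.
async function fetchAllProjects<T>(
  firstUrl: string,
  fetchPage: (url: string) => Promise<ProjectsPage<T>>
): Promise<T[]> {
  const all: T[] = [];
  let url: string | null | undefined = firstUrl;
  while (url) {
    const page: ProjectsPage<T> = await fetchPage(url);
    all.push(...page.hits);
    url = page.pagination.next;
  }
  return all;
}
```

In the service, `fetchPage` could be a thin wrapper around the global `fetch` in Node.js 18+ that returns the parsed JSON body.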

Caching the HCA Data Repository Response

This can be held in Node.js memory, built on startup, with a staleness check before each use. I don't think we need a DB table for this.
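
As a sketch of the in-memory structure, a plain `Map` from DOI to project ID is probably enough. The `ProjectRecord` type below is a hypothetical, already-simplified view of a project (its ID plus any publication DOIs), not the actual /projects response shape.

```typescript
// Hypothetical minimal project record: the ID plus any publication DOIs.
interface ProjectRecord {
  projectId: string;
  dois: string[];
}

// Builds a DOI -> project ID map. Keys are trimmed and lowercased because
// DOIs are case-insensitive. If two projects share a DOI, the first wins.
function buildDoiMap(projects: ProjectRecord[]): Map<string, string> {
  const map = new Map<string, string>();
  for (const project of projects) {
    for (const doi of project.dois) {
      const key = doi.trim().toLowerCase();
      if (key && !map.has(key)) {
        map.set(key, project.projectId);
      }
    }
  }
  return map;
}
```

Whether "first project wins" is the right rule for a DOI shared by several projects is an open question; the map value could instead be a list of project IDs.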

Determining if the HCA Data Repository has been updated

Check if the default catalog has changed from when the cache was created.
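
A possible shape for that check, assuming the cache remembers which catalog it was built from. `DoiCache`, `getDoiCache`, `fetchDefaultCatalog` and `buildMap` are hypothetical names; the two callbacks stand in for the /index/catalogs call and the full /projects read.

```typescript
// In-memory cache entry: the catalog it was built from plus the DOI map.
interface DoiCache {
  catalog: string;
  doiToProjectId: Map<string, string>;
}

let cache: DoiCache | null = null;

// Returns a cache that matches the current default catalog, rebuilding it
// only when it is missing or was built from a different catalog.
async function getDoiCache(
  fetchDefaultCatalog: () => Promise<string>,
  buildMap: (catalog: string) => Promise<Map<string, string>>
): Promise<DoiCache> {
  const current = await fetchDefaultCatalog();
  let entry = cache;
  if (entry === null || entry.catalog !== current) {
    entry = { catalog: current, doiToProjectId: await buildMap(current) };
    cache = entry;
  }
  return entry;
}
```

Each call costs one /index/catalogs round trip (about 300ms per the measurement above), so a lookup against a fresh cache stays sub-second. Two concurrent calls after a catalog change would both rebuild the map in this sketch; sharing a single in-flight promise would avoid that.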

Checking if a dataset with a matching DOI exists

Check the cache using the DOI as a key.
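
One thing to watch is that the user-entered DOI and the cached key are in the same form. A sketch of a normalized lookup, assuming cache keys are bare lowercase DOIs; `normalizeDoi` and `findProjectIdByDoi` are hypothetical helpers.

```typescript
// Reduces a user-entered DOI to a bare, lowercase form, e.g.
// "https://doi.org/10.1000/ABC" becomes "10.1000/abc".
function normalizeDoi(input: string): string {
  return input
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\/(dx\.)?doi\.org\//, "")
    .replace(/^doi:\s*/, "");
}

// Returns the matching project ID, or null when the DOI is not in the map.
function findProjectIdByDoi(
  doiToProjectId: Map<string, string>,
  doi: string
): string | null {
  return doiToProjectId.get(normalizeDoi(doi)) ?? null;
}
```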

NoopDog commented 6 months ago

Note @hunterckx that I think the DOI is in the /projects response, so there is no need to call /projects/id for each project.

NoopDog commented 6 months ago

Complete! Thx

clevercanary / hca-atlas-tracker