ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

NRES & bionetwork backfill #1301

Open arschat opened 1 week ago

arschat commented 1 week ago

There are two bulk metadata updates at the project level that we'd like to do.

Reasoning

  1. NRES addition in all open access datasets. After the introduction of managed access datasets in the portal, we would like to add the data_use_restriction field to the metadata of all open access projects, i.e. all portal projects that were not covered by the previous bulk update in #1270. This would require bumping the project schema version to 19.0.0 and adding the field "data_use_restriction": "NRES" to the project metadata.
  2. Bionetwork backfilling. Dave asked us to add the bionetwork information in the schema, since the portal started showing the biological network on the front page by default. There are a couple of open questions here:
    a. What is the true list of bionetworks? Is it the tracker?
    b. What is the true list of atlas names? In the tracker some atlas names are initials (e.g. MSK 1.0 or ORCF 1.0). Do we want to add these names?
    c. Projects in the portal with no bionetwork: would we like to show None instead of unspecified?
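The NRES change in item 1 could be sketched roughly as below. This is only an illustration, not the script from #1270: it assumes the project content is a JSON object whose schema version is the x.y.z segment of a describedBy URL, and the function name add_nres and the example URL are made up for the sketch.

```python
import copy
import re

TARGET_VERSION = "19.0.0"  # project schema version named in this issue


def add_nres(project: dict) -> dict:
    """Return a copy of the project content with NRES set and the schema bumped."""
    updated = copy.deepcopy(project)  # leave the caller's dict untouched
    # Assumption: the version is the first x.y.z path segment of describedBy.
    updated["describedBy"] = re.sub(
        r"/\d+\.\d+\.\d+/", f"/{TARGET_VERSION}/", updated["describedBy"], count=1
    )
    # setdefault keeps any value that is already present, so a project that
    # already carries a restriction is not overwritten.
    updated.setdefault("data_use_restriction", "NRES")
    return updated


# Illustrative input only; real project content has many more fields.
example = {
    "describedBy": "https://schema.humancellatlas.org/type/project/17.0.0/project",
    "schema_type": "project",
}
patched = add_nres(example)
```

Working on a copy keeps the original content available for comparison before anything is sent back to ingest.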

Plan

Since both pieces of metadata live at the project level, we would like to apply the updates using @idazucchi's script, which exports only project metadata (we don't have to move the state to graph valid, just return it to exported). The steps would be:

  1. Select projects (uuids) that need update for NRES
  2. Select projects (uuids) that need bionetwork update & appropriate bionetwork(s)
  3. Select projects (uuids) that need atlas name & version update & appropriate atlas name(s) & version(s)
  4. Write a script that updates this information via API calls to ingest
  5. Export project metadata via Ida's script
  6. Send the bulk import form to Travis

  • 1, 2, 3: tasks can be done via the Task tracker spreadsheet
  • 4: script is almost ready from the previous bulk update in #1270 (see comments for script); a few modifications might be needed
  • 5: if we provide uuids to the script, it runs quickly
  • 6: we can also extract the project title in order to populate the import form easily

Estimated time needed: ~2 days
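Steps 1 and 2 could be sketched as follows, assuming the Task tracker spreadsheet is exported to CSV. The column names (project_uuid, access_type, bionetwork), the ";" separator and the sample rows are all hypothetical and would need to match the real tracker.

```python
import csv
import io


def plan_updates(tracker_csv: str) -> dict:
    """Split tracker rows into NRES candidates and per-project bionetworks."""
    nres = []
    bionetworks = {}
    for row in csv.DictReader(io.StringIO(tracker_csv)):
        uuid = row["project_uuid"].strip()
        if not uuid:
            continue  # rows without a uuid cannot be updated in ingest
        # Assumption: open access projects are marked "open" in the tracker.
        if row["access_type"].strip().lower() == "open":
            nres.append(uuid)
        # Assumption: multiple bionetworks are separated by ";".
        networks = [n.strip() for n in row["bionetwork"].split(";") if n.strip()]
        if networks:
            bionetworks[uuid] = networks
    return {"nres": nres, "bionetworks": bionetworks}


# Placeholder rows; the uuids and network names are not real values.
SAMPLE = """project_uuid,access_type,bionetwork
uuid-1,open,network_a; network_b
uuid-2,managed,
uuid-3,open,
"""
plan = plan_updates(SAMPLE)
```

The same pattern would extend to step 3 by reading atlas name and version columns into a second mapping.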

Risks

  1. information in the tracker is not up to date
    • we will update the project or re-run this script for bulk updates in a later release
  2. an old project gets an error in import validation
    • drop the project from the current release & investigate how we can re-export to avoid errors
    • ask the import team to re-populate the staging area with the reverse-import script & try again

arschat commented 6 days ago

communication with exec office about bionetwork content