chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.
MIT License

Increase the velocity of dataset schema evolution and dataset migration #365

Closed brianraymor closed 9 months ago

brianraymor commented 1 year ago

Goal Alignment

H2 Goal 2: The quality of CZ Census metadata is improved.


Technical Design

Tech Spec: Increase the velocity of dataset schema evolution and dataset migration.


Aspiration

This is the next step toward upgrading Discover tooling so that data submitters have access to up-to-date annotations, data integrators have access to key fields for data integration, and curators can execute data migrations more easily.

Ontology-only updates and migrations must be performed on a frequent cadence (monthly) with limited curator intervention and without the requirement to always download and upload revised datasets. This implicitly means that the upload of a dataset should not be the only trigger for the validation and conversion steps in the ingestion pipeline.

As @ambrosejcarr recently wrote:

Curation and migration ensure that the datasets we host retain their value longer ... by ensuring that datasets have annotations that reflect the most up to date thought on biological concepts and enable the data to be used, compared, and integrated more seamlessly.


Conceptual Workflow for Migrations

This will build on the dry-run and curator scripting model created in "Increase the velocity of ontology-only dataset schema updates and dataset migration".

Ordering of Operations

When refinements are made to existing metadata fields, there may be cases where multiple fields are impacted and operations must be ordered. During schema 4.0.0 migration:

  1. The new tissue_type field is annotated by detecting the " (organoid)" or " (cell culture)" suffix for tissue_ontology_term_id.
  2. Then, that suffix must be removed from tissue_ontology_term_id.
  3. Then, validation can occur.
  4. tissue is corrected when its label is re-applied.
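
The ordering above can be sketched for a single record as follows. This is a minimal illustration, not the migration code: `migrate_tissue`, `SUFFIX_TO_TISSUE_TYPE`, the `LABELS` lookup, and the toy `validate` function are hypothetical names, and defaulting `tissue_type` to "tissue" when no suffix is present is an assumption.

```python
# Hypothetical sketch of the schema 4.0.0 ordering; all names are illustrative.
SUFFIX_TO_TISSUE_TYPE = {
    " (organoid)": "organoid",
    " (cell culture)": "cell culture",
}

# Hypothetical stand-in for an ontology label lookup.
LABELS = {"UBERON:0002048": "lung"}


def migrate_tissue(record, validate):
    """Apply the four steps, in order, to one obs-like record (a dict)."""
    term_id = record["tissue_ontology_term_id"]
    # Step 1: derive tissue_type from the suffix (default "tissue" is an assumption).
    tissue_type = "tissue"
    for suffix, value in SUFFIX_TO_TISSUE_TYPE.items():
        if term_id.endswith(suffix):
            tissue_type = value
            # Step 2: only after detection, strip the suffix from the term id.
            term_id = term_id[: -len(suffix)]
            break
    migrated = dict(record, tissue_type=tissue_type, tissue_ontology_term_id=term_id)
    # Step 3: validate once both fields are consistent.
    validate(migrated)
    # Step 4: re-apply the label from the cleaned term id.
    migrated["tissue"] = LABELS[term_id]
    return migrated


def validate(record):
    """Toy validator: reject term ids that still carry a suffix."""
    if "(" in record["tissue_ontology_term_id"]:
        raise ValueError("suffix must be removed before validation")


result = migrate_tissue(
    {"tissue_ontology_term_id": "UBERON:0002048 (organoid)", "tissue": "lung (organoid)"},
    validate,
)
```

Swapping steps 1 and 2 would lose the information needed to annotate tissue_type, which is why the order matters.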

dry-run

A dry-run is performed to assess the impact of schema field changes and to collect guidance from Lattice for cases requiring curator intervention prior to updating datasets.

Automate Updates to Ontologies and Gene References

Note: Refined and addressed in "Increase the velocity of ontology-only dataset schema updates and dataset migration".


Add Metadata field

Metadata fields may be added. It must be possible for curators to script annotations for added metadata fields that curators MUST annotate.

The following metadata fields are added in schema 4.0.0, but DO NOT require curator intervention:

Note: tissue_type is the only addition that curators MUST annotate for new datasets.

Nonetheless, the design must support future automation.

IF ALL or MOST values for an added metadata field that curators MUST annotate cannot be automatically annotated during migration, THEN this is communicated to curators during the schema design rather than reported during the dry-run. A previous example in schema 3.0.0 was donor_id.

IF SOME values for an added metadata field that curators MUST annotate cannot be automatically annotated, THEN document the following in curator report:

A previous example in schema 3.0.0 was suspension_type.
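
The "SOME values" case could be scripted along these lines. This is a sketch under assumptions: `annotate_field`, the `rule` callable, the lookup table, and the shape of each report entry are hypothetical and are not taken from the tech spec.

```python
def annotate_field(records, field, rule):
    """Annotate `field` where `rule` yields a value; collect the rest for the report.

    `rule` is a hypothetical curator-supplied callable that returns a value,
    or None when the value cannot be determined automatically.
    """
    report = []
    for index, record in enumerate(records):
        value = rule(record)
        if value is None:
            # Left unannotated so a curator can resolve it after the dry-run.
            report.append({"row": index, "field": field, "reason": "no automatic annotation"})
        else:
            record[field] = value
    return report


# Hypothetical rule: derive the new field from an existing one via a lookup table.
ASSAY_TO_VALUE = {"assay-a": "cell", "assay-b": "nucleus"}

records = [{"assay": "assay-a"}, {"assay": "assay-c"}, {"assay": "assay-b"}]
report = annotate_field(records, "new_field", lambda r: ASSAY_TO_VALUE.get(r["assay"]))
```

Rows that the rule cannot resolve stay untouched, so a dry-run can list them without modifying the dataset.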


Update Metadata field requirements

Metadata fields may have updated requirements that introduce fresh failures. Updates may include stricter requirements, changes to the format, or changes to the type.

It must be possible for curators to script annotations for metadata fields with updated requirements that curators MUST annotate.

The following metadata fields that curators MUST annotate have updated requirements in schema 4.0.0:

A previous example in schema 3.0.0 was:

For historical reasons, there are three cases where curated fields exist in many datasets that were neither documented in the schema nor fully validated by cellxgene-schema CLI:

These cases are similar to updating the requirements for an existing metadata field.

IF an existing value is now invalid and cannot be corrected, THEN document the following in curator report:

Note: {column}_colors validation may produce many failures in the dry-run due to stricter validation. Potential failures may include:

  1. {column} references a label field instead of a term field. The {column} can be renamed from the label field to the term field if the term field is NOT also present as a {column} which would create a conflict that would require curator intervention. Per @jahilton - we could also choose to simply delete such offending {column}(s) because no one noticed the issue.
  2. The format for the value is RGB and must be converted.
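
Both potential fixes can be sketched in a few lines. This is an illustration only: `rename_colors_key` and `rgb_to_hex` are hypothetical helpers, the hex string as the conversion target is an assumption, and so is the convention that float channels lie in [0, 1] while int channels lie in [0, 255].

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b) triple to '#rrggbb'.

    Assumes floats are in [0, 1] and ints are in [0, 255].
    """
    channels = []
    for channel in rgb:
        if isinstance(channel, float):
            channel = round(channel * 255)
        if not 0 <= channel <= 255:
            raise ValueError(f"channel out of range: {channel}")
        channels.append(int(channel))
    return "#{:02x}{:02x}{:02x}".format(*channels)


def rename_colors_key(uns, label_field, term_field):
    """Rename '<label>_colors' to '<term>_colors' in `uns` when that is safe.

    Returns "absent", "conflict" (curator intervention needed), or "renamed".
    """
    old_key, new_key = f"{label_field}_colors", f"{term_field}_colors"
    if old_key not in uns:
        return "absent"
    if new_key in uns:
        return "conflict"
    uns[new_key] = uns.pop(old_key)
    return "renamed"


uns = {"cell_type_colors": [(1.0, 0.0, 0.0), (0, 128, 255)]}
status = rename_colors_key(uns, "cell_type", "cell_type_ontology_term_id")
key = "cell_type_ontology_term_id_colors"
uns[key] = [rgb_to_hex(color) for color in uns[key]]
```

Returning a status instead of raising lets a dry-run count conflicts across many datasets before anyone decides whether to rename or delete the offending keys.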

Deprecate Metadata field

Deprecated metadata fields must be deleted from datasets during migration. There are no cases in schema 4.0.0. Nonetheless, the design must support future automation.

A previous example in schema 3.0.0 was:


Rename Metadata field

Metadata fields may be renamed. There are no cases in schema 4.0.0. Nonetheless, the design must support future automation.
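
One way to keep renames and deprecations automatable is to declare them as data and apply them with a single function. The sketch below assumes metadata columns are held in a plain dict; `apply_field_changes` and the field names are hypothetical and do not come from any schema version.

```python
def apply_field_changes(obs, renames, deprecated):
    """Return a copy of `obs` (a dict of columns) with renames and deletions applied."""
    migrated = dict(obs)
    for old, new in renames.items():
        if old in migrated:
            if new in migrated:
                # A clash cannot be resolved automatically; surface it instead.
                raise ValueError(f"cannot rename {old!r}: {new!r} already exists")
            migrated[new] = migrated.pop(old)
    for field in deprecated:
        # Deleting a field that is already absent is not an error.
        migrated.pop(field, None)
    return migrated


obs = {"old_name": ["a", "b"], "obsolete_field": [1, 2], "kept": [True, False]}
migrated = apply_field_changes(
    obs,
    renames={"old_name": "new_name"},
    deprecated=["obsolete_field"],
)
```

Because the input is copied, the same call can serve a dry-run that compares `obs` with `migrated` without touching the stored dataset.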

Previous examples in schema 3.0.0 were:


Warnings from new AnnData versions

See #single-cell-four.

Example:

2023-02-21 22:15:44 804754  WARNING  /home/bruce/cell-census/venv/lib/python3.9/site-packages/anndata/compat/__init__.py:263: FutureWarning: During AnnData slicing, found matrix at .uns['neighbors_hm']['connectivities'] that happens to be dimensioned at n_obs×n_obs (4329×4329).

These matrices should now be stored in the .obsp attribute.
This slicing behavior will be removed in anndata 0.8.

IF warnings occur, THEN document the following in the curator report:

More needs to be understood about warnings before an automated solution can be proposed.
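
Until an automated fix is proposed, warnings can at least be captured for the curator report with the standard library `warnings` module. In this sketch, `run_and_collect_warnings`, the report entry layout, and `noisy_step` (a stand-in for a real dataset operation) are hypothetical.

```python
import warnings


def run_and_collect_warnings(step, *args, **kwargs):
    """Run `step` and return (result, captured warnings as report-ready dicts)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" prevents repeat-suppression from hiding warnings.
        warnings.simplefilter("always")
        result = step(*args, **kwargs)
    entries = [
        {
            "category": w.category.__name__,
            "message": str(w.message),
            "source": f"{w.filename}:{w.lineno}",
        }
        for w in caught
    ]
    return result, entries


def noisy_step():
    # Stand-in for a dataset operation that triggers a library warning.
    warnings.warn("matrix should be stored in .obsp", FutureWarning)
    return "done"


result, entries = run_and_collect_warnings(noisy_step)
```

Recording the category and source location makes it easier to group identical warnings across datasets.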


Automate preprint DOI to published DOI updates

"Update the DOI in the A single-cell transcriptional roadmap of the mouse and human lymph node lymphatic vasculature collection" is an example of an updated DOI that was discovered during prototyping. If an existing preprint DOI is queried again AND it has been published since the previous query, then Crossref returns the published DOI under relation.is-preprint-of in the response message:

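# `message` is the Crossref response message for the preprint DOI;
# `is_preprint` is determined by the caller.
if is_preprint:
    try:
        published_doi = message['relation']['is-preprint-of']
        # the new DOI to query for ...
        if published_doi[0]['id-type'] == 'doi':
            print(published_doi[0]['id'])
    except (KeyError, IndexError):
        # no published version is recorded for this preprint
        pass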

This would allow the portal to refresh preprint DOI(s) with their published DOI(s) on a regular cadence.

Lattice discovered a case in the past where the publishers failed to update the relationship between a preprint DOI and its publication DOI.
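
A periodic refresh could be sketched as below. `published_doi`, `refresh_dois`, and the injected `fetch_message` callable (which would wrap the Crossref query) are hypothetical, and the DOIs are made up. The sketch assumes that only preprint records carry an is-preprint-of relation. Given the Lattice case above, a missing relation does not prove that the preprint is unpublished.

```python
def published_doi(message):
    """Return the published DOI from a Crossref work message, or None."""
    relations = message.get("relation", {}).get("is-preprint-of", [])
    for relation in relations:
        if relation.get("id-type") == "doi":
            return relation["id"]
    return None


def refresh_dois(collection_dois, fetch_message):
    """Map collection id -> new DOI for preprints that have since been published.

    `fetch_message` is a hypothetical callable returning the Crossref
    message dict for a DOI; injecting it keeps this sketch offline.
    """
    updates = {}
    for collection_id, doi in collection_dois.items():
        new_doi = published_doi(fetch_message(doi))
        if new_doi and new_doi != doi:
            updates[collection_id] = new_doi
    return updates


# Made-up messages standing in for Crossref responses.
FAKE_MESSAGES = {
    "10.1101/example.preprint": {
        "relation": {"is-preprint-of": [{"id-type": "doi", "id": "10.1234/example.published"}]}
    },
    "10.1101/still.preprint": {"relation": {}},
}

updates = refresh_dois(
    {"collection-1": "10.1101/example.preprint", "collection-2": "10.1101/still.preprint"},
    FAKE_MESSAGES.__getitem__,
)
```

Collections absent from `updates` keep their current DOI and can be flagged for manual review if they stay preprints for a long time.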


Automate Dataset Title updates

This automation is out of scope for this iteration.

This feature has been requested in the past.

There have also been conversations in #single-cell-data-wrangling about better naming guidelines that might require an editorial pass on dataset titles in the future.


Inspiration

@jahilton shared _Opportunities to optimize_ based on the Lattice experience with the recent 3.0.0 migration.

Earlier blue-skying on Validation/Submission strategy (aka Automating Migration)

Review LinkML

metakuni commented 1 year ago

Updated epic as:

The end date will be further refined after the schema 4 tech spec, epics/issues and estimates are refreshed by @nayib-jose-gloria in the week of 2023-09-11.

metakuni commented 1 year ago

Schema 4 migration completion moved to 2023-12-01 based on project re-estimation and extension of CI/CD work.