Closed brianraymor closed 9 months ago
Updated epic as:
The end date will be further refined after the schema 4 tech spec, epics/issues and estimates are refreshed by @nayib-jose-gloria in the week of 2023-09-11.
Schema 4 migration completion moved to 2023-12-01 based on project re-estimation and extension of CI/CD work.
Goal Alignment
H2 Goal 2: The quality of CZ Census metadata is improved.
Technical Design
Tech Spec: Increase the velocity of dataset schema evolution and dataset migration.
Aspiration
This is the next step toward upgrading Discover tooling so that data submitters have access to up-to-date annotations, data integrators have access to key fields for data integration, and it is easier for curators to execute data migrations.
Ontology-only updates and migrations must be performed on a frequent cadence (monthly) with limited curator intervention and without the requirement to always download and upload revised datasets. This implicitly means that the upload of a dataset should not be the only trigger for the validation and conversion steps in the ingestion pipeline.
As @ambrosejcarr recently wrote:
Conceptual Workflow for Migrations
This will build on the dry run and curator scripting model created in Increase the velocity of ontology-only dataset schema updates and dataset migration
Ordering of Operations
When refinements are made to existing metadata fields, there may be cases where multiple fields are impacted and operations must be ordered. During schema 4.0.0 migration:
1. tissue_type field is annotated by detecting the " (organoid)" or " (cell culture)" suffix for tissue_ontology_term_id.
2. tissue_ontology_term_id.
3. tissue is corrected when its label is re-applied.
dry-run
A dry-run is performed to assess the impact from schema field changes and to collect guidance from Lattice for cases requiring curator intervention prior to updating datasets.
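The ordering above, run as a dry run, might look like the following minimal sketch. It assumes obs rows are plain dicts, that the second step strips the suffix from tissue_ontology_term_id (the list above does not spell that step out), that unsuffixed terms default to a tissue_type of "tissue", and that a hypothetical LABELS table stands in for the ontology lookup. None of these names come from the actual migration code.

```python
import copy

SUFFIX_TO_TYPE = {" (organoid)": "organoid", " (cell culture)": "cell culture"}

# Hypothetical label lookup standing in for a real ontology query.
LABELS = {"UBERON:0002048": "lung"}


def migrate_tissue(row):
    """Apply the three tissue operations, in order, to one obs row (a dict)."""
    term = row["tissue_ontology_term_id"]
    # 1. Annotate tissue_type by detecting the suffix.
    #    Assumption: terms without a suffix are plain tissues.
    row["tissue_type"] = "tissue"
    for suffix, tissue_type in SUFFIX_TO_TYPE.items():
        if term.endswith(suffix):
            row["tissue_type"] = tissue_type
            # 2. Assumption: the suffix is then removed from the term id.
            term = term[: -len(suffix)]
            break
    row["tissue_ontology_term_id"] = term
    # 3. Re-apply the label, which corrects tissue.
    row["tissue"] = LABELS.get(term, row["tissue"])
    return row


def dry_run(rows):
    """Run the migration on copies and report what would change."""
    report = []
    for index, row in enumerate(rows):
        migrated = migrate_tissue(copy.deepcopy(row))
        changes = {k: (row.get(k), v) for k, v in migrated.items() if row.get(k) != v}
        if changes:
            report.append((index, changes))
    return report
```

Because dry_run works on deep copies, the input rows are left untouched and the report can be reviewed before any dataset is updated.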
Automate Updates to Ontologies and Gene References
Note: Refined and addressed in Increase the velocity of ontology-only dataset schema updates and dataset migration.
Add Metadata field
Metadata fields may be added. It must be possible for curators to script annotations for added metadata fields that curators MUST annotate.
The following metadata fields are added in schema 4.0.0, but DO NOT require curator intervention:
citation is a known value.
feature_length is calculated.
observation_joinid is calculated.
schema_reference is a known value.
tissue_type is a known value.
Note: tissue_type is the only addition that curators MUST annotate for new datasets.
Nonetheless, the design must support future automation.
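One way to keep such additions automated is a small registry that maps each added field to a function producing its value. The sketch below is illustrative only: the dataset is a plain dict, and the constant and the length lookup are hypothetical placeholders, since the text above only says whether each value is known or calculated.

```python
# Hypothetical registry: each added field maps to a function of the dataset.
# The constant and the calculation below are placeholders, not the real rules.

def known(value):
    """Return an annotator that always yields the same known value."""
    return lambda dataset: value


def gene_lengths(dataset):
    """Placeholder calculation: look up a length for every feature id."""
    return [dataset["reference_lengths"].get(f) for f in dataset["features"]]


ADDED_FIELDS = {
    "schema_reference": known("<schema reference URL>"),
    "feature_length": gene_lengths,
}


def annotate_added_fields(dataset, registry=ADDED_FIELDS):
    """Fill every registered field that the dataset does not already have."""
    added = []
    for field, annotator in registry.items():
        if field not in dataset:
            dataset[field] = annotator(dataset)
            added.append(field)
    return added
```

With this shape, supporting a field added in a later schema version would mean registering one more function rather than changing the migration loop.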
IF ALL or MOST values for an added metadata field that curators MUST annotate cannot be automatically annotated during migration, THEN this is communicated to curators during the schema design rather than reported during the dry-run. A previous example in schema 3.0.0 was donor_id.
IF SOME values for an added metadata field that curators MUST annotate cannot be automatically annotated, THEN document the following in the curator report:
A previous example in schema 3.0.0 was suspension_type.
Update Metadata field requirements
Metadata fields may have updated requirements that introduce fresh failures. Updates may include stricter requirements, changes to the format, or changes to the type.
It must be possible for curators to script annotations for metadata fields with updated requirements that curators MUST annotate.
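A curator script for this case could be as small as a mapping from old values to corrected ones. The following is a minimal sketch under assumptions: values are held in a plain list, the updated requirement is passed in as an is_valid callable, and the function name and return shape are hypothetical rather than part of any existing tool.

```python
def apply_curator_mapping(values, is_valid, mapping):
    """Rewrite invalid values using a curator-supplied mapping.

    values   -- list of current annotations for one metadata field
    is_valid -- callable implementing the updated requirement (assumed)
    mapping  -- curator-scripted {old_value: new_value} corrections
    Returns the updated list and the sorted list of values left unresolved.
    """
    updated, unresolved = [], set()
    for value in values:
        if is_valid(value):
            updated.append(value)
        elif value in mapping and is_valid(mapping[value]):
            updated.append(mapping[value])
        else:
            updated.append(value)  # leave as-is; a curator must intervene
            unresolved.add(value)
    return updated, sorted(unresolved)
```

The unresolved list is what a dry-run would surface to curators; anything in it still fails the updated requirement after the scripted corrections.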
The following metadata fields that curators MUST annotate have updated requirements in schema 4.0.0:
cell_type_ontology_term_id
self_reported_ethnicity_ontology_term_id
tissue_ontology_term_id
A previous example in schema 3.0.0 was:
assay_ontology_term_id for annotating missing terms in EFO
For historical reasons, there are three cases where curated fields exist in many datasets that were neither documented in the schema nor fully validated by the cellxgene-schema CLI:
{column}_colors is added. It is consumed in CELLxGENE Explorer for color assignments. Unfortunately, the quality of its validation has been appalling. Exceptions have been suppressed. Values have been permitted that are then ignored. Please collaborate with @atarashansky on "{column}_colors must be validated".
These cases are similar to updating the requirements for an existing metadata field.
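To illustrate stricter {column}_colors checking that collects failures instead of suppressing them, here is a minimal sketch. The rules it checks (a matching categorical obs column, at least one color per category, six-digit hex strings) are assumptions made for illustration, not the validation rules of cellxgene-schema, and uns and obs are modelled as plain dicts.

```python
import re

HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def validate_colors(uns, obs_categories):
    """Return a list of failure messages for every {column}_colors key.

    uns            -- dict of unstructured metadata
    obs_categories -- {column: list of categories} for categorical obs columns
    The rules checked here are assumptions made for illustration.
    """
    failures = []
    for key, colors in uns.items():
        if not key.endswith("_colors"):
            continue
        column = key[: -len("_colors")]
        if column not in obs_categories:
            failures.append(f"{key}: no categorical obs column '{column}'")
            continue
        if len(colors) < len(obs_categories[column]):
            failures.append(f"{key}: fewer colors than categories")
        bad = [c for c in colors if not (isinstance(c, str) and HEX_COLOR.match(c))]
        if bad:
            failures.append(f"{key}: invalid color values {bad}")
    return failures
```

Returning messages rather than raising lets a dry-run list every problem in a dataset at once.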
IF an existing value is now invalid and cannot be corrected, THEN document the following in the curator report:
Note: {column}_colors validation may produce many failures in the dry-run due to stricter validation. Potential failures may include:
Deprecate Metadata field
Deprecated (deleted) metadata fields must be deleted. There are no cases in schema 4.0.0. Nonetheless, the design must support future automation.
A previous example in schema 3.0.0 was:
X_normalization
Rename Metadata field
Metadata fields may be renamed. There are no cases in schema 4.0.0. Nonetheless, the design must support future automation.
Previous examples in schema 3.0.0 were:
ethnicity_ontology_term_id to self_reported_ethnicity_ontology_term_id
ethnicity to self_reported_ethnicity
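Deprecations and renames lend themselves to a declarative description, so that a future schema version only needs new table entries. The sketch below uses the schema 3.0.0 examples given above. For simplicity it treats all metadata fields as one flat dict, which ignores where each field lives in a real dataset, and the function name is hypothetical.

```python
# Schema 3.0.0 examples from the text, expressed as data.
RENAMES = {
    "ethnicity_ontology_term_id": "self_reported_ethnicity_ontology_term_id",
    "ethnicity": "self_reported_ethnicity",
}
DELETIONS = ["X_normalization"]


def apply_renames_and_deletions(fields, renames=RENAMES, deletions=DELETIONS):
    """Return a new field dict with deprecated keys dropped and old names renamed."""
    migrated = {}
    for name, value in fields.items():
        if name in deletions:
            continue  # deprecated field: drop it
        migrated[renames.get(name, name)] = value
    return migrated
```

The input dict is not modified, so the same function can serve a dry-run and the real migration.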
Warnings from new AnnData versions
See #single-cell-four.
Example:
IF warnings occur, THEN document the following in the curator report:
More needs to be understood about warnings before an automated solution can be proposed.
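Until more is known, warnings can at least be captured rather than lost in logs. The following sketch uses Python's standard warnings module to record whatever a migration step emits. The wrapper name and the summary format are hypothetical, and it makes no attempt to resolve the warnings.

```python
import warnings


def run_with_warning_capture(step, *args, **kwargs):
    """Call step and return (result, list of warning summaries)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = step(*args, **kwargs)
    summaries = [f"{w.category.__name__}: {w.message}" for w in caught]
    return result, summaries
```

The summaries could be attached to the curator report for the dataset that triggered them.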
Automate preprint DOI to published DOI updates
"Update the DOI in the A single-cell transcriptional roadmap of the mouse and human lymph node lymphatic vasculature collection" is an example of an updated DOI that was discovered during prototyping. If an existing preprint DOI is queried again AND the preprint has been published since the previous query, then Crossref returns the published DOI in:
This would allow the portal to refresh preprint DOI(s) with their published DOI(s) on a regular cadence.
Lattice discovered a case in the past where the publishers failed to update the relationship between a preprint DOI and its publication DOI.
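A periodic refresh could be sketched as follows. It assumes the published DOI is exposed under relation, then is-preprint-of, in the message object of a Crossref works response; that field path is an assumption here, not taken from the text above. The function names are hypothetical, and the Crossref query itself is injected so the logic can run without network access.

```python
def published_doi(crossref_message):
    """Return the published DOI for a preprint, or None if Crossref lists none.

    crossref_message -- the "message" object of a Crossref works response.
    Assumption: the link is exposed under relation -> is-preprint-of.
    """
    relations = crossref_message.get("relation", {}).get("is-preprint-of", [])
    for relation in relations:
        if relation.get("id-type") == "doi" and relation.get("id"):
            return relation["id"]
    return None


def refresh_dois(collections, fetch_message):
    """Map collection id -> (old DOI, new DOI) for preprints now published.

    fetch_message is injected (for example, a function that queries Crossref)
    so the logic can be exercised without network access.
    """
    updates = {}
    for collection_id, doi in collections.items():
        new_doi = published_doi(fetch_message(doi))
        if new_doi and new_doi.lower() != doi.lower():
            updates[collection_id] = (doi, new_doi)
    return updates
```

When a publisher has not recorded the relationship, as in the case Lattice found, published_doi returns None and the collection is simply left unchanged, so such cases would still need manual curation.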
Automate Dataset Title updates
This automation is out of scope for this iteration.
This feature has been requested in the past.
There have also been conversations in #single-cell-data-wrangling about better naming guidelines that might require an editorial pass on dataset titles in the future.
Inspiration
@jahilton shared _Opportunities to optimize_ based on the Lattice experience with the recent 3.0.0 migration.
Earlier blue-skying on Validation/Submission strategy (aka Automating Migration)
Review LinkML