HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Remove `schema_major_version` and `schema_minor_version` from provenance #1316

Open hannes-ucsc opened 4 years ago

hannes-ucsc commented 4 years ago

From DCP v2 spec, section Schema validation:

There is one known existing schema violation in documents in the DCP v1 production instance of DSS. The provenance.schema_major_version and provenance.schema_minor_version properties are present in metadata files submitted to the DSS from around Oct 2019 onwards. The addition of these fields was proposed in RFC 11. After that RFC was accepted, the provenance schema was revised and the Ingest component was modified to add those fields to documents but the schema reference in those documents still points at an old schema, not the revised one. This is further complicated by the fact that the provenance schema is referenced indirectly via the main document schema (example document). The analysis_file schema in that document is at version 6.0.0 while the new fields were introduced in version 6.2.0 (via the provenance schema version 1.1.0). The problem affects all metadata documents submitted after October of 2019, not just analysis_file documents.

To address this issue, the DSS adapter removes those two fields and the Ingest adapter is modified to not emit them. Luckily, the fields are not required so removing them from documents that do happen to carry an updated schema declaration does not invalidate those documents. The provenance schema will be revised to remove the fields again.

These two fields need to be removed from the schema:

https://github.com/HumanCellAtlas/metadata-schema/blob/c2ab8a5115a2fabd4fcbb57fda89022be12cd939/json_schema/system/provenance.json#L24

https://github.com/HumanCellAtlas/metadata-schema/blob/c2ab8a5115a2fabd4fcbb57fda89022be12cd939/json_schema/system/provenance.json#L31
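
For illustration only, the field removal that the spec attributes to the DSS adapter could look roughly like the sketch below. This is not the adapter's actual code. The function name `strip_schema_version_fields` is hypothetical, and the sketch assumes the document is an already-parsed JSON object with a top-level `provenance` object, as the property paths in the quoted spec suggest.

```python
import copy

# The two property names are taken from the issue text.
# Everything else in this sketch is a hypothetical illustration.
DEPRECATED_FIELDS = ("schema_major_version", "schema_minor_version")


def strip_schema_version_fields(document):
    """Return a copy of `document` without the two provenance version fields.

    Assumes `document` is a dict parsed from a metadata JSON file. Documents
    without a `provenance` object are returned unchanged (as a copy).
    """
    cleaned = copy.deepcopy(document)  # leave the caller's document untouched
    provenance = cleaned.get("provenance")
    if isinstance(provenance, dict):
        for field in DEPRECATED_FIELDS:
            # pop() with a default is a no-op when the field is absent
            provenance.pop(field, None)
    return cleaned
```

Because the fields are not required, dropping them should not invalidate a document regardless of which provenance schema version it declares, which is the point the spec makes above.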

mshadbolt commented 4 years ago

Reopening because I could not incorporate these into the latest release.

hannes-ucsc commented 4 years ago

Noticed that the resolution was backed out. Why does releasing a new version of a schema invalidate existing documents? I thought the whole point of the schema versioning process is to allow us to freely release new schemas without affecting existing metadata documents.

mshadbolt commented 4 years ago

We don't have a way to migrate things in ingest. It wasn't known what impact this change would have, and given that it affects every type schema, I was hesitant to make the change before we have a good understanding of the consequences. My understanding was that the file_descriptor change was high priority, so I pushed that change and held off on this one.

Yes, in theory we are supposed to be able to update metadata schemas as and when we need to, but in reality we haven't been able to for over a year.

hannes-ucsc commented 4 years ago

I think it was the right call to back the changes out if they are destabilizing. I am trying to understand why they are destabilizing.

Why does releasing a new version of an existing schema invalidate existing documents? Those refer to the old version, which doesn't change, correct?