chanzuckerberg / cryoet-data-portal

CryoET Data Portal
MIT License

Update ingestion config validation for ctf and alignment entities #997

Open manasaV3 opened 1 month ago

manasaV3 commented 1 month ago

blocking https://github.com/chanzuckerberg/cryoet-data-portal/issues/1091

Motivation

To be able to ingest the Alignments and Frames, and create relevant metadata, we need to update the config to support the new ingestion entities and updates to the metadata.

Definition of Done

  1. We are able to create a valid config that has entries for the CollectionMetadata and Alignment.
  2. We are able to create a valid config that has entries for new metadata fields specified for Tomogram and Annotations.

The valid config refers to a config created in the dataset_configs folder that passes all the validation checks specified in the make file.

Tasks

For the following entities, updates have to be made to:

  1. the dataset_config template to have an up to date sample for anyone creating a new config yaml file
  2. the ingestion_config to ensure we have validation for the created configs

collection_metadata

The mdoc files are currently ingested as a part of the rawtilts entity. This should no longer be the case.

Sources

The sources should follow the default structure

Config Migration

This requires an update to all existing config files where source globs with the mdoc extension are listed under rawtlt. The mdoc glob entries should be moved under collection_metadata.sources, which requires creating the collection_metadata entity.
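
The migration described above could be sketched roughly as follows. This is only an illustration, not the project's migration script: it assumes the config has already been loaded into a plain dict, that each `rawtlt` entry holds a `sources` list whose items carry a `source_multi_glob.list_globs` list of glob strings, and the helper name `move_mdoc_globs` is hypothetical.

```python
import copy


def move_mdoc_globs(config, src_key="rawtlt", dst_key="collection_metadata"):
    """Return a copy of `config` with *.mdoc globs moved from src_key to dst_key.

    The nested layout (sources -> source_multi_glob -> list_globs) is an
    assumption made for this sketch, not a confirmed schema.
    """
    config = copy.deepcopy(config)  # leave the caller's dict untouched
    moved = []
    for entry in config.get(src_key, []):
        for source in entry.get("sources", []):
            multi = source.get("source_multi_glob")
            if not multi:
                continue
            globs = multi.get("list_globs", [])
            moved += [g for g in globs if g.endswith(".mdoc")]
            multi["list_globs"] = [g for g in globs if not g.endswith(".mdoc")]
    if moved:
        # Create the destination entity only when there is something to move.
        config.setdefault(dst_key, []).append(
            {"sources": [{"source_multi_glob": {"list_globs": moved}}]}
        )
    return config
```

Working on a deep copy keeps the original config available for a before/after diff when reviewing the migrated files.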

alignment

The alignment files are currently ingested as a part of the rawtilts entity. This should no longer be the case.

Metadata

| field name | field type | required | default value | where to source it from if it already exists |
| --- | --- | --- | --- | --- |
| alignment_type | string(LOCAL,GLOBAL) | true | - | |
| offset.x | int | true | 0 | - |
| offset.y | int | true | 0 | - |
| offset.z | int | true | 0 | - |
| x_rotation_offset | int | true | 0 | - |
| tilt_offset | float | true | 0 | - |
| affine_transformation_matrix | 4x4 matrix | true | identity matrix | tomogram.affine_transformation_matrix |
| is_canonical | bool | true | true | - |
| format | str(IMOD, ARETOMO3) | true | - | - |
| volume_dimesion | dict | false | - | - |
| volume_dimesion.x | int | true | - | - |
| volume_dimesion.y | int | true | - | - |
| volume_dimesion.z | int | true | - | - |
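
To make the table above concrete, here is a minimal sketch of how the defaults and checks might look in plain Python. It is an assumption-laden illustration rather than the ingestion_config schema itself: the function names `with_defaults` and `validate` are hypothetical, the metadata is assumed to be a plain dict, and the accepted `format` values are taken from the table and may still change.

```python
ALIGNMENT_TYPES = {"LOCAL", "GLOBAL"}
FORMATS = {"IMOD", "ARETOMO3"}  # taken from the table; the final list may differ
AXES = ("x", "y", "z")


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def with_defaults(meta):
    """Return a copy of `meta` with the table's default values filled in."""
    out = dict(meta)
    offset = dict(out.get("offset") or {})
    for axis in AXES:
        offset.setdefault(axis, 0)
    out["offset"] = offset
    out.setdefault("x_rotation_offset", 0)
    out.setdefault("tilt_offset", 0.0)
    identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
    out.setdefault("affine_transformation_matrix", identity)
    out.setdefault("is_canonical", True)
    return out


def validate(meta):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    if meta.get("alignment_type") not in ALIGNMENT_TYPES:
        errors.append("alignment_type must be LOCAL or GLOBAL")
    if meta.get("format") not in FORMATS:
        errors.append("format must be one of " + ", ".join(sorted(FORMATS)))
    offset = meta.get("offset") if isinstance(meta.get("offset"), dict) else {}
    for axis in AXES:
        if not _is_int(offset.get(axis)):
            errors.append(f"offset.{axis} must be an int")
    if not _is_int(meta.get("x_rotation_offset")):
        errors.append("x_rotation_offset must be an int")
    tilt = meta.get("tilt_offset")
    if isinstance(tilt, bool) or not isinstance(tilt, (int, float)):
        errors.append("tilt_offset must be a number")
    matrix = meta.get("affine_transformation_matrix")
    rows_ok = isinstance(matrix, list) and len(matrix) == 4 and all(
        isinstance(row, list) and len(row) == 4 for row in matrix
    )
    if not rows_ok:
        errors.append("affine_transformation_matrix must be a 4x4 matrix")
    if not isinstance(meta.get("is_canonical"), bool):
        errors.append("is_canonical must be a bool")
    dim = meta.get("volume_dimesion")  # optional, spelled as in the issue
    if dim is not None:
        dim = dim if isinstance(dim, dict) else {}
        for axis in AXES:
            if not _is_int(dim.get(axis)):
                errors.append(f"volume_dimesion.{axis} must be an int")
    return errors
```

Returning a list of errors instead of raising on the first one lets a config author see every problem in a single validation pass.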

Sources

The sources should follow the default structure

Config Migration

This requires an update to all existing config files where source globs with xf, tlt (not rawtlt), aln, or com extensions are listed under rawtlt. These entries should be moved under alignment.sources, which requires creating the alignment entity.

The affine_transformation_matrix should also be moved to the alignment metadata from tomogram.
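
One detail worth getting right in this migration is that `.tlt` globs move while `.rawtlt` globs stay. A minimal sketch of the split, assuming globs are simple path-like strings ending in a single extension (the names `target_entity` and `partition_globs` are hypothetical, and mdoc globs are not handled here):

```python
from pathlib import PurePosixPath

# Extensions listed in the migration note above.
ALIGNMENT_SUFFIXES = {".xf", ".tlt", ".aln", ".com"}


def target_entity(glob):
    """Return the entity a source glob should live under after migration."""
    # Comparing the full suffix means "*.rawtlt" is not mistaken for "*.tlt".
    suffix = PurePosixPath(glob).suffix.lower()
    return "alignment" if suffix in ALIGNMENT_SUFFIXES else "rawtlt"


def partition_globs(globs):
    """Split globs currently listed under rawtlt into their target entities."""
    result = {"alignment": [], "rawtlt": []}
    for glob in globs:
        result[target_entity(glob)].append(glob)
    return result
```

For example, `partition_globs(["ts/*.rawtlt", "ts/*.tlt"])` keeps the first glob under `rawtlt` and places the second under `alignment`.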

Depending on the file extension of the source, the metadata.field should be:

frames

The frames currently don't have any metadata associated with them. They can have an optional metadata field.

Metadata

| field name | field type | required | default value | where to source it from if it already exists |
| --- | --- | --- | --- | --- |
| dose | float | true | - | - |
| defocus | float | true | - | - |
| astigmatism | float | true | - | - |
| astigmatic_angle | float | true | - | - |

Config Migration

No config migration required. All these fields will be added manually at a later point.

tomogram

Tomograms already have metadata associated with them. We need to update the fields that are currently supported and validated.

Metadata

| field name | field type | required | default value | where to source it from if it already exists |
| --- | --- | --- | --- | --- |
| is_portal_standard | bool | false | false | - |
| is_visualization_default | bool | true | true | - |
| cross_references | dict | false | - | - |
| cross_references.publications | string | false | - | - |
| cross_references.related_database_entries | string | false | - | - |
| dates | dict | true | - | deposition.metadata.dates |
| dates.deposition_date | date | true | - | |
| dates.last_modified_date | date | true | - | |
| dates.release_date | date | true | - | |

Config Migration

Add the two new bool fields to all existing config yamls, with the default values. The affine_transformation_matrix is moving to the alignment entity, but it should not be removed from the tomogram metadata: we will retain it there for backward compatibility.
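
As a rough sketch of this migration step, assuming the tomogram metadata and the deposition dates are available as plain dicts (the helper name `migrate_tomogram_metadata` and its arguments are hypothetical):

```python
def migrate_tomogram_metadata(tomogram_meta, deposition_dates):
    """Return a copy of the tomogram metadata with the new fields filled in.

    Existing values are never overwritten, and affine_transformation_matrix
    is left in place for backward compatibility.
    """
    out = dict(tomogram_meta)
    out.setdefault("is_portal_standard", False)
    out.setdefault("is_visualization_default", True)
    # Per the table, dates are sourced from deposition.metadata.dates.
    out.setdefault("dates", dict(deposition_dates))
    return out
```

Using `setdefault` keeps the step idempotent, so running it twice over the same config should not change the result.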

annotations

Standardized annotations are no longer a part of this effort.

Annotations already have metadata associated with them. We need to update the fields that are currently supported and validated.

Sources

~~As we can have portal standard annotations now. We should allow for up to 2 entries for the same shape in source, the caveat for that being, at least one of them should have a is_portal_standard set to true.~~

| field name | field type | required | default value | where to source it from if it already exists |
| --- | --- | --- | --- | --- |
| source.[].<>.is_portal_standard | bool | false | false | - |

Config Migration

Add the new bool field to all existing config yamls, with the default value.

manasaV3 commented 1 week ago

We are adding two new fields to alignment metadata.

  1. format: This will be used to determine how to process the alignment files. For the validation of this field, @uermel will provide the acceptable values.
  2. volume_dimesion: This will be used to provide the dimensions of the tomogram that will be created from this alignment. If it is not provided, this value will be sourced from the tomogram. This field is optional, but if it exists, it should contain entries for x, y and z.
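
The fallback rule for volume_dimesion could be sketched like this. The function name `resolve_volume_dimension` is hypothetical, and passing the tomogram dimensions in as a dict with x, y and z keys is an assumption of the sketch, not something specified in this issue.

```python
AXES = ("x", "y", "z")


def resolve_volume_dimension(alignment_meta, tomogram_dimension):
    """Pick the volume dimensions for the tomogram built from an alignment.

    Uses the alignment's volume_dimesion when present (it must then contain
    x, y and z); otherwise falls back to the tomogram's own dimensions.
    """
    dim = alignment_meta.get("volume_dimesion")  # spelled as in the issue
    if dim is None:
        return {axis: tomogram_dimension[axis] for axis in AXES}
    missing = [axis for axis in AXES if axis not in dim]
    if missing:
        raise ValueError("volume_dimesion is missing: " + ", ".join(missing))
    return {axis: dim[axis] for axis in AXES}
```

Rejecting a partially filled dict, rather than mixing alignment and tomogram values per axis, follows the rule that the field is optional as a whole but must be complete when present.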

manasaV3 commented 2 days ago

Any updates to the annotation source config will be addressed as a part of a different issue, once we have a decision on portal standard annotations being a part of the original or newer deposition.