chanzuckerberg / cryoet-data-portal-backend

CryoET Data Portal API server & ingestion scripts
MIT License

Create validation exclusions for dataset config extended validation #164

Closed daniel-ji closed 1 month ago

daniel-ji commented 1 month ago

Additional fixes: updated dataset_config_validate_extended.py with the correct required fields for metadata and root-level entries, and cleaned up its Python code.

Adds a flag that allows skipping ontology object ID / name validation (for the fields annotation_object, cell_component, cell_strain, cell_type, organism, tissue). This addresses the same concern as #149: sometimes it doesn't make sense to check an ontology object's ID against its name, because the name is intentionally different from what is stored online.

Feature description (also in schema/v1.1.0/docs/dataset_config_validate.yaml):

===================================

--field-excludelist-file A JSON file specifying which class-field-value mappings do not need to be validated when running Pydantic extended validation. Note that requirement / pattern / enum / type validation will still be performed.

Currently, this only supports skipping ontology object validation (dataset config fields: annotation_object, cell_component, cell_strain, cell_type, organism, tissue). This option is useful when certain field values are expected to fail validation intentionally. For example, you may not want to validate the name of a cell_strain, because it may differ from the name that the cell strain's id corresponds to online.

JSON file format (note that ClassNameToSkipOn is the class name in schema/v1.1.0/dataset_config_models.py, which may be different from the class name in the dataset configuration file):

{
    "ClassNameToSkipOn": {
        "field_name_to_skip_on1": ["field_value_to_skip1", "field_value_to_skip2"],
        "field_name_to_skip_on2": ["field_value_to_skip3", ...],
        ...
    },
    "AnotherClassNameToSkipOn": {
        ...
    },
    ...
}

Example file:

{
    "AnnotationObject": {
        "id": ["GO:0030992", "GO:0035869"],
        "name": []
    },
    "CellType": {
        "id": [],
        "name": ["umbilical vein endothelial cell"]
    }
}
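
To make the class-field-value lookup concrete, here is a minimal Python sketch of how an excludelist in this format could be loaded and consulted. It is not the code in dataset_config_validate_extended.py: the names `load_excludelist` and `should_skip` are hypothetical, and the assumed rule is that a check is skipped only when the value is listed under the matching class and field.

```python
import json
from typing import Dict, List

# Shape of the excludelist: {class name: {field name: [values to skip]}}
Excludelist = Dict[str, Dict[str, List[str]]]

# Same content as the example file above.
EXAMPLE_JSON = """
{
    "AnnotationObject": {
        "id": ["GO:0030992", "GO:0035869"],
        "name": []
    },
    "CellType": {
        "id": [],
        "name": ["umbilical vein endothelial cell"]
    }
}
"""


def load_excludelist(text: str) -> Excludelist:
    """Parse excludelist JSON and check it has the expected nested shape.

    Hypothetical helper, not part of the actual validation script.
    """
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("excludelist must be a JSON object")
    for class_name, fields in data.items():
        if not isinstance(fields, dict):
            raise ValueError(f"{class_name}: expected an object of fields")
        for field_name, values in fields.items():
            if not isinstance(values, list) or not all(
                isinstance(v, str) for v in values
            ):
                raise ValueError(
                    f"{class_name}.{field_name}: expected a list of strings"
                )
    return data


def should_skip(
    excludelist: Excludelist, class_name: str, field_name: str, value: str
) -> bool:
    """Return True if this class/field/value combination is excluded.

    Assumption: an empty list (or a missing class or field) excludes nothing.
    """
    return value in excludelist.get(class_name, {}).get(field_name, [])


if __name__ == "__main__":
    excludelist = load_excludelist(EXAMPLE_JSON)
    print(should_skip(excludelist, "AnnotationObject", "id", "GO:0030992"))
    print(should_skip(excludelist, "CellType", "id", "CL:0002618"))
```

Under these assumptions, an empty list such as `"name": []` is equivalent to omitting the field, so it only documents that nothing is excluded for it.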