chanzuckerberg / cryoet-data-portal-backend

CryoET Data Portal API server & ingestion scripts
MIT License

Add validation Github workflow #130

Closed daniel-ji closed 1 month ago

daniel-ji commented 1 month ago

Fourth of 4 PRs, as part of dataset configuration file validation. To be merged after #129

jgadling commented 1 month ago

The output of the action should probably list the files that failed validation

daniel-ji commented 1 month ago

The workflow outputs the log files as an artifact; the previous run is here: https://github.com/chanzuckerberg/cryoet-data-portal-backend/actions/runs/9883555982?pr=130. Sorry if this is a bit confusing, but I decided to do it this way because the log files can sometimes get quite long. Should I keep it like this?

Note that this is expected to fail right now, since we're still working through the validation and updating the schema / template / config files. Thanks!
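For context, publishing logs as a build artifact is usually done with the `actions/upload-artifact` action. A minimal sketch of such a step follows; the step name, artifact name, and log path are assumptions for illustration, not taken from this repo's actual workflow file:

```yaml
# Hypothetical workflow step; names and paths here are illustrative only.
- name: Upload validation logs
  if: always()  # upload logs even when the validation step fails
  uses: actions/upload-artifact@v4
  with:
    name: dataset-config-validation-logs
    path: dataset_validate_logs/
```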

jgadling commented 1 month ago

Ok, the full JSON output can be super long, but I think we can parse the errors down to something fairly reasonable (even if there are a lot of failures at the moment). Eventually we expect to have very few failures (probably only the most recent dataset being added to the repo), and having to download and review artifacts is a pretty cumbersome workflow IMO. Based on reviewing the JSON output, we should be able to print something along these lines:

FAIL: ../dataset_configs/10008.yaml
    - Extra inputs are not permitted: tiltseries.0.metadata.alignment_binning_factor 

FAIL: ../dataset_configs/10009.yaml
    - Field required: tiltseries.0.metadata.tilt_axis
    - Field required: tiltseries.2.metadata.tilt_axis

Basically we can just print the "msg" string for each error, plus the "loc" data joined with "." characters. That should be a good enough hint at what needs fixing, even if the loc strings look a little funny.
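The formatting described above can be sketched in a few lines of Python. This is only an illustration, not code from this repo: it assumes the validation errors have already been collected into a dict mapping each config file path to a list of pydantic-style error dicts (each with `"msg"` and `"loc"` keys, as `ValidationError.errors()` produces).

```python
def format_validation_errors(failures):
    """Render pydantic-style validation errors as short FAIL summaries.

    `failures` maps a config file path to a list of error dicts, each
    with "msg" and "loc" keys. How the errors are collected is outside
    the scope of this sketch; only the formatting is shown.
    """
    lines = []
    for path, errors in failures.items():
        lines.append(f"FAIL: {path}")
        for err in errors:
            # Join the "loc" tuple with "." so nested fields read as a path,
            # e.g. ("tiltseries", 0, "metadata") -> "tiltseries.0.metadata".
            loc = ".".join(str(part) for part in err["loc"])
            lines.append(f"    - {err['msg']}: {loc}")
    return "\n".join(lines)
```

Printed per run, this keeps the output short enough to read directly in the action log instead of a downloaded artifact.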

jgadling commented 1 month ago

Annnnd one more thing: the filtered error files (dataset_config_validate_errors_filtered.txt and dataset_config_validate_errors_filtered_2.txt) in the build artifact are a little hard to understand, and I think they only show errors for the last dataset that was processed? They seem too short. Personally I'd do away with the filtered output entirely and just have the utility script output errors in the format I suggested above ^^

daniel-ji commented 1 month ago

Sounds good! I've gone ahead and made your suggested changes. Note that the validation is still in progress and by no means comprehensive, which is why there may be fewer errors than expected. Thanks!