NASA-IMPACT / veda-data-airflow

Airflow implementation of ingest pipeline for VEDA STAC data

Reconcile validations with ingest-api #113

Open anayeaye opened 7 months ago

anayeaye commented 7 months ago

What

The workflows API duplicates validations that are executed by the ingest API. We need to make sure these have the same effect, or remove them from workflows where they are not needed.

Note that this is just a reminder to confirm that we didn't migrate any of our legacy validation bugs to the new workflows API. Success could be as simple as confirming that the validators in the workflows API are functionally the same as the recently corrected validators in veda-backend/ingest-api.
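One way to confirm functional equivalence is to run the same inputs through both versions of a validator and flag any input where the outcomes differ. The sketch below is only an illustration: `workflows_check_id` and `ingest_check_id` are hypothetical stand-ins written for this example, not the real validators from either repo, and the test inputs are made up.

```python
from typing import Any, Callable, Iterable, List, Tuple

Validator = Callable[[Any], Any]


def outcome(validator: Validator, value: Any) -> Tuple[str, str]:
    """Run a validator and summarise the result as (status, detail)."""
    try:
        return ("ok", repr(validator(value)))
    except (ValueError, TypeError, AssertionError) as err:
        return ("error", type(err).__name__)


def find_mismatches(a: Validator, b: Validator, cases: Iterable[Any]) -> List[Any]:
    """Return the inputs for which the two validators disagree."""
    return [case for case in cases if outcome(a, case) != outcome(b, case)]


# Hypothetical stand-ins, NOT the real implementations from either repo.
def workflows_check_id(collection_id: str) -> str:
    if not collection_id:
        raise ValueError("id must be non-empty")
    return collection_id


def ingest_check_id(collection_id: str) -> str:
    if not collection_id or collection_id != collection_id.lower():
        raise ValueError("id must be non-empty and lowercase")
    return collection_id


cases = ["my-collection", "My-Collection", ""]
mismatches = find_mismatches(workflows_check_id, ingest_check_id, cases)
print(mismatches)
```

In practice the two callables would be imported from the respective `schema_helpers.py` and `schemas.py` modules, and an empty mismatch list over a representative set of inputs would be the evidence that the validators behave the same.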

Moreover, the current workflows API schema base model does not include the renders or providers fields and will fail when run with those properties. Either these fields should be included in the workflows model, or downstream schema validation should be left to the ingestion API.
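To illustrate the first option, here is a minimal sketch using standard-library dataclasses as a stand-in for the real schema model. It assumes the failure comes from the model rejecting fields it does not declare. The class names, the field types, and the payload values are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StrictDataset:
    """Hypothetical model that does not declare renders or providers."""
    collection: str
    title: str


@dataclass
class ExtendedDataset:
    """Same hypothetical model with the two fields added as optional."""
    collection: str
    title: str
    renders: Optional[Dict[str, Any]] = None
    providers: Optional[List[Dict[str, Any]]] = None


# Made-up payload that carries both extra properties.
payload = {
    "collection": "example-collection",
    "title": "Example",
    "renders": {"dashboard": {"colormap_name": "viridis"}},
    "providers": [{"name": "Example Provider", "roles": ["host"]}],
}

try:
    StrictDataset(**payload)
    strict_accepts = True
except TypeError:
    # Undeclared keyword arguments are rejected by the generated __init__.
    strict_accepts = False

extended = ExtendedDataset(**payload)
print(strict_accepts, extended.providers[0]["name"])
```

Declaring the fields as optional keeps existing payloads valid while letting the new properties pass through to the ingestion API, which can then apply its own stricter validation.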

AC

botanical commented 6 months ago

I found one example of duplicated logic so far between the two repos (and will update this comment as I find more). The /dataset/publish endpoint seems to check if a collection exists twice.

The Dataset class also has a validation that checks whether the collection exists, because "we allow collection id to "break the rules" if an already-existing collection matches".

After the discover workflow kicks off an ingest workflow by calling the ingestion endpoint defined in veda-backend, the item parameter of veda-backend's enqueue_ingestion uses schemas.AccessibleItem. This class has validators.collection_exists.

Therefore, across veda-data-airflow and veda-backend, the check that a collection exists runs twice for the same request.
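The flow described above can be sketched roughly as follows. This is a simplified illustration, not the actual code: the function bodies, the in-memory collection registry, and the call log are all hypothetical, and only the names `collection_exists` and `enqueue_ingestion` are borrowed from the thread.

```python
from typing import List, Set

# Hypothetical registry standing in for the STAC database lookup.
EXISTING_COLLECTIONS: Set[str] = {"example-collection"}
lookups: List[str] = []  # records every existence check, for illustration


def collection_exists(collection_id: str) -> None:
    """Stand-in for validators.collection_exists; logs each call."""
    lookups.append(collection_id)
    if collection_id not in EXISTING_COLLECTIONS:
        raise ValueError(f"Collection {collection_id!r} does not exist")


def enqueue_ingestion(collection_id: str) -> str:
    """Stand-in for the ingest API side (AccessibleItem validation)."""
    collection_exists(collection_id)  # second check
    return f"queued item for {collection_id}"


def publish_dataset(collection_id: str) -> str:
    """Stand-in for the workflows API side (Dataset validation)."""
    collection_exists(collection_id)  # first check
    return enqueue_ingestion(collection_id)


result = publish_dataset("example-collection")
print(result, len(lookups))
```

If the ingest API is always the final gate, dropping the first check would remove the duplicate lookup. Note, though, that the Dataset check also serves the "break the rules" id exception, so the two checks may not be interchangeable.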

TLDR

botanical commented 6 months ago

Validators

| veda-data-airflow | veda-backend | veda-stac-ingestor |
| --- | --- | --- |
| check_dates() - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schema_helpers.py#L42-L50 | check_dates() - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schema_helpers.py#L42-L50 | check_dates() - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schema_helpers.py#L48-L56 |
| check_extent() - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schema_helpers.py#L27-L39 | check_extent() - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schema_helpers.py#L28-L39 | check_extent() - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schema_helpers.py#L33-L45 |
| object_is_accessible() - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L156-L164 | no matching function | object_is_accessible() - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L266-L274 |
| check_time_density() on Dataset class - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L205-L208 | no matching function | check_time_density() on https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L72-L76 and check_time_density() on https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L308-L311 |
| check_sample_files - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L223-L262 | no matching function | check_sample_files - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L326-L365 |
| is_accessible on href - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L20-L34 | is_accessible on href - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schemas.py#L22-L36 | is_accessible on href - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L30-L44 |
| exists() collection for AccessibleItem - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L37-L43 | exists() collection for AccessibleItem - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schemas.py#L39-L45 | exists() collection for AccessibleItem - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L47-L53 |
| exists() for WorkflowInputBase - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L234-L252 | no matching function | exists() for WorkflowInputBase - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L234-L252 |
| check_id() for Dataset - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L181-L203 NOTE: Allows unconventional naming if collection already exists | no matching function | check_id() for Dataset - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L234-L306 NOTE: Does not allow non-lowercase names |
| only_one_discover_item on ZarrDataset - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L265-L281 | no matching function | only_one_discover_item on ZarrDataset - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L368-L384 |

cc @anayeaye @smohiudd