SciCatProject / scicat-backend-next

SciCat Data Catalogue Backend
https://scicatproject.github.io/documentation/
BSD 3-Clause "New" or "Revised" License
20 stars 24 forks source link

feat(jobs): ValidateAction #1421

Closed sbliven closed 3 weeks ago

sbliven commented 2 months ago

Description

Add ValidateAction to check Job DTO formatting

Motivation

When creating a job clients can provide an untyped jobParams object in the DTO, or a jobResultObject when updating. These were previously not validated.

Fixes:

This PR adds a new validate job action which can test for the presence of particular attributes in the DTO. Currently the attributes are specified using a JSONPath and are only verified to be non-null. If needed, future implementations could add JSONSchema support to do deeper validation.

For example, the following jobConfig.json section would configure the public job to require at least one datasetId:

{
      "jobType": "public",
      "create": {
        "auth": "#all",
        "actions": [
          {
            "actionType": "validate",
            "required": ["jobParams.datasetIds[0]"]
          }
        ]
      },
      "statusUpdate": {
        "auth": "#all"
      }
    }

Changes:

Tests included

Documentation

sbliven commented 2 months ago

What does everyone think of the use of JSONPath?

Pro: Simple syntax for simple reference, while keeping enough power for more complex structures like arrays Pro: Well-known standard (eg used by jq command) Con: Adds a dependency Con: Only checks element presence, not type

A more powerful option would be to validate against a JSON Schema. This would allow type checking. However I think it would be much more verbose.

sofyalaski commented 1 month ago

I think the JSONpath solution is appropriate for now. However, there is a minor issue in the jobs.controller: performJobCreateAction now accesses only the jobInstance of JobClass as an argument (this also needed to call the other function that performs the actions). performJobCreateAction should take two arguments, the second being of JobDto type so that we can call the validate function. Now in the controller validate is called on JobClass instance, which is not how validate is defined.

sbliven commented 1 month ago

Good catch, @sofyalaski. Why does this even compile?

sbliven commented 1 month ago

Ah, it compiles because JobClass is structurally compatible with CreateJobDto. Working on a fix now.

sbliven commented 1 month ago

This implementation has a couple problems.

First, the JSONPath solution doesn't perform any type checking. It would be nice to do something like

{
  "actionType": "validate",
  "required": {
    "jobParams.datasetIds": "object",
    "jobParams.datasetIds[*]": "string"
  }
}

This looks fine for simple types (string, number, list of literal values) but could quickly grow in complexity until it reaches a full type system. One solution would be to supply a JSON Schema to validate the DTO (which provides full type support). However including JSON schemas in the config file seems overly complex for operators. Is there a happy medium?

Second, the validate method currently acts against the DTO body. Half of the checks in v3 are actually against database entities referenced in the DTO or Job itself. For instance, "archive" jobs should check that all datasets associated with the job have "datasetLifecycle.archivable". Or "publish" jobs with file listings should check that all files exist, requiring checking associated DataBlocks from the dataset. Should these be added to a generic ValidateAction or be a more specific check? We are leaning towards handling those as special cases in the jobs controller.

sbliven commented 1 month ago

I changed how this works somewhat. Now the ValidateAction config looks like:

{
  "actionType": "validate",
  "request": {
    "JSONPath": JSONSchema,
    ...
  }
}

The schema gets validated by avj, so you theoretically could use the full power of a schema. However combining it with the path allows most use cases to be covered by short, easily readable one-liners. Examples:

{
  "actionType": "validate",
  "request": {
    "jobParams.name": { "type": "string"}, // match simple types
    "jobParams.answers[*]": { "enum": ["yes","no"]}, // literal values (here applied to an array)
    "jobResultObject": {"$ref": "https://json.schemastore.org/schema-org-thing.json"}, // JSON Schema
  }
}

I spelled the top-level request to because it gets applied to the DTO (either the create or update DTO). If it needed to get applied somewhere else (eg to associated datasets) we can add that in the future.

despadam commented 3 weeks ago

Agreed to merge https://github.com/SciCatProject/scicat-backend-next/pull/1440, then remove checkDatasetState from the controller and incorporate its logic into the validate action, which will allow for configuring jobType - datasetLifecycle combinations and deleting the hardcoded job-types.enum. These changes will come in a new PR which will also incorporate this: https://github.com/SciCatProject/scicat-backend-next/issues/1120