bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

refactor: move validate to job validate #4074

Closed · frrist closed this 3 days ago

frrist commented 2 weeks ago

What

The initial intention of this PR was simply to move the validate command under job, so that job validate became the new command. This new command is meant to validate a v2 (models) job spec. However, the PR grew slightly in scope, since we've never actually done real validation and I didn't want to just move a broken command.

The previous version of this command generated a job spec from the job model:

- Generate: https://github.com/bacalhau-project/bacalhau/blob/2e0c32949bba96bce07bcf50df14328ac1900463/cmd/cli/validate/validate.go#L165
- Job model: https://github.com/bacalhau-project/bacalhau/blob/2e0c32949bba96bce07bcf50df14328ac1900463/pkg/model/job.go#L191-L239

However, since the previous job model didn't use the json_schema tags for generating a spec, and since all fields carried omitempty tags, there wasn't really any validation happening: nearly all fields were incorrectly considered optional. For example, in v1.3.2 bacalhau validate considers the below job file valid:

```yaml
APIVersion: v1beta1
Spec:
  Engine: docker
  Deal:
    Concurrency: 1
```
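
To illustrate the mechanism with a simplified, hypothetical version of the old model (the real struct has many more fields and richer types), every field carried omitempty, so a schema generated from these tags ends up requiring nothing:

```go
// Hypothetical, trimmed-down sketch of the old v1 job model. Because every
// field is tagged omitempty, a schema generated from these tags marks every
// field as optional, so the nearly empty spec above still validates.
type Spec struct {
	Engine    string `json:"Engine,omitempty"`
	Publisher string `json:"Publisher,omitempty"`
	Resources string `json:"Resources,omitempty"`
	// ...all remaining fields likewise omitempty, hence never required.
}
```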

The Change

In this PR I have implemented a complete json schema describing our new job model, which lives in schemas/job/schemas. This schema is embedded in the bacalhau binary, so the application can validate specs without any external files. The motivation here is to have a validate command that actually performs validation. The schema aligns with the existing validation logic we have defined in Go code; ideally we will replace the manual Go validation code with the requirements encoded in this schema at some point in the future.
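
As a rough sketch of how this fits together - assuming go:embed and the xeipuuv/gojsonschema library (whose message format matches the errors shown further down); the file name and helper are illustrative, not the exact code in this PR:

```go
package validate

import (
	_ "embed"
	"fmt"
	"strings"

	"github.com/xeipuuv/gojsonschema"
)

// The schema file is compiled into the binary, so `job validate`
// needs no external files at runtime. The file name is illustrative.
//
//go:embed job-schemas.json
var jobSchemaJSON []byte

// ValidateJobJSON is a hypothetical helper: it checks a user-submitted job
// spec (already converted from YAML to JSON) against the embedded schema
// and reports one line per violation.
func ValidateJobJSON(jobJSON []byte) error {
	result, err := gojsonschema.Validate(
		gojsonschema.NewBytesLoader(jobSchemaJSON),
		gojsonschema.NewBytesLoader(jobJSON),
	)
	if err != nil {
		return fmt.Errorf("running schema validation: %w", err)
	}
	if result.Valid() {
		return nil
	}
	var sb strings.Builder
	sb.WriteString("The Job is not valid. See errors:\n")
	for _, e := range result.Errors() {
		fmt.Fprintf(&sb, "- %s: %s\n", e.Field(), e.Description())
	}
	return fmt.Errorf("%s", sb.String())
}
```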

Caveats

  1. The schema requires all yaml/json keys to be capitalized.
  2. The schema requires (almost) all values (like docker, wasm, local, s3, ipfs, etc.) to align with the definitions in the code, which are typically lower case.
  3. The schema will need adjustment/further development if/when we introduce pluggable executors/storages/publishers, since their schemas are not represented.
  4. The schema only validates user-submitted job specs. The system representation of a job spec contains fields that users cannot set; we validate that those fields are not present, but we cannot reuse this schema to validate the system representation of a job. Some prior discussion around system vs. submission job specs is here: https://github.com/bacalhau-project/bacalhau/issues/3983

Since this is an improvement over the prior behavior, this feels acceptable. And if users take issue with it, we can do some pre-processing on their behalf.
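
For a sense of shape, a fragment like the following (illustrative only, not the actual contents of schemas/job/schemas; the exact required set and enums are assumptions) shows how the caveats above surface: capitalized property names, lower-case enum values, and additionalProperties: false, which is what rejects unknown keys:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Job",
  "type": "object",
  "additionalProperties": false,
  "required": ["Name", "Type", "Tasks"],
  "properties": {
    "Name":      { "type": "string" },
    "Namespace": { "type": "string" },
    "Type":      { "type": "string", "enum": ["batch", "ops", "daemon", "service"] },
    "Count":     { "type": "integer", "minimum": 1 },
    "Tasks":     { "type": "array", "minItems": 1 }
  }
}
```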

coderabbitai[bot] commented 2 weeks ago

[!IMPORTANT]

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the `.coderabbit.yaml` file in this repository. To trigger a single review, invoke the `@coderabbitai review` command.

You can disable this status message by setting `reviews.review_status` to `false` in the CodeRabbit configuration file.


frrist commented 2 weeks ago

> I think you've missed including the job-schemas.json file.

ooOooo great catch, thanks - this explains a lot: we had an overzealous ignore directive (https://github.com/bacalhau-project/bacalhau/blob/main/.gitignore#L49). I have addressed this in the latest commit and added a test to ensure the validate command continues to work.
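
For illustration only (the real line at .gitignore#L49 may differ), the failure mode was of this shape:

```gitignore
# a broad pattern like this silently swallows the generated schema file:
*.json

# hypothetical fix: an explicit exception so the schema gets committed
!job-schemas.json
```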

> The schema validation is enforcing the executors, publishers and storage sources we work with.

Correct, the current schema validates the engines, storages, and publishers we currently support.

> We are no longer pluggable or modular, where we allow different networks to deploy their own engines.

We never were pluggable in the first place. At present we don't support pluggable engine, storage, or publisher types. In fact, we return a 500 error if someone tries to use an engine that we don't support, for example (job spec):

```
> bacalhau job run testdata/jobs/custom-task-type.yaml
Error: failed request: Unexpected response code: 500 ({
  "error": "failed to translate job type: flibble: unknown task type identified in translation: 'flibble'",
  "message": "Internal Server Error"
})
```

> We've discussed this before, as having this level of flexibility will limit the type of client side validation we can do. There might be ways to download schemas at runtime from the requester, but that's a bigger topic and needs evaluation of different user scenarios to see if it makes sense, as it will force the requester to know more about the engines, which it doesn't have to today.

Agreed, validation of pluggable types is a broader topic. And since we don't support pluggable types now, I agree we should flesh this out at a later point.

> Our schema validation is more than just checking field types and whether they exist or not. In Task validation for example we check that input source aliases are unique.

Agreed, and this brings up a good point - our current validation is somewhat opaque to users. They must read through the Go code to gather an understanding of what a valid job spec is. A benefit of adding schemas is that we have defined the rules of what's valid and what isn't in a canonical/standard manner.
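
For context, the alias-uniqueness rule quoted above is exactly the kind of semantic check that lives in Go code rather than in a JSON schema, which mostly covers shapes and types. A minimal sketch of such a check (types and names are stand-ins, not the real models package):

```go
package main

import "fmt"

// Stand-ins for the real models types; illustrative only.
type InputSource struct{ Alias string }
type Task struct{ Inputs []InputSource }

// validateUniqueInputAliases enforces a semantic rule that a JSON schema
// cannot easily express: input source aliases must be unique within a task.
func validateUniqueInputAliases(t Task) error {
	seen := make(map[string]bool)
	for _, in := range t.Inputs {
		if in.Alias == "" {
			continue // unnamed inputs cannot collide
		}
		if seen[in.Alias] {
			return fmt.Errorf("input source alias %q is not unique", in.Alias)
		}
		seen[in.Alias] = true
	}
	return nil
}

func main() {
	t := Task{Inputs: []InputSource{{Alias: "data"}, {Alias: "data"}}}
	fmt.Println(validateUniqueInputAliases(t)) // alias "data" is not unique
}
```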

> We already have a job.ValidateSubmission method that is used to validate jobs on the client side, server side, and also on the backend side for additional validation before writing to and reading from the datastore.

Indeed. And to be clear, I think this method still adds value and is something we should continue to use. But where it falls short is in validating that unsupported fields weren't included in the job spec; for example, consider Kind, ThisIsntAllowed, and Selector below:

```
> bacalhau job run
Reading from /dev/stdin; send Ctrl-d to stop.
Namespace: default
Count: 1
Type: batch
Kind: whatever
ThisIsntAllowed: but it is
Selector: fobar
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest

Job successfully submitted. Job ID: j-b54ed6c2-548c-41aa-8cce-c4a135acf51a
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

        Communicating with the network  ................  done ✅  0.0s
           Creating job for submission  ................  done ✅  0.5s

To get more details about the run, execute:
        ./bin/linux/amd64/bacalhau job describe j-b54ed6c2-548c-41aa-8cce-c4a135acf51a

To get more details about the run executions, execute:
        ./bin/linux/amd64/bacalhau job executions j-b54ed6c2-548c-41aa-8cce-c4a135acf51a
```

Comparatively, when the same job is provided to job validate, the user is given clear error messages:

```
> bacalhau job validate bad_job.yaml
The Job is not valid. See errors:
- (root): Name is required
- (root): Additional property Kind is not allowed
- (root): Additional property Selector is not allowed
- (root): Additional property ThisIsntAllowed is not allowed
```

> Why couldn't you just use job.ValidateSubmission in job validate instead of implementing json validation now?

We can, and will make the change if you feel strongly about it. I didn't initially because the original validate command used a json schema for validation (as outlined in the PR description) and I intended to adopt a similar pattern for consistency.

> Can you share examples of validation error messages to understand the user experience and how helpful these error messages are?

Yup, see the above example.

> My overall feedback is to use the existing job.ValidateSubmission to unblock the API/CLI migration, and avoid adding major changes a week before the launch. We can have a better discussion on the right thing to do when we have more room.

Is my understanding correct that you are blocking this PR until the schema is replaced with job.ValidateSubmission? Or is there something else that is blocking?

@aronchick can you comment from the product side if this (a json schema of our job spec) is something we want to support? If we feel the schema isn't adding value I will revert this change and use job.ValidateSubmission

aronchick commented 2 weeks ago

No - for SURE we want a json schema of our job spec. In order to be a great platform, we need people external to us to be submitting jobs. Offering a hosted job schema that is the exact same one we use internally is a very big positive.

aronchick commented 2 weeks ago

I understand about not wanting to cause big changes the week before we launch, but for better or worse, this looks like the right thing to me. @wdbaruni can you comment on what your reservation is here?

frrist commented 2 weeks ago

And just so it's clear, this schema-based validation is only used by the code path of the job validate command and not anywhere else.
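
Concretely, the wiring is roughly the following sketch (command and helper names are illustrative, not the actual code in this PR; it reuses the hypothetical ValidateJobJSON helper sketched above):

```go
package job

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
	"sigs.k8s.io/yaml"
)

// NewValidateCmd shows how the schema check could hang off `job validate`
// alone: no other command or server path invokes it.
func NewValidateCmd() *cobra.Command {
	return &cobra.Command{
		Use:   "validate [job-file]",
		Short: "Validate a job spec against the embedded JSON schema",
		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			data, err := os.ReadFile(args[0])
			if err != nil {
				return err
			}
			// Job specs are written in YAML; convert to JSON for the validator.
			jsonData, err := yaml.YAMLToJSON(data)
			if err != nil {
				return fmt.Errorf("parsing job spec: %w", err)
			}
			// ValidateJobJSON is the helper sketched earlier, assumed in scope.
			if err := ValidateJobJSON(jsonData); err != nil {
				return err
			}
			cmd.Println("The Job is valid")
			return nil
		},
	}
}
```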

wdbaruni commented 1 week ago

> We never were pluggable in the first place. At present we don't support pluggable engine, storage, or publisher types. In fact, we return a 500 error if someone tries to use an engine that we don't support, for example (job spec):
>
>     bacalhau job run testdata/jobs/custom-task-type.yaml
>     Error: failed request: Unexpected response code: 500 ({
>       "error": "failed to translate job type: flibble: unknown task type identified in translation: 'flibble'",
>       "message": "Internal Server Error"
>     })

This looks like a recently introduced bug related to bacalhau exec translations. Still, I understand we are not pluggable today, but that is our goal, and we are modular today in that we are decoupling the engines from the clients and orchestration. That is why I have been referring to pluggable and modular together.

> Agreed, validation of pluggable types is a broader topic. And since we don't support pluggable types now, I agree we should flesh this out at a later point.

Exactly. So let's not add client side validations of our engines today until we flesh this out. We are modular today: we can introduce new publishers, executors, or storage engines as new modules in any release. We don't force users to update their client today, and let's not do that until we flesh things out.

> Agreed, and this brings up a good point - our current validation is somewhat opaque to users. They must read through the Go code to gather an understanding of what a valid job spec is. A benefit of adding schemas is that we have defined the rules of what's valid and what isn't in a canonical/standard manner.

Which users do you think will go through json schemas to know how to submit jobs? If you are referring to normal users, then our documentation should be their only source of information for working with Bacalhau, and I'll feel very sorry for them if they have to go through the json schema definition. If you are referring to Bacalhau developers, then all our validation is located in one ValidateSubmission method under the Job model, and they are more familiar with Go code than with walking through a schema definition. If you are referring to integration developers, then is this the right time to think about that, and what did we do to make sure this is the right solution for them? Let's not add more complexity by assuming what users want until we know better.

> Comparatively, when the same job is provided to job validate, the user is given clear error messages:
>
>     bacalhau job validate bad_job.yaml
>     The Job is not valid. See errors:
>     - (root): Name is required
>     - (root): Additional property Kind is not allowed
>     - (root): Additional property Selector is not allowed
>     - (root): Additional property ThisIsntAllowed is not allowed

Didn't you add a similar experience in https://github.com/bacalhau-project/bacalhau/pull/4103 by providing a custom unmarshaller?

> We can, and will make the change if you feel strongly about it. I didn't initially because the original validate command used a json schema for validation (as outlined in the PR description) and I intended to adopt a similar pattern for consistency.

I mean that a job can now pass job validate but fail job run --dry-run validation, and vice versa, because we are using different paths and logic for validation. For example, job validate rejects unknown fields, while job run --dry-run performs additional validations such as uniqueness of input aliases and count constraints based on job type, and ignores unknown fields. I wouldn't mind job validate having additional validations and job run being more graceful, but that is not the case here.

wdbaruni commented 1 week ago

> And just so it's clear, this schema-based validation is only used by the code path of the job validate command and not anywhere else.

And that makes things even worse, as this whole complexity is just for a command that is barely used and follows a different validation path than the server side or job run. But I don't think the intention is to use this schema validation only in job validate.

wdbaruni commented 1 week ago

> No - for SURE we want a json schema of our job spec. In order to be a great platform, we need people external to us to be submitting jobs. Offering a hosted job schema that is the exact same one we use internally is a very big positive. I understand about not wanting to cause big changes the week before we launch, but for better or worse, this looks like the right thing to me. @wdbaruni can you comment on what your reservation is here?

This is a general statement, Dave. We need to understand what problem we want to solve or improve, and what is the right time and way to deliver it. It seems from the answers above that the only value is external integrations, and this is not the right time to solve that or to evaluate the right way to achieve it.

My reservation is that anything we introduce is a cost that we have to maintain, and I am not sure why we keep introducing things without any immediate value or user demand. This change adds the following complexities for us:

  1. Forking the validation path: it seems we can't do all validations against the schema, so we will continue to maintain both paths.
  2. As a developer, I'll need to maintain both paths, it won't be clear where to encode new validations, and we will need to make sure the json schema is always in sync.
  3. We are not really improving the user experience, as validate can pass while run can fail.
  4. We are restricting the engines that we support and forcing the user to update the client whenever any engine is introduced, which goes against our goal of being agnostic to the engines on the compute side.
  5. We just introduced json schemas for the job spec, but for integration purposes we need them for all of our APIs. We can't keep adding and maintaining these schemas manually; we need to look for ways to automate their creation from our Go types (one possible direction is sketched after this list).
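
On point 5, one possible direction is a sketch using the third-party github.com/invopop/jsonschema reflection package; this is not something the project does today, and the struct is a stand-in for models.Job:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/invopop/jsonschema"
)

// Job is a minimal stand-in for the real models.Job type.
type Job struct {
	Name  string `json:"Name"`
	Type  string `json:"Type" jsonschema:"enum=batch,enum=ops,enum=daemon,enum=service"`
	Count int    `json:"Count,omitempty" jsonschema:"minimum=1"`
}

func main() {
	// Reflect the schema straight from the Go struct: fields without
	// omitempty become required, so the schema cannot drift out of sync
	// with the type definition.
	schema := jsonschema.Reflect(&Job{})
	out, _ := json.MarshalIndent(schema, "", "  ")
	fmt.Println(string(out))
}
```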

All of these complexities are worth owning and maintaining when the problems they address are worth solving and there is demand for the value they add, but that is not today. I would also like to explore auto-generating json schemas from our structs instead of encoding them manually, but only when there is demand for them.

We already spent some time implementing CUE-lang schema validations that didn't make it through for some reason; now we are suggesting json schema instead, with no apparent reason and no evaluation of alternatives. These types of distractions are what concern me. We are less than a week away from 1.4, the features are not all done yet, and we shouldn't add changes last minute. In fact, we should've already frozen 1.4, but we can't as we are behind.

frrist commented 1 week ago

> Didn't you add a similar experience in https://github.com/bacalhau-project/bacalhau/pull/4103 by providing a custom unmarshaller?

Yup, #4103 is the fallback plan, with this work remaining in draft. I will open a new PR that uses that method instead.