I think you missed including the `job-schemas.json` file.
ooOooo great catch, thanks - this explains a lot (we had an overzealous ignore directive) https://github.com/bacalhau-project/bacalhau/blob/main/.gitignore#L49. I have addressed this in the latest commit and added a test to ensure the validate command continues to work.
The schema validation enforces the set of executors, publishers, and storage sources we work with.
Correct, the current schema validates the engines, storages, and publishers we currently support.
We are no longer pluggable or modular in a way that allows different networks to deploy their own engines.
We never were pluggable in the first place. At present we don't support pluggable engine, storage, or publisher types. In fact we return a 500 error if someone tries to use an engine we don't support, for example (job spec):
```
> bacalhau job run testdata/jobs/custom-task-type.yaml
Error: failed request: Unexpected response code: 500 ({
    "error": "failed to translate job type: flibble: unknown task type identified in translation: 'flibble'",
    "message": "Internal Server Error"
})
```
We've discussed this before: having this level of flexibility will limit the kind of client-side validation we can do. There might be ways to download schemas at runtime from the requester, but that's a bigger topic and requires evaluating different user scenarios to see if it makes sense, as it would force the requester to know more about the engines, which it doesn't have to today.
Agreed, validation of pluggable types is a broader topic. And since we don't support pluggable types now, I agree we should flesh this out at a later point.
Our schema validation is more than just checking field types and whether they exist or not. In Task validation for example we check that input source aliases are unique.
Agreed, and this brings up a good point: our current validation is somewhat opaque to users. They must read through the Go code to gather an understanding of what a valid job spec is. A benefit of adding schemas is that we have defined the rules of what's valid and what isn't in a canonical/standard manner.
We already have a `job.ValidateSubmission` method that is used to validate jobs on the client side, the server side, and also on the backend side for additional validation before writing to and reading from the datastore.
Indeed. And to be clear, I think this method still adds value and is something we should continue to use. But where it falls short is validating that unsupported fields weren't included in the job spec; for example, consider `Kind`, `ThisIsntAllowed`, and `Selector` in:
```
> bacalhau job run
Reading from /dev/stdin; send Ctrl-d to stop.
Namespace: default
Count: 1
Type: batch
Kind: whatever
ThisIsntAllowed: but it is
Selector: fobar
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
Job successfully submitted. Job ID: j-b54ed6c2-548c-41aa-8cce-c4a135acf51a
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):
 Communicating with the network ................ done ✅ 0.0s
    Creating job for submission ................ done ✅ 0.5s

To get more details about the run, execute:
	./bin/linux/amd64/bacalhau job describe j-b54ed6c2-548c-41aa-8cce-c4a135acf51a

To get more details about the run executions, execute:
	./bin/linux/amd64/bacalhau job executions j-b54ed6c2-548c-41aa-8cce-c4a135acf51a
```
Comparatively, when the same job is provided to `job validate` the user is given clear error messages:
```
> bacalhau job validate bad_job.yaml
The Job is not valid. See errors:
- (root): Name is required
- (root): Additional property Kind is not allowed
- (root): Additional property Selector is not allowed
- (root): Additional property ThisIsntAllowed is not allowed
```
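Errors like these fall out of standard JSON Schema keywords. A minimal sketch of what such a schema fragment might look like (illustrative only; the real schema lives in `schemas/job/schemas` and will differ in its details):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["Name", "Type", "Tasks"],
  "additionalProperties": false,
  "properties": {
    "Name": { "type": "string" },
    "Type": { "enum": ["batch", "ops", "daemon", "service"] },
    "Count": { "type": "integer", "minimum": 1 },
    "Namespace": { "type": "string" },
    "Tasks": { "type": "array", "minItems": 1 }
  }
}
```

The `required` keyword produces the "Name is required" class of errors, and `additionalProperties: false` produces the "Additional property ... is not allowed" class.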
Why couldn't you just use `job.ValidateSubmission` in `job validate` instead of implementing JSON schema validation now?
We can, and will make the change if you feel strongly about it. I didn't initially because the original validate command used a json schema for validation (as outlined in the PR description) and I intended to adopt a similar pattern for consistency.
Can you share examples of validation error messages to understand the user experience and how helpful these error messages are?
Yup, see the above example.
My overall feedback is to use the existing job.ValidateSubmission to unblock the API/CLI migration, and avoid adding major changes a week before the launch. We can have better discussion on the right thing to do when we have more room.
Is my understanding correct that you are blocking this PR until the schema is replaced with `job.ValidateSubmission`? Or is there something else that is blocking?
@aronchick can you comment from the product side if this (a JSON schema of our job spec) is something we want to support? If we feel the schema isn't adding value I will revert this change and use `job.ValidateSubmission`.
No - for SURE we want a JSON schema of our job spec. In order to be a great platform, we need people external to us to be submitting jobs. Offering a hosted job schema that is the exact same one we use internally is a very big positive.
I understand about not wanting to cause big changes the week before we launch, but for better or worse, this looks like the right thing to me. @wdbaruni can you comment on what your reservation is here?
And just so it's clear, this schema-based validation is only used by the code path of the `job validate` command and not anywhere else.
> We never were pluggable in the first place. At present we don't support pluggable engine, storage, or publisher types. In fact we return a 500 error if someone tries to use an engine we don't support, for example (job spec):
>
> `bacalhau job run testdata/jobs/custom-task-type.yaml` → `Error: failed request: Unexpected response code: 500 ({ "error": "failed to translate job type: flibble: unknown task type identified in translation: 'flibble'", "message": "Internal Server Error" })`
This looks like a recently introduced bug related to `bacalhau exec` translations. Still, I understand we are not pluggable today, but that is our goal, and we are modular today in that we are decoupling the engines from the clients and orchestration. That is why I have been referring to pluggable and modular together.
> Agreed, validation of pluggable types is a broader topic. And since we don't support pluggable types now, I agree we should flesh this out at a later point.
Exactly. So let's not add client-side validation of our engines today until we flesh this out. We are modular today: we can introduce new publishers, executors, or storage engines as new modules in any release. We don't force users to update their client today, and let's not do that until we flesh things out.
> Agreed, and this brings up a good point: our current validation is somewhat opaque to users. They must read through the Go code to gather an understanding of what a valid job spec is. A benefit of adding schemas is that we have defined the rules of what's valid and what isn't in a canonical/standard manner.
Which users do you think will go through JSON schemas to learn how to submit jobs? If you are referring to normal users, then our documentation should be their only source of information for working with Bacalhau, and I'll feel very sorry for them if they have to go through the JSON schema definition. If you are referring to Bacalhau developers, then all our validation is located in one `ValidateSubmission` method under the `Job` model, and they are more familiar with Go code than with walking through a schema definition. If you are referring to integration developers, then is this the right time to think about that, and what have we done to make sure this is the right solution for them? Let's not add more complexity by assuming what users want until we know better.
> Comparatively, when the same job is provided to `job validate` the user is given clear error messages:
>
> ```
> bacalhau job validate bad_job.yaml
> The Job is not valid. See errors:
> - (root): Name is required
> - (root): Additional property Kind is not allowed
> - (root): Additional property Selector is not allowed
> - (root): Additional property ThisIsntAllowed is not allowed
> ```
Didn't you add a similar experience in https://github.com/bacalhau-project/bacalhau/pull/4103 by providing a custom unmarshaller?
> We can, and will make the change if you feel strongly about it. I didn't initially because the original validate command used a json schema for validation (as outlined in the PR description) and I intended to adopt a similar pattern for consistency.
I mean that now a job can pass `job validate` but fail `job run --dry-run` validation, and vice versa, because we are using different paths and logic for validation. For example, `job validate` fails on unknown fields, while `job run --dry-run` does additional validations, such as uniqueness of input aliases and count based on job type, and ignores unknown fields. I wouldn't mind `job validate` having additional validations and `job run` being more graceful, but that is not the case here.
> And just so it's clear, this schema-based validation is only used by the code path of the `job validate` command and not anywhere else.
And that makes things even worse, as this whole complexity is just for a command that is barely used and follows a different validation path than the server side or `job run`. But I don't think the intention is to use this schema validation only in `job validate`.
> No - for SURE we want a JSON schema of our job spec. In order to be a great platform, we need people external to us to be submitting jobs. Offering a hosted job schema that is the exact same one we use internally is a very big positive. I understand about not wanting to cause big changes the week before we launch, but for better or worse, this looks like the right thing to me. @wdbaruni can you comment on what your reservation is here?
This is a general statement, Dave. We need to understand what problem we want to solve or improve, and what the right time and way to deliver it is. It seems from the answers above that the only value is external integrations, and this is not the right time to solve that or evaluate the right way to achieve it.
My reservation is that anything we introduce is a cost that we have to maintain, and I am not sure why we keep introducing things without any immediate value or user demand. It adds these complexities for us:

- `validate` can pass while `run` can fail

All of these complexities are worth owning and maintaining when they are worth solving and there is demand for the value they add, but that is not today. I would also like to explore auto-generation of JSON schemas from our structs instead of encoding them manually, but only when there is demand for them.
We have already spent time implementing CUE lang schema validations before, which didn't make it through for some reason; now we are suggesting JSON schema instead with no apparent reason and no evaluation of alternatives. These types of distractions are what concern me. We are less than a week away from 1.4, the features are not all done yet, and we shouldn't add changes at the last minute. In fact we should have already frozen 1.4, but we can't because we are behind.
> Didn't you add a similar experience in https://github.com/bacalhau-project/bacalhau/pull/4103 by providing a custom unmarshaller?
Yup, #4103 is the fallback plan should this work remain in draft. I will open a new PR that uses that method instead.
What

The initial intention of this PR was to simply move the `validate` command under `job`, s.t. `job validate` became the new one. This new command is meant to validate a v2 job spec/models job spec. However, it grew slightly in scope since we've never actually done real validation and I didn't want to just move a broken command.

The previous version of this command generated a job spec from the job model:
- Generate: https://github.com/bacalhau-project/bacalhau/blob/2e0c32949bba96bce07bcf50df14328ac1900463/cmd/cli/validate/validate.go#L165
- Job Model: https://github.com/bacalhau-project/bacalhau/blob/2e0c32949bba96bce07bcf50df14328ac1900463/pkg/model/job.go#L191-L239
However, since the previous job model didn't use the `json_schema` tags for generating a spec, and since all fields contained `omitempty` tags, there wasn't really any validation happening: nearly all fields were incorrectly considered optional. E.g., in v1.3.2 `bacalhau validate` considers the below job file valid.

The Change
In this PR I have implemented a complete JSON schema describing our new job model, which lives in `schemas/job/schemas`. This schema is then embedded in the bacalhau application when it runs. The motivation here is to have a `validate` command that actually performs validation. This schema aligns with the existing validation logic we have defined in Go code. Ideally we will replace the manual Go validation code with the requirements of this schema at some point in the future.

Caveats
Type values in a job spec (`docker`, `wasm`, `local`, `s3`, `ipfs`, etc.) must align with the definitions in the code, which are typically lower case. Since this is an improvement over the prior behavior this feels acceptable, and if users take issue with it we can do some pre-processing on their behalf.