Closed frrist closed 4 months ago
Thanks, Forrest, for your deep dive, though I am not in agreement with most of the issues and concerns you've raised.
First, there is precedent for what we are doing and how we filter system-defined fields out of job submissions. Nomad's job spec is very close to ours, in that they interleave system-defined fields in the job spec. Similarly, Kubernetes allows users to define `metadata` and `spec` fields when submitting a job, but also returns a `status` field when querying the job, and it adds its own metadata; it is standard practice for CRDs to inject additional metadata into the same top-level `metadata` field. While not recommended, users can submit jobs with a `status` field, and Kubernetes will just ignore it.
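To make the pattern concrete, here is a minimal sketch of the Kubernetes-style envelope described above: user-settable `Metadata`/`Spec` plus a server-owned `Status` section that is silently dropped on submission. The type and field names are illustrative, not Bacalhau's actual models.

```go
package main

import "fmt"

// Hypothetical job envelope in the Kubernetes style: user-settable
// Metadata/Spec plus a server-owned Status section.
type Job struct {
	Metadata map[string]string
	Spec     string
	Status   string
}

// sanitizeSubmission drops any server-owned fields the user supplied,
// mirroring how Kubernetes silently ignores a client-set status.
func sanitizeSubmission(j Job) Job {
	j.Status = "" // server-owned; recomputed by the system after admission
	return j
}

func main() {
	submitted := Job{
		Metadata: map[string]string{"name": "my-job"},
		Spec:     "run: echo hello",
		Status:   "Succeeded", // user-set status is simply ignored
	}
	accepted := sanitizeSubmission(submitted)
	fmt.Println(accepted.Status == "") // prints "true"
}
```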
Second, we need to think about our API design. Nomad, Kubernetes and our V2 APIs are RESTful APIs with a unified endpoint for interacting with jobs/pods, where users submit jobs using the `PUT` method and query using `GET`. Based on your proposal, there would be separate endpoints for submitting and others for querying.
As for your specific concerns:
> There are several different methods on the job type that must be called to validate and sanitize it. Some of these methods must only be called on the client side, while others must only be called on the server side. This is poor DevEx.
I don't understand why this is poor DevEx. Again, this is what both Kubernetes and Nomad do, and I don't see the problem. Besides, we allow the requester to set default fields if they are missing from the job, and we benefit from having different validation layers: more relaxed checks at submission, and stricter validation after defaults are populated.
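The two-layer validation described above can be sketched as a relaxed submission check, a defaulting step, then a strict final check. The `Submission` type, field names, and the 1800-second default are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical submission with an optional field the server can default.
type Submission struct {
	Name    string
	Timeout int // seconds; 0 means "not set"
}

// validateSubmission is the relaxed, client-facing check: required
// fields only, missing defaults are fine.
func validateSubmission(s Submission) error {
	if s.Name == "" {
		return errors.New("name is required")
	}
	return nil
}

// applyDefaults fills in server-chosen values (1800s is illustrative,
// not Bacalhau's real default).
func applyDefaults(s Submission) Submission {
	if s.Timeout == 0 {
		s.Timeout = 1800
	}
	return s
}

// validateFinal is the strict, server-side check run after defaulting.
func validateFinal(s Submission) error {
	if s.Timeout <= 0 {
		return errors.New("timeout must be positive")
	}
	return validateSubmission(s)
}

func main() {
	s := Submission{Name: "my-job"} // no timeout set; passes the relaxed check
	if err := validateSubmission(s); err != nil {
		panic(err)
	}
	s = applyDefaults(s)
	if err := validateFinal(s); err != nil {
		panic(err)
	}
	fmt.Println(s.Timeout) // prints 1800
}
```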
> The meta field of a job has various "reserved" keys that cannot be set by users, if they are we delete them: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/models/job.go#L307. This is both poor DevEx and UX.
Again, there is precedent for this in Kubernetes, and it is the main way for CRDs to inject their own state and metadata. I would prefer this to having multiple places to define metadata for the same job.
> We only disallow user submissions with the JobID set if the job type is batch or ops. system and daemon jobs may provide their own jobID on submission: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/models/job.go#L295-L300.
Yeah, we can remove this logic, as it can be confusing. The original intention was to allow users to provide their own job names and be able to update long-running jobs, but not batch and ops jobs. Job updates are a bigger discussion and we have other issues tracking this, including https://github.com/bacalhau-project/bacalhau/issues/3871 and https://github.com/bacalhau-project/bacalhau/issues/3864. Most likely we will allow users to update batch and ops jobs as well.
> We inefficiently store the entire job structure, and its tasks(!!), repeatedly to disk when only modifying a small portion of the structure: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/jobstore/boltdb/store.go#L848-L852. The State of a job doesn't need to be in the same structure as the users submission. It would be significantly more efficient to store the user submission once and make updates to system state representation separately.
You need to understand the access pattern of our jobs and APIs before concluding whether we are doing things inefficiently. We almost never retrieve the job spec without its state. Previously, when the job spec and state were separate objects, we always had to make two queries to the database to fetch both. It is true that we are doing heavier writes, but our access pattern is heavier on the read side. Still, we could benefit from a model closer to Kubernetes, with top-level fields for `metadata`, `spec` and `state`. This would allow job store implementations to optimize their storage for partial updates.
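The partial-update optimization hinted at above can be sketched with a toy key/value store: write the large spec once under one key, and keep the small, frequently changing state under another, so state transitions never rewrite the spec. The key layout is hypothetical, not the boltdb schema Bacalhau uses today.

```go
package main

import "fmt"

// A toy key/value job store illustrating partial updates: the large,
// write-once spec and the small, frequently updated state live under
// separate keys.
type store struct{ kv map[string]string }

func (s *store) putSpec(id, spec string)   { s.kv["spec/"+id] = spec }
func (s *store) putState(id, state string) { s.kv["state/"+id] = state }

// getJob shows the common read path: one logical fetch returning both.
func (s *store) getJob(id string) (spec, state string) {
	return s.kv["spec/"+id], s.kv["state/"+id]
}

func main() {
	db := &store{kv: map[string]string{}}
	db.putSpec("job-1", "a large spec with many tasks")
	db.putState("job-1", "Pending")

	// A state transition rewrites only the small state record.
	db.putState("job-1", "Running")

	spec, state := db.getJob("job-1")
	fmt.Println(spec, "|", state) // prints "a large spec with many tasks | Running"
}
```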
> Whenever the time comes to make changes to the system state representation of the job or task (e.g. add a new field, remove a field, modify a field's type, etc.) we will need to modify everything from the client to the server to accommodate it. Oh, and let's not forget the various SDKs!
I don't understand this. If we add a new piece of system state, we will need to update the SDKs regardless to be able to display that field. This is true whether those system fields are interleaved in the job model or live under a top-level `state` field.
> Running `bacalhau job describe` may return a job spec that is different from what the users actually submitted. For example, we modify the count field depending on the job type a user submitted: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/models/job.go#L166
We allow requester nodes to set job defaults if they are missing, such as timeout, job resources and publisher. I don't see this as a problem. I am more interested in understanding how the system handled my job and what fields it assumed when running it than in making sure `bacalhau job describe` returns exactly what I submitted. Also, Kubernetes and Nomad do inject system information into the submitted job, which you will see when you `get` the job.
> Running `bacalhau job describe --output yaml | bacalhau job run` and getting several warning messages back about fields being ignored is confusing. As a user I would expect this to "just work" without warning.
First, Kubernetes ignores those fields without a warning; we can remove the warnings in the future if they are painful. Second, this model of re-submitting jobs may not be common or useful anyway. Third, users differ, and we need to collect more feedback before making drastic changes to our APIs. Saying that, as a user, you would expect this to "just work" without warning is a big assumption.
> Users must carry the cognitive overhead of knowing which fields they can set and which they can't when submitting jobs. This is called out plainly in our documentation: https://expanso.gitbook.io/bacalhau-docs/setting-up/jobs/job#server-generated-parameters.
I think you are overcomplicating this. Having reserved prefixes for metadata doesn't sound like something painful for users to work with, especially as we follow a pattern where we prefix all reserved metadata with `bacalhau.org/`.
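A reserved-prefix check is a one-liner, which is part of why it carries little cognitive overhead. The sketch below assumes the `bacalhau.org/` prefix mentioned above; whether every reserved key in Bacalhau actually carries this prefix is an assumption, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// reservedPrefix follows the pattern described in the discussion.
const reservedPrefix = "bacalhau.org/"

// isReservedMetaKey reports whether a metadata key is system-owned.
func isReservedMetaKey(key string) bool {
	return strings.HasPrefix(key, reservedPrefix)
}

// stripReserved drops system-owned keys from user-submitted metadata.
func stripReserved(meta map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range meta {
		if !isReservedMetaKey(k) {
			out[k] = v
		}
	}
	return out
}

func main() {
	meta := stripReserved(map[string]string{
		"team":                   "data",
		"bacalhau.org/requester": "node-1", // system-owned, dropped
	})
	fmt.Println(len(meta)) // prints 1
}
```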
My overall feedback is that I am sure our APIs will go through another iteration at some point, but let's do that after we have more users and have collected more feedback to make informed decisions. The current APIs are functional, have precedent in Nomad, and I don't see them causing us problems for at least two years.
We have already opted to fix a lot of technical debt this quarter, and we need to move forward and deliver new features rather than getting stuck in this phase. When we have more capacity and more users, we will explore not only the structure of our API models, but also the need for separate APIs and CLIs for the different job types, as we've discussed before.
Closing due to the controversial nature of this change in discussions here and synchronously.
At present the system representation of a job and user-submitted data are combined into a single structure, the `Job` type.
Problem
It feels like we are making this easy for ourselves instead of our users, or, what's more likely, making it hard for BOTH ourselves and our users. At present we are munging together user-submitted data and system data into multiple structures: the job and its tasks. Further, we are making modifications to the data submitted by users depending on various edge cases. Lastly, we are inefficiently storing job state when persisting to disk in the jobstore.
Ways we are making it harder for ourselves:

- `SanitizeSubmission` only exists because we allow users to submit system fields in the first place.

Ways we are making it hard for our users:

- `bacalhau job describe` may return a job spec that is different from what the users actually submitted. For example, we modify the count field depending on the job type a user submitted: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/models/job.go#L166
- Running `bacalhau job describe --output yaml | bacalhau job run` and getting several warning messages back about fields being ignored is confusing. As a user I would expect this to "just work" without warning.

Proposal
Separate the system representation of a job and the data expected from users into two different types. As an example, the `JobState` is used by bacalhau to monitor and track various system attributes of a job, while the `JobSpec` is the schema/type users are expected to submit.
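A minimal sketch of this split, with a user-submitted `JobSpec` and a system-owned `JobState` paired together only at storage time. All field names here are illustrative assumptions, not the types proposed in the codebase.

```go
package main

import (
	"fmt"
	"time"
)

// JobSpec is the schema users submit; the system never mutates it.
type JobSpec struct {
	Name  string
	Count int
	Tasks []string
}

// JobState is the system's view of the job, tracked separately.
type JobState struct {
	ID         string
	Phase      string
	CreateTime time.Time
}

// StoredJob pairs the submission with the system state for reads.
type StoredJob struct {
	Spec  JobSpec
	State JobState
}

func main() {
	j := StoredJob{
		Spec:  JobSpec{Name: "my-job", Count: 1, Tasks: []string{"echo"}},
		State: JobState{ID: "job-1", Phase: "Pending", CreateTime: time.Now()},
	}
	// The system can mutate state without touching the user's spec.
	j.State.Phase = "Running"
	fmt.Println(j.Spec.Name, j.State.Phase) // prints "my-job Running"
}
```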