bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0
649 stars 86 forks source link

Maybe Strict mode? #4192

Open aronchick opened 2 weeks ago

aronchick commented 2 weeks ago

Don't fail on these unless we HAVE to? And do things on the client side where possible?

Just some thinking

2024-07-02 08:09:17 - Error: failed request: Unexpected response code: 500 ({
  "error": "task main validation failed: task timeouts validation failed: execution timeout 30m0s and queue timeout 1h0m0s should be less than total timeout 30m0s",
  "message": "Internal Server Error"
})
failed request: Unexpected response code: 500 ({
  "error": "task main validation failed: task timeouts validation failed: execution timeout 30m0s and queue timeout 1h0m0s should be less than total timeout 30m0s",
  "message": "Internal Server Error"
})
2024-07-02 08:09:02 - Error: the job provided is invalid: field: 'State' in 'Job' is not allowed
the job provided is invalid: field: 'State' in 'Job' is not allowed
frrist commented 2 weeks ago

Please make it habit to share the job spec that resulted in these failures. I am going to make some assumptions here.

The first error is a known UX issue and has been discussed here: https://www.notion.so/Rethinking-Configuration-435fbe87419148b4bbc5119d413786eb?d=42d463663b0c4cf38f7308969687813e&pvs=4#fdc6de7c00364de1bb81b325686e6e66

A topic that came up in engineering standup regarding the feature of default values - We need to scope out what the ideal path is for users here:

  1. Do we start by using defaults and modifying the job spec on their behalf? or…
  2. Do we start by accepting a raw job spec and fail if required values are not provided and then later add an option to include fault values when specific ones are not set.

One example of where defaults hurt UX is around job timeouts. e.g. if a default value for TotalTimeout is set at 30min and a user submits a job without setting TotalTimeout and sets QueueTimeout to 60min they will get back an error stating that QueueTimeout is greater than the TotalTimeout and the user will have no idea where the TotalTimeout value came from since it was applied implicitly by default.

Since (I assume) the job spec provided didn't include a TotalTimeout field, but did (again assuming) include a QueueTimeout which was greater than the default you got this weird error message. IMO this seems like a good reason for not providing default values on the behalf of the user.

The second error is due to an unknown field being included in the job spec. IMO it seems best to fail here since simply ignoring the field will result in users including fields they thing are right, but are instead quietly ignored, leading to more confusion.