bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0
688 stars 88 forks source link

Maybe Strict mode? #4192

Open aronchick opened 4 months ago

aronchick commented 4 months ago

Don't fail on these unless we HAVE to? And do things on the client side where possible?

Just some thinking

2024-07-02 08:09:17 - Error: failed request: Unexpected response code: 500 ({
  "error": "task main validation failed: task timeouts validation failed: execution timeout 30m0s and queue timeout 1h0m0s should be less than total timeout 30m0s",
  "message": "Internal Server Error"
})
failed request: Unexpected response code: 500 ({
  "error": "task main validation failed: task timeouts validation failed: execution timeout 30m0s and queue timeout 1h0m0s should be less than total timeout 30m0s",
  "message": "Internal Server Error"
})
2024-07-02 08:09:02 - Error: the job provided is invalid: field: 'State' in 'Job' is not allowed
the job provided is invalid: field: 'State' in 'Job' is not allowed
frrist commented 4 months ago

Please make it habit to share the job spec that resulted in these failures. I am going to make some assumptions here.

The first error is a known UX issue and has been discussed here: https://www.notion.so/Rethinking-Configuration-435fbe87419148b4bbc5119d413786eb?d=42d463663b0c4cf38f7308969687813e&pvs=4#fdc6de7c00364de1bb81b325686e6e66

A topic that came up in engineering standup regarding the feature of default values - We need to scope out what the ideal path is for users here:

  1. Do we start by using defaults and modifying the job spec on their behalf? or…
  2. Do we start by accepting a raw job spec and fail if required values are not provided and then later add an option to include fault values when specific ones are not set.

One example of where defaults hurt UX is around job timeouts. e.g. if a default value for TotalTimeout is set at 30min and a user submits a job without setting TotalTimeout and sets QueueTimeout to 60min they will get back an error stating that QueueTimeout is greater than the TotalTimeout and the user will have no idea where the TotalTimeout value came from since it was applied implicitly by default.

Since (I assume) the job spec provided didn't include a TotalTimeout field, but did (again assuming) include a QueueTimeout which was greater than the default you got this weird error message. IMO this seems like a good reason for not providing default values on the behalf of the user.

The second error is due to an unknown field being included in the job spec. IMO it seems best to fail here since simply ignoring the field will result in users including fields they thing are right, but are instead quietly ignored, leading to more confusion.

wdbaruni commented 2 months ago

The first issue should be mitigated with new config changes where we slow down on setting default values on behalf of the user.

Though the second issue is about being more forgiving on how we handle user submissions than being strict. Forrest and I had a discussion about this before. Forrest is more on the boat of being strict and fail rather than making assumptions on what the user wants. I am not more on the boat of returning warnings and carry on with job submission. This applies to ignoring fields in the job submission not part of the job spec, but can also be applied to the first issue where the total timeout can be extended on-behalf of the user to be higher than both execution and queue timeout

I don't think there is a right or wrong answer here. Obviously it is easier to be more strict and then relax, than the other way around. This is a pure product decision, and I am happy with what product decides the right behaviour here