Closed josephjclark closed 11 months ago
This looks good @josephjclark, might I suggest a "NORMAL" or "OK" in the case of a successful exit? It makes things easier on the lightning side to have a concrete value for giving a green tick etc.
But yeah I do like how you've structured it under different groupings :ok_hand:
I don't think the table above is quite right, it's a bit hard to map the reasons to priorities (or statuses). It's also a little inflexible.
What if every attempt complete returns { exit_status, exit_reason, exit_message }
(note that if the attempt crashes, it'll still return an attempt complete event (if it can, otherwise lightning has to time it out))
priority | status | reason | message | Comment |
---|---|---|---|---|
1 | OK | . | . | Attempt completed without errors |
2 | Fail | RUNTIME_ERROR | ReferenceError: state is not defined | Error thrown while executing the job |
IMPORT_ERROR | Failed to load adaptor: common@1.0.0 | Any issue loading dependencies through the linker (may not be an adaptor) | ||
SECURITY_ERROR | Eval is disabled for this context | Any security malpractice we may catch. Perhaps this is acrash? | ||
ERROR | General error | |||
... | We can freely support other error reasons | |||
3 | Crash | DATACLIP_LOAD_FAIL | Failed to load dataclip 'xyz' | Failed to load the initial dataclip for some reason. |
CREDENTIAL_LOAD_FAIL | Failed to load credential 'abc' | Failed to load a credential for some reason | ||
OOM | Job exeeded memory threshold of 500mb | |||
TIMEOUT | Job exceeded thet timeout of 5000ms | |||
COMPILER_ERROR | Unexpected token '@' at line 1, col 1 | Maybe SYNTAX_ERROR is better. We may have compiler specific stuff though | ||
CRASH | General catch off for unexpected errors (ie if the runtime throws) |
A fail means the attempt completed normally in the runtime, but may have one or more errors on state. But it does return a state object.
A crash means the attempt failed to execute and state will not be returned.
In this approach we could encode status as a number, 1 2 or 3. Reason and message are strings. Reason would be a finite list which could slowly expand over time.
Thinking about a simpler ?() alternative. For a Fail state, we send a message to attempt:complete with errors. For a crash state, we message attempt:error.
So rather than returning to the same endpoint with different "error codes", we return to different endpoints.
attempt:complete means "I ran the job and it's all fine but there may have been some errors raised along the way, which the user should take a look at"
attempt:error means "I failed to execute this workflow and have aborted early for some very good reason"
I'm also starting to wonder again about the crash/fail distinction. Why should OOM be more critical than IMPORT_ERROR? Or RUNTIME_ERROR?
Maybe I've just got the distinction wrong here. These are all "failed to execute" errors rather than user or adaptor thrown errors. Maybe even that doesn't work - what happens when Salesforce throws a 401 or http throws a 500?
Some crash states here are user error (like syntax error from the compiler) and require user attention (probably oom too). And some fail states, like "sorry I failed to load this adaptor", are probably system/admin issues and are bugs, rather than user error.
Also a bit worried that "crash doesn't return state" thing isn't right. If job A ran but job B timed out, we probably want to inspect the state of job a so that we can re-run or try to reproduce locally.
I think this is where we've landed on the statuses:
reason | summary | example |
---|---|---|
ok | no errors at all | |
cancelled | user killed the job for some reason | user aborted the job in lightning |
fail | Error came out of the job but we continued execution as much as we could | http threw a 500 code; user threw their own Error |
crash | Something went severely wrong and the engine or worker aborted execution early | error loading credentials, runtime exception, syntax error |
killed | Something happened we didn't like and we killed/aborted the workfllow early | security error, admin shut down server, timeout, oom |
lost | Lightning struck off the attempt because it didn't complete in time | Probably means the worker lost conection to lightning |
In addition:
{ reason, error_type, message }
to run:complete
{ source }
key, ie, compiler, runtime, engine, workerTaylor wants the attempt reason to be calculated in the worker
Final spec for run complete payloads (in pseudo typescript):
reason: 'ok' | 'cancelled' | 'failed' | 'killed' | 'crashed' (I don't send 'lost', lightning must infer it)
dataclip: Object,
message?: string // human readable, user-friendly, english language string
error_type?: string // a finite (but growing) list of known errors, ie, ReferenceError, AdaptorError, OOMError
source?: what component (runtime, worker, engine, compiler) reported the error
Questions / comments
ok
or success
?error_type
, but type
is too vague. Maybe error: ReferenceError
is ok?I am pretty sure that the attempt:complete payload will be exactly the same, but may want to include a run id (ie, the run the fail state came from) and will have a dataclip id, rather than an in-line dataclip.
@josephjclark , let's chat about this on Monday—we are getting the run:complete
event on Lightning when a run is successful, but not when it's got a good error. here are the lightning logs (with ws-worker inline)
job one "good" succeeds. job two "bad" fails.
[debug] QUERY OK db=0.2ms idle=295.1ms
begin []
[debug] QUERY OK source="attempts" db=27.3ms
...
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 29ms
Parameters: %{"demand" => 1}
[debug] QUERY OK db=102.7ms idle=467.8ms
begin []
...
[info] [R/T] ❯ Executing expression (1 operations)
[info] [R/T] ❯ Starting operation 1
[info] [R/T] ℹ Operation 1 complete in 0ms
[info] [R/T] ❯ Expression complete!
[info] [R/T] ❯ {
[info] "configuration": {},
[info] "data": {}
[info] }
[info] [R/T] ✔ Completed job f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c in 352ms
[info] [R/T] ♦ Starting job 7ea4acab-96f4-4b32-bc2f-896ee4a961e3
[info] [R/T] ❯ Intialising pipeline
[info] [R/T] ❯ Timeout set to 300000ms
[info] [R/T] ❯ [linker] loading module @openfn/language-http
[info] [R/T] ❯ [linker] Loading module @openfn/language-http from /tmp/openfn/repo/node_modules/@openfn/language-http_latest/dist/index.cjs
[info] [R/T] ℹ Resolved adaptor @openfn/language-http to version 5.0.4
[debug] QUERY OK db=70.2ms
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 187ms
Parameters: %{"demand" => 1}
...
begin []
[debug] QUERY OK db=0.5ms
...
[debug] QUERY OK db=0.2ms
...
[debug] QUERY OK db=0.3ms
commit []
[debug] HANDLED run:start INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 22ms
Parameters: %{"input_dataclip_id" => "0e189878-95b3-4ec1-b294-0a7f36feb9c8", "job_id" => "f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c", "run_id" => "82301f39-c9e4-4204-aa52-adb377b12bcc"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:start :: OK
[debug] QUERY OK db=0.1ms idle=654.2ms
begin []
[debug] QUERY OK source="runs" db=0.1ms
...
[debug] HANDLED run:complete INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 2ms
Parameters: %{"job_id" => "f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c", "output_dataclip" => "{\"configuration\":{},\"data\":{}}", "output_dataclip_id" => "282edadf-9789-4e99-9853-c4c45812257a", "reason" => "success", "run_id" => "82301f39-c9e4-4204-aa52-adb377b12bcc"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:complete :: OK
[info] [R/T] ❯ Executing expression (1 operations)
[info] [R/T] ❯ Starting operation 1
[debug] QUERY OK db=1.1ms idle=794.2ms
begin []
[debug] QUERY OK source="attempts" db=0.4ms
...
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 2ms
Parameters: %{"demand" => 1}
[debug] QUERY OK source="attempts" db=51.1ms idle=756.5ms
...
commit []
[debug] HANDLED run:start INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 54ms
Parameters: %{"input_dataclip_id" => "282edadf-9789-4e99-9853-c4c45812257a", "job_id" => "7ea4acab-96f4-4b32-bc2f-896ee4a961e3", "run_id" => "4dc2a437-d345-444a-b0ba-9aee0f9598d0"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:start :: OK
[info] [R/T] ✘ Failed job 7ea4acab-96f4-4b32-bc2f-896ee4a961e3 after 251ms
[info] [R/T] ✘ ERR_BAD_REQUEST: Request failed with status code 405
[info] [R/T] ❯ {
[info] "message": "Request failed with status code 405",
[info] "name": "AxiosError",
[info] "stack": "AxiosError: Request failed with status code 405\n at settle (/private/tmp/openfn/repo/node_modules/axios/dist/node/axios.cjs:1268:12)\n at IncomingMessage.handleStreamEnd (/private/tmp/openfn/repo/node_modules/axios/dist/node/axios.cjs:2446:11)\n at IncomingMessage.emit (node:events:527:35)\n at endReadableNT (node:internal/streams/readable:1589:12)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)",
[info] "config": {
[info] "transitional": {
[info] "silentJSONParsing": true,
[info] "forcedJSONParsing": true,
[info] "clarifyTimeoutError": false
[info] },
[info] "transformRequest": [
[info] null
[info] ],
[info] "transformResponse": [
[info] null
[info] ],
[info] "timeout": 0,
[info] "xsrfCookieName": "XSRF-TOKEN",
[info] "xsrfHeaderName": "X-XSRF-TOKEN",
[info] "maxContentLength": -1,
[info] "maxBodyLength": -1,
[info] "env": {},
[info] "headers": {
[info] "Accept": "application/json, text/plain, */*",
[info] "Content-Type": "application/x-www-form-urlencoded",
[info] "cookie": "",
[info] "User-Agent": "axios/1.1.3",
[info] "Accept-Encoding": "gzip, deflate, br"
[info] },
iex(2)> DOMException [DataCloneError]: function httpAdapter(config) {
return new Promise(function dispatchHttpRequest(resolvePromise, rejectPromise)...<omitted>...
} could not be cloned.
at new DOMException (node:internal/per_context/domexception:53:5)
at worker.emit (/Users/taylor/build/Lightning/assets/node_modules/workerpool/src/worker.js:252:12)
at Object.workerEmit (/Users/taylor/build/Lightning/assets/node_modules/workerpool/src/index.js:31:10)
at publish (file:///Users/taylor/build/Lightning/assets/node_modules/@openfn/engine-multi/dist/worker/worker.js:49:14)
at helper (file:///Users/taylor/build/Lightning/assets/node_modules/@openfn/engine-multi/dist/worker/worker.js:60:5)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
[info] "method": "post",
[info] "responseType": "json",
[info] "maxRedirects": 5,
[info] "url": "https://docs.openfn.org/this-wont-work",
[info] "params": {},
[info] "data": null
[info] },
[info] "code": "ERR_BAD_REQUEST",
[info] "status": 405
[info] }
[info] [R/T] ✘ Check state.errors.7ea4acab-96f4-4b32-bc2f-896ee4a961e3 for details.
[info] [RTE] ❯ Purging workers
I think this will best be solved with a pairing session. I'll run lightning and start the worker with a watch, you liveShare into the worker and we run through a few scenarios.
A change!! ok
-> success
Not handling every case yet but I am tracking the most critical and the design is fundamentally done.
When a job finishes, the worker needs to include an exit status, reason and message in the event.
Here's an attempt at capturing the reasons
Table generated from: https://www.tablesgenerator.com/markdown_tables
Other errors: DISK_SPACE
The reason should be a short-ish list of enums (Lightning will go on to classify the exist reason into a status to report to the user). The message, which is optional, adds context and detail and may be displayed to the user.
The attempt will also have an exit status and reason. Right now, this will be the highest priority status/reason found across the workflow. But I think we'll need to look at that later because a user-handled, expected error may skew the result.
This is ostensibly a worker issue but it is likely to have knock-on work in the engine and maybe even runtime
Related is https://github.com/OpenFn/kit/issues/394 - I think we can plug in a reasons framework first, and then work #394 into it after.