OpenFn / kit

The bits & pieces that make OpenFn work. (diagrammer, cli, compiler, runtime, runtime manager, logger, etc.)
10 stars 9 forks source link

Exit Reasons #396

Closed josephjclark closed 11 months ago

josephjclark commented 1 year ago

When a job finishes, the worker needs to include an exit status, reason and message in the event.

Here's an attempt at capturing the reasons

Priority Status Exit Reason Exit Message
0 Planned - - job pending
1 success OK - completed without throwing
2 fail - completed but threw an error (error on state)
RUNTIME_ERROR state is not defined Some kind of js runtime exception (likely a reference error)
ERROR Patient not found, http 500? Error thrown by a user or adaptor (ie throw new Error('abort!'))
IMPORT_ERROR failed to import a dependency (likely an adaptor problem)
SECURITY_ERROR code generation disabled User tried to execute code that we deem insecure
3 crash - Job did not execute or return
OOM Out of memory error occured on the job
TIMEOUT Timeout occured from the job
COMPILE_ERROR Failed to compile the job code, likely a syntax error
CRASH Stand-in for some kind of crash without a known reason

Table generated from: https://www.tablesgenerator.com/markdown_tables

Other errors: DISK_SPACE

The reason should be a short-ish list of enums (Lightning will go on to classify the exist reason into a status to report to the user). The message, which is optional, adds context and detail and may be displayed to the user.

The attempt will also have an exit status and reason. Right now, this will be the highest priority status/reason found across the workflow. But I think we'll need to look at that later because a user-handled, expected error may skew the result.

This is ostensibly a worker issue but it is likely to have knock-on work in the engine and maybe even runtime

Related is https://github.com/OpenFn/kit/issues/394 - I think we can plug in a reasons framework first, and then work #394 into it after.

stuartc commented 1 year ago

This looks good @josephjclark, might I suggest a "NORMAL" or "OK" in the case of a successful exit? It makes things easier on the lightning side to have a concrete value for giving a green tick etc.

But yeah I do like how you've structured it under different groupings :ok_hand:

josephjclark commented 1 year ago

I don't think the table above is quite right, it's a bit hard to map the reasons to priorities (or statuses). It's also a little inflexible.

What if every attempt complete returns { exit_status, exit_reason, exit_message } (note that if the attempt crashes, it'll still return an attempt complete event (if it can, otherwise lightning has to time it out))

priority status reason message Comment
1 OK . . Attempt completed without errors
2 Fail RUNTIME_ERROR ReferenceError: state is not defined Error thrown while executing the job
IMPORT_ERROR Failed to load adaptor: common@1.0.0 Any issue loading dependencies through the linker (may not be an adaptor)
SECURITY_ERROR Eval is disabled for this context Any security malpractice we may catch. Perhaps this is acrash?
ERROR General error
... We can freely support other error reasons
3 Crash DATACLIP_LOAD_FAIL Failed to load dataclip 'xyz' Failed to load the initial dataclip for some reason.
CREDENTIAL_LOAD_FAIL Failed to load credential 'abc' Failed to load a credential for some reason
OOM Job exeeded memory threshold of 500mb
TIMEOUT Job exceeded thet timeout of 5000ms
COMPILER_ERROR Unexpected token '@' at line 1, col 1 Maybe SYNTAX_ERROR is better. We may have compiler specific stuff though
CRASH General catch off for unexpected errors (ie if the runtime throws)

A fail means the attempt completed normally in the runtime, but may have one or more errors on state. But it does return a state object.

A crash means the attempt failed to execute and state will not be returned.

In this approach we could encode status as a number, 1 2 or 3. Reason and message are strings. Reason would be a finite list which could slowly expand over time.

josephjclark commented 1 year ago

Thinking about a simpler ?() alternative. For a Fail state, we send a message to attempt:complete with errors. For a crash state, we message attempt:error.

So rather than returning to the same endpoint with different "error codes", we return to different endpoints.

attempt:complete means "I ran the job and it's all fine but there may have been some errors raised along the way, which the user should take a look at"

attempt:error means "I failed to execute this workflow and have aborted early for some very good reason"

josephjclark commented 1 year ago

I'm also starting to wonder again about the crash/fail distinction. Why should OOM be more critical than IMPORT_ERROR? Or RUNTIME_ERROR?

Maybe I've just got the distinction wrong here. These are all "failed to execute" errors rather than user or adaptor thrown errors. Maybe even that doesn't work - what happens when Salesforce throws a 401 or http throws a 500?

Some crash states here are user error (like syntax error from the compiler) and require user attention (probably oom too). And some fail states, like "sorry I failed to load this adaptor", are probably system/admin issues and are bugs, rather than user error.

Also a bit worried that "crash doesn't return state" thing isn't right. If job A ran but job B timed out, we probably want to inspect the state of job a so that we can re-run or try to reproduce locally.

josephjclark commented 12 months ago

I think this is where we've landed on the statuses:

reason summary example
ok no errors at all
cancelled user killed the job for some reason user aborted the job in lightning
fail Error came out of the job but we continued execution as much as we could http threw a 500 code; user threw their own Error
crash Something went severely wrong and the engine or worker aborted execution early error loading credentials, runtime exception, syntax error
killed Something happened we didn't like and we killed/aborted the workfllow early security error, admin shut down server, timeout, oom
lost Lightning struck off the attempt because it didn't complete in time Probably means the worker lost conection to lightning

In addition:

josephjclark commented 12 months ago

Taylor wants the attempt reason to be calculated in the worker

josephjclark commented 11 months ago

Final spec for run complete payloads (in pseudo typescript):

reason: 'ok' | 'cancelled' | 'failed' | 'killed' | 'crashed' (I don't send 'lost', lightning must infer it)
dataclip: Object,
message?: string // human readable, user-friendly, english language string
error_type?: string // a finite (but growing) list of known errors, ie, ReferenceError, AdaptorError, OOMError
source?: what component (runtime, worker, engine, compiler) reported the error

Questions / comments

I am pretty sure that the attempt:complete payload will be exactly the same, but may want to include a run id (ie, the run the fail state came from) and will have a dataclip id, rather than an in-line dataclip.

taylordowns2000 commented 11 months ago

@josephjclark , let's chat about this on Monday—we are getting the run:complete event on Lightning when a run is successful, but not when it's got a good error. here are the lightning logs (with ws-worker inline)

job one "good" succeeds. job two "bad" fails.

image
[debug] QUERY OK db=0.2ms idle=295.1ms
begin []
[debug] QUERY OK source="attempts" db=27.3ms
...
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 29ms
  Parameters: %{"demand" => 1}
[debug] QUERY OK db=102.7ms idle=467.8ms
begin []
...
[info] [R/T] ❯ Executing expression (1 operations)
[info] [R/T] ❯ Starting operation 1
[info] [R/T] ℹ Operation 1 complete in 0ms
[info] [R/T] ❯ Expression complete!
[info] [R/T] ❯ {
[info]   "configuration": {},
[info]   "data": {}
[info] }
[info] [R/T] ✔ Completed job f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c in 352ms
[info] [R/T] ♦ Starting job 7ea4acab-96f4-4b32-bc2f-896ee4a961e3
[info] [R/T] ❯ Intialising pipeline
[info] [R/T] ❯ Timeout set to 300000ms
[info] [R/T] ❯ [linker] loading module @openfn/language-http
[info] [R/T] ❯ [linker] Loading module @openfn/language-http from /tmp/openfn/repo/node_modules/@openfn/language-http_latest/dist/index.cjs
[info] [R/T] ℹ Resolved adaptor @openfn/language-http to version 5.0.4
[debug] QUERY OK db=70.2ms
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 187ms
  Parameters: %{"demand" => 1}
...
begin []
[debug] QUERY OK db=0.5ms
...
[debug] QUERY OK db=0.2ms
...
[debug] QUERY OK db=0.3ms
commit []
[debug] HANDLED run:start INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 22ms
  Parameters: %{"input_dataclip_id" => "0e189878-95b3-4ec1-b294-0a7f36feb9c8", "job_id" => "f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c", "run_id" => "82301f39-c9e4-4204-aa52-adb377b12bcc"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:start :: OK
[debug] QUERY OK db=0.1ms idle=654.2ms
begin []
[debug] QUERY OK source="runs" db=0.1ms
...
[debug] HANDLED run:complete INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 2ms
  Parameters: %{"job_id" => "f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c", "output_dataclip" => "{\"configuration\":{},\"data\":{}}", "output_dataclip_id" => "282edadf-9789-4e99-9853-c4c45812257a", "reason" => "success", "run_id" => "82301f39-c9e4-4204-aa52-adb377b12bcc"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:complete :: OK
[info] [R/T] ❯ Executing expression (1 operations)
[info] [R/T] ❯ Starting operation 1
[debug] QUERY OK db=1.1ms idle=794.2ms
begin []
[debug] QUERY OK source="attempts" db=0.4ms
...
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 2ms
  Parameters: %{"demand" => 1}
[debug] QUERY OK source="attempts" db=51.1ms idle=756.5ms
...
commit []
[debug] HANDLED run:start INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 54ms
  Parameters: %{"input_dataclip_id" => "282edadf-9789-4e99-9853-c4c45812257a", "job_id" => "7ea4acab-96f4-4b32-bc2f-896ee4a961e3", "run_id" => "4dc2a437-d345-444a-b0ba-9aee0f9598d0"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:start :: OK
[info] [R/T] ✘ Failed job 7ea4acab-96f4-4b32-bc2f-896ee4a961e3 after 251ms
[info] [R/T] ✘ ERR_BAD_REQUEST: Request failed with status code 405
[info] [R/T] ❯ {
[info]   "message": "Request failed with status code 405",
[info]   "name": "AxiosError",
[info]   "stack": "AxiosError: Request failed with status code 405\n    at settle (/private/tmp/openfn/repo/node_modules/axios/dist/node/axios.cjs:1268:12)\n    at IncomingMessage.handleStreamEnd (/private/tmp/openfn/repo/node_modules/axios/dist/node/axios.cjs:2446:11)\n    at IncomingMessage.emit (node:events:527:35)\n    at endReadableNT (node:internal/streams/readable:1589:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)",
[info]   "config": {
[info]     "transitional": {
[info]       "silentJSONParsing": true,
[info]       "forcedJSONParsing": true,
[info]       "clarifyTimeoutError": false
[info]     },
[info]     "transformRequest": [
[info]       null
[info]     ],
[info]     "transformResponse": [
[info]       null
[info]     ],
[info]     "timeout": 0,
[info]     "xsrfCookieName": "XSRF-TOKEN",
[info]     "xsrfHeaderName": "X-XSRF-TOKEN",
[info]     "maxContentLength": -1,
[info]     "maxBodyLength": -1,
[info]     "env": {},
[info]     "headers": {
[info]       "Accept": "application/json, text/plain, */*",
[info]       "Content-Type": "application/x-www-form-urlencoded",
[info]       "cookie": "",
[info]       "User-Agent": "axios/1.1.3",
[info]       "Accept-Encoding": "gzip, deflate, br"
[info]     },
iex(2)> DOMException [DataCloneError]: function httpAdapter(config) {
                                                                       return new Promise(function dispatchHttpRequest(resolvePromise, rejectPromise)...<omitted>...
                                                                                                                                                                   } could not be cloned.
                         at new DOMException (node:internal/per_context/domexception:53:5)
                                                                                              at worker.emit (/Users/taylor/build/Lightning/assets/node_modules/workerpool/src/worker.js:252:12)
                                at Object.workerEmit (/Users/taylor/build/Lightning/assets/node_modules/workerpool/src/index.js:31:10)
                                                                                                                                          at publish (file:///Users/taylor/build/Lightning/assets/node_modules/@openfn/engine-multi/dist/worker/worker.js:49:14)
                                                                                                at helper (file:///Users/taylor/build/Lightning/assets/node_modules/@openfn/engine-multi/dist/worker/worker.js:60:5)
                                                    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
                                                                                                                         [info]     "method": "post",
[info]     "responseType": "json",
[info]     "maxRedirects": 5,
[info]     "url": "https://docs.openfn.org/this-wont-work",
[info]     "params": {},
[info]     "data": null
[info]   },
[info]   "code": "ERR_BAD_REQUEST",
[info]   "status": 405
[info] }
[info] [R/T] ✘ Check state.errors.7ea4acab-96f4-4b32-bc2f-896ee4a961e3 for details.
[info] [RTE] ❯ Purging workers

I think this will best be solved with a pairing session. I'll run lightning and start the worker with a watch, you liveShare into the worker and we run through a few scenarios.

josephjclark commented 11 months ago

A change!! ok -> success

josephjclark commented 11 months ago

Not handling every case yet but I am tracking the most critical and the design is fundamentally done.