Exit Reasons - Githubissues

josephjclark commented 1 year ago

When a job finishes, the worker needs to include an exit status, reason and message in the event.

Here's an attempt at capturing the reasons

Priority	Status	Exit Reason	Exit Message
0	Planned	-	-	job pending
1	success	OK	-	completed without throwing
2	fail		-	completed but threw an error (error on state)
		RUNTIME_ERROR	state is not defined	Some kind of js runtime exception (likely a reference error)
		ERROR	Patient not found, http 500?	Error thrown by a user or adaptor (ie throw new Error('abort!'))
		IMPORT_ERROR		failed to import a dependency (likely an adaptor problem)
		SECURITY_ERROR	code generation disabled	User tried to execute code that we deem insecure
3	crash		-	Job did not execute or return
		OOM		Out of memory error occured on the job
		TIMEOUT		Timeout occured from the job
		COMPILE_ERROR		Failed to compile the job code, likely a syntax error
		CRASH		Stand-in for some kind of crash without a known reason

Table generated from: https://www.tablesgenerator.com/markdown_tables

Other errors: DISK_SPACE

The reason should be a short-ish list of enums (Lightning will go on to classify the exist reason into a status to report to the user). The message, which is optional, adds context and detail and may be displayed to the user.

The attempt will also have an exit status and reason. Right now, this will be the highest priority status/reason found across the workflow. But I think we'll need to look at that later because a user-handled, expected error may skew the result.

This is ostensibly a worker issue but it is likely to have knock-on work in the engine and maybe even runtime

Related is https://github.com/OpenFn/kit/issues/394 - I think we can plug in a reasons framework first, and then work #394 into it after.

stuartc commented 1 year ago

This looks good @josephjclark, might I suggest a "NORMAL" or "OK" in the case of a successful exit? It makes things easier on the lightning side to have a concrete value for giving a green tick etc.

But yeah I do like how you've structured it under different groupings :ok_hand:

josephjclark commented 1 year ago

I don't think the table above is quite right, it's a bit hard to map the reasons to priorities (or statuses). It's also a little inflexible.

What if every attempt complete returns { exit_status, exit_reason, exit_message } (note that if the attempt crashes, it'll still return an attempt complete event (if it can, otherwise lightning has to time it out))

priority	status	reason	message	Comment
1	OK	.	.	Attempt completed without errors
2	Fail	RUNTIME_ERROR	ReferenceError: state is not defined	Error thrown while executing the job
		IMPORT_ERROR	Failed to load adaptor: common@1.0.0	Any issue loading dependencies through the linker (may not be an adaptor)
		SECURITY_ERROR	Eval is disabled for this context	Any security malpractice we may catch. Perhaps this is acrash?
		ERROR		General error
		...		We can freely support other error reasons
3	Crash	DATACLIP_LOAD_FAIL	Failed to load dataclip 'xyz'	Failed to load the initial dataclip for some reason.
		CREDENTIAL_LOAD_FAIL	Failed to load credential 'abc'	Failed to load a credential for some reason
		OOM	Job exeeded memory threshold of 500mb
		TIMEOUT	Job exceeded thet timeout of 5000ms
		COMPILER_ERROR	Unexpected token '@' at line 1, col 1	Maybe SYNTAX_ERROR is better. We may have compiler specific stuff though
		CRASH		General catch off for unexpected errors (ie if the runtime throws)

A fail means the attempt completed normally in the runtime, but may have one or more errors on state. But it does return a state object.

A crash means the attempt failed to execute and state will not be returned.

In this approach we could encode status as a number, 1 2 or 3. Reason and message are strings. Reason would be a finite list which could slowly expand over time.

josephjclark commented 1 year ago

Thinking about a simpler ?() alternative. For a Fail state, we send a message to attempt:complete with errors. For a crash state, we message attempt:error.

So rather than returning to the same endpoint with different "error codes", we return to different endpoints.

attempt:complete means "I ran the job and it's all fine but there may have been some errors raised along the way, which the user should take a look at"

attempt:error means "I failed to execute this workflow and have aborted early for some very good reason"

josephjclark commented 1 year ago

I'm also starting to wonder again about the crash/fail distinction. Why should OOM be more critical than IMPORT_ERROR? Or RUNTIME_ERROR?

Maybe I've just got the distinction wrong here. These are all "failed to execute" errors rather than user or adaptor thrown errors. Maybe even that doesn't work - what happens when Salesforce throws a 401 or http throws a 500?

Some crash states here are user error (like syntax error from the compiler) and require user attention (probably oom too). And some fail states, like "sorry I failed to load this adaptor", are probably system/admin issues and are bugs, rather than user error.

Also a bit worried that "crash doesn't return state" thing isn't right. If job A ran but job B timed out, we probably want to inspect the state of job a so that we can re-run or try to reproduce locally.

josephjclark commented 12 months ago

I think this is where we've landed on the statuses:

reason	summary	example
ok	no errors at all
cancelled	user killed the job for some reason	user aborted the job in lightning
fail	Error came out of the job but we continued execution as much as we could	http threw a 500 code; user threw their own Error
crash	Something went severely wrong and the engine or worker aborted execution early	error loading credentials, runtime exception, syntax error
killed	Something happened we didn't like and we killed/aborted the workfllow early	security error, admin shut down server, timeout, oom
lost	Lightning struck off the attempt because it didn't complete in time	Probably means the worker lost conection to lightning

In addition:

We send errors with { reason, error_type, message } to run:complete
There will always be a human-readable message
I will include a { source } key, ie, compiler, runtime, engine, worker
Attempt reasons we'll work out later - this is RUN reasons

josephjclark commented 12 months ago

Taylor wants the attempt reason to be calculated in the worker

josephjclark commented 11 months ago

Final spec for run complete payloads (in pseudo typescript):

reason: 'ok' | 'cancelled' | 'failed' | 'killed' | 'crashed' (I don't send 'lost', lightning must infer it)
dataclip: Object,
message?: string // human readable, user-friendly, english language string
error_type?: string // a finite (but growing) list of known errors, ie, ReferenceError, AdaptorError, OOMError
source?: what component (runtime, worker, engine, compiler) reported the error

Questions / comments

Are reasons past tense (cancelled) or present (cancel) ?
ok or success?
I don't like error_type, but type is too vague. Maybe error: ReferenceError is ok?

I am pretty sure that the attempt:complete payload will be exactly the same, but may want to include a run id (ie, the run the fail state came from) and will have a dataclip id, rather than an in-line dataclip.

taylordowns2000 commented 11 months ago

@josephjclark , let's chat about this on Monday—we are getting the run:complete event on Lightning when a run is successful, but not when it's got a good error. here are the lightning logs (with ws-worker inline)

job one "good" succeeds. job two "bad" fails.

[debug] QUERY OK db=0.2ms idle=295.1ms
begin []
[debug] QUERY OK source="attempts" db=27.3ms
...
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 29ms
  Parameters: %{"demand" => 1}
[debug] QUERY OK db=102.7ms idle=467.8ms
begin []
...
[info] [R/T] ❯ Executing expression (1 operations)
[info] [R/T] ❯ Starting operation 1
[info] [R/T] ℹ Operation 1 complete in 0ms
[info] [R/T] ❯ Expression complete!
[info] [R/T] ❯ {
[info]   "configuration": {},
[info]   "data": {}
[info] }
[info] [R/T] ✔ Completed job f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c in 352ms
[info] [R/T] ♦ Starting job 7ea4acab-96f4-4b32-bc2f-896ee4a961e3
[info] [R/T] ❯ Intialising pipeline
[info] [R/T] ❯ Timeout set to 300000ms
[info] [R/T] ❯ [linker] loading module @openfn/language-http
[info] [R/T] ❯ [linker] Loading module @openfn/language-http from /tmp/openfn/repo/node_modules/@openfn/language-http_latest/dist/index.cjs
[info] [R/T] ℹ Resolved adaptor @openfn/language-http to version 5.0.4
[debug] QUERY OK db=70.2ms
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 187ms
  Parameters: %{"demand" => 1}
...
begin []
[debug] QUERY OK db=0.5ms
...
[debug] QUERY OK db=0.2ms
...
[debug] QUERY OK db=0.3ms
commit []
[debug] HANDLED run:start INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 22ms
  Parameters: %{"input_dataclip_id" => "0e189878-95b3-4ec1-b294-0a7f36feb9c8", "job_id" => "f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c", "run_id" => "82301f39-c9e4-4204-aa52-adb377b12bcc"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:start :: OK
[debug] QUERY OK db=0.1ms idle=654.2ms
begin []
[debug] QUERY OK source="runs" db=0.1ms
...
[debug] HANDLED run:complete INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 2ms
  Parameters: %{"job_id" => "f2367ddd-ab0d-4a2d-9a6a-51eb91f3313c", "output_dataclip" => "{\"configuration\":{},\"data\":{}}", "output_dataclip_id" => "282edadf-9789-4e99-9853-c4c45812257a", "reason" => "success", "run_id" => "82301f39-c9e4-4204-aa52-adb377b12bcc"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:complete :: OK
[info] [R/T] ❯ Executing expression (1 operations)
[info] [R/T] ❯ Starting operation 1
[debug] QUERY OK db=1.1ms idle=794.2ms
begin []
[debug] QUERY OK source="attempts" db=0.4ms
...
commit []
[debug] HANDLED claim INCOMING ON worker:queue (LightningWeb.WorkerChannel) in 2ms
  Parameters: %{"demand" => 1}
[debug] QUERY OK source="attempts" db=51.1ms idle=756.5ms
...
commit []
[debug] HANDLED run:start INCOMING ON attempt:02bcee87-fa8d-4dae-bc6a-b1c7538c237a (LightningWeb.AttemptChannel) in 54ms
  Parameters: %{"input_dataclip_id" => "282edadf-9789-4e99-9853-c4c45812257a", "job_id" => "7ea4acab-96f4-4b32-bc2f-896ee4a961e3", "run_id" => "4dc2a437-d345-444a-b0ba-9aee0f9598d0"}
[info] [SRV] ℹ 02bcee87-fa8d-4dae-bc6a-b1c7538c237a :: run:start :: OK
[info] [R/T] ✘ Failed job 7ea4acab-96f4-4b32-bc2f-896ee4a961e3 after 251ms
[info] [R/T] ✘ ERR_BAD_REQUEST: Request failed with status code 405
[info] [R/T] ❯ {
[info]   "message": "Request failed with status code 405",
[info]   "name": "AxiosError",
[info]   "stack": "AxiosError: Request failed with status code 405\n    at settle (/private/tmp/openfn/repo/node_modules/axios/dist/node/axios.cjs:1268:12)\n    at IncomingMessage.handleStreamEnd (/private/tmp/openfn/repo/node_modules/axios/dist/node/axios.cjs:2446:11)\n    at IncomingMessage.emit (node:events:527:35)\n    at endReadableNT (node:internal/streams/readable:1589:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)",
[info]   "config": {
[info]     "transitional": {
[info]       "silentJSONParsing": true,
[info]       "forcedJSONParsing": true,
[info]       "clarifyTimeoutError": false
[info]     },
[info]     "transformRequest": [
[info]       null
[info]     ],
[info]     "transformResponse": [
[info]       null
[info]     ],
[info]     "timeout": 0,
[info]     "xsrfCookieName": "XSRF-TOKEN",
[info]     "xsrfHeaderName": "X-XSRF-TOKEN",
[info]     "maxContentLength": -1,
[info]     "maxBodyLength": -1,
[info]     "env": {},
[info]     "headers": {
[info]       "Accept": "application/json, text/plain, */*",
[info]       "Content-Type": "application/x-www-form-urlencoded",
[info]       "cookie": "",
[info]       "User-Agent": "axios/1.1.3",
[info]       "Accept-Encoding": "gzip, deflate, br"
[info]     },
iex(2)> DOMException [DataCloneError]: function httpAdapter(config) {
                                                                       return new Promise(function dispatchHttpRequest(resolvePromise, rejectPromise)...<omitted>...
                                                                                                                                                                   } could not be cloned.
                         at new DOMException (node:internal/per_context/domexception:53:5)
                                                                                              at worker.emit (/Users/taylor/build/Lightning/assets/node_modules/workerpool/src/worker.js:252:12)
                                at Object.workerEmit (/Users/taylor/build/Lightning/assets/node_modules/workerpool/src/index.js:31:10)
                                                                                                                                          at publish (file:///Users/taylor/build/Lightning/assets/node_modules/@openfn/engine-multi/dist/worker/worker.js:49:14)
                                                                                                at helper (file:///Users/taylor/build/Lightning/assets/node_modules/@openfn/engine-multi/dist/worker/worker.js:60:5)
                                                    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
                                                                                                                         [info]     "method": "post",
[info]     "responseType": "json",
[info]     "maxRedirects": 5,
[info]     "url": "https://docs.openfn.org/this-wont-work",
[info]     "params": {},
[info]     "data": null
[info]   },
[info]   "code": "ERR_BAD_REQUEST",
[info]   "status": 405
[info] }
[info] [R/T] ✘ Check state.errors.7ea4acab-96f4-4b32-bc2f-896ee4a961e3 for details.
[info] [RTE] ❯ Purging workers

I think this will best be solved with a pairing session. I'll run lightning and start the worker with a watch, you liveShare into the worker and we run through a few scenarios.

josephjclark commented 11 months ago

A change!! ok -> success

josephjclark commented 11 months ago

Not handling every case yet but I am tracking the most critical and the design is fundamentally done.

OpenFn / kit

Exit Reasons #396