elyra-ai / elyra

Elyra extends JupyterLab with an AI-centric approach.
https://elyra.readthedocs.io/en/stable/
Apache License 2.0

JupyterLab/Local pipeline execution fails with "Expected token < in JSON ..." for long running tasks #1573

Open · ptitzler opened this issue 3 years ago

ptitzler commented 3 years ago

Describe the issue: See the steps below. I could reproduce this consistently on ODH, but not locally.

To Reproduce: Steps to reproduce the behavior:

  1. In ODH, launch JupyterHub and choose the s2i-lab-elyra:vX.Y.Z image in the spawner dialog.
  2. Git clone https://github.com/elyra-ai/examples.git
  3. Open the NOAA pipeline.
  4. Run the pipeline locally.

There is no error logged in the Jupyter console. The web browser indicates that the POST https://jupyterhub-1-0-9-test.ptitzler-odh-442dbba0442be6c8c50f31ed96b00601-0000.sjc04.containers.appdomain.cloud/user/iam%23ptitzler@us.ibm.com/elyra/pipeline/schedule?1618515969761 request fails with status code 504, returning this message text:

<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

which the client then tries to parse as JSON, producing the "Expected token <" error.
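
Since the proximate failure is an HTML error page being fed to a JSON parser, a defensive client could check the status code and Content-Type before parsing and surface the gateway error instead. A minimal TypeScript sketch, not Elyra's actual request code; the endpoint path mirrors the failing request above and everything else is assumed:

// Sketch: validate the scheduling response before parsing it as JSON.
// The endpoint and error handling here are illustrative assumptions.
async function schedulePipeline(payload: object): Promise<unknown> {
  const response = await fetch('/elyra/pipeline/schedule', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  });

  const contentType = response.headers.get('Content-Type') ?? '';
  if (!response.ok || !contentType.includes('application/json')) {
    // A gateway timeout returns an HTML error page, not JSON, so report
    // the status instead of letting JSON parsing fail on '<html>...'.
    throw new Error(
      `Pipeline scheduling failed: ${response.status} ${response.statusText}`
    );
  }
  return response.json();
}

This would turn the opaque "Expected token <" message into an actionable 504.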

ptitzler commented 3 years ago

I was unable to reproduce the issue running that container image locally:

docker run -p 8888:8888 quay.io/thoth-station/s2i-lab-elyra:v0.0.7 jupyter lab --ip=0.0.0.0 --port=8888 --notebook-dir=/tmp

Perhaps JupyterHub is imposing some sort of timeout limit? Just a random guess ...

bourdakos1 commented 3 years ago

@ptitzler do you know how long it takes for the response to time out? I'm guessing this is at the ingress level.

akchinSTC commented 3 years ago

@bourdakos1 - 30 seconds on the dot

bourdakos1 commented 3 years ago

@akchinSTC this smells like a default Kubernetes/OpenShift timeout that could probably be configured away, but we should find a different solution regardless. I don't think we should keep a request open for longer than 30 seconds; we should find a better way to handle long-running submissions/runs.
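
For illustration, one common way to keep each request short is a submit-then-poll pattern: the schedule endpoint returns a run id immediately, and the client polls a status endpoint until the run finishes. A hypothetical TypeScript sketch; the /elyra/pipeline/status endpoint, the runId field, and the state values are all assumptions, not an existing Elyra API:

// Hypothetical submit-then-poll client; endpoint paths and response
// shapes are assumptions for illustration only.
interface RunStatus {
  state: 'pending' | 'running' | 'succeeded' | 'failed';
}

async function submitAndAwaitRun(payload: object): Promise<RunStatus> {
  // Submission returns right away with a run id, so this request stays
  // well under a 30-second ingress timeout.
  const submit = await fetch('/elyra/pipeline/schedule', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  });
  const { runId } = (await submit.json()) as { runId: string };

  // Each poll is its own short request, so no single call approaches
  // the gateway limit either.
  while (true) {
    const poll = await fetch(`/elyra/pipeline/status/${runId}`);
    const status = (await poll.json()) as RunStatus;
    if (status.state === 'succeeded' || status.state === 'failed') {
      return status;
    }
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}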

bourdakos1 commented 3 years ago

I think if we move to something like WebSockets we wouldn't run into a timeout issue, and we could get richer information about the progress of the runs.
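
For example, a client could subscribe to a progress socket instead of blocking on the POST. A sketch under the assumption of a hypothetical /elyra/pipeline/progress endpoint and message shape; neither is an existing Elyra API:

// Hypothetical WebSocket progress listener; the endpoint and the
// ProgressEvent shape are assumptions, not an Elyra API.
interface ProgressEvent {
  runId: string;
  nodeId: string;
  state: string;
  message?: string;
}

function watchRun(runId: string, onProgress: (e: ProgressEvent) => void): WebSocket {
  // A long-lived socket avoids the per-request gateway timeout that a
  // blocking POST is subject to.
  const socket = new WebSocket(`wss://${location.host}/elyra/pipeline/progress/${runId}`);
  socket.onmessage = event => onProgress(JSON.parse(event.data) as ProgressEvent);
  socket.onerror = () => console.error(`Progress socket error for run ${runId}`);
  return socket;
}

Note that idle sockets can still be dropped by proxies, so periodic ping/pong keepalives would likely still be needed.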

akchinSTC commented 3 years ago

The deployment on OpenShift doesn't appear to have any ingresses or routes configured in the ODH project/namespace, so it's probably using some global default. Short term, I'm going to see if a quick config change will do the trick, but I agree with:

> We should try to find a way to better handle long-running submissions/runs

ptitzler commented 3 years ago

From https://jupyterhub.readthedocs.io/en/stable/reference/config-reference.html

## Timeout (in seconds) before giving up on a spawned HTTP server
#  
#  Once a server has successfully been spawned, this is the amount of time we
#  wait before assuming that the server is unable to accept connections.
#  Default: 30
# c.Spawner.http_timeout = 30

From https://jupyterhub-kubespawner.readthedocs.io/en/latest/spawner.html

##  Timeout (in seconds) before giving up on a spawned HTTP server
# Once a server has successfully been spawned, this is the amount of time we wait before assuming that the server is unable to accept connections.
# c.KubeSpawner.http_timeout = Int(30)

ptitzler commented 3 years ago

It turns out the default timeout for OpenShift routes is 30 seconds as well, and it needs to be increased (for example, via the haproxy.router.openshift.io/timeout annotation on the route).

ptitzler commented 3 years ago

Development TODO: add error handling for non-success status codes.

usmcamp0811 commented 2 years ago

I am seeing this error when I attempt to run in our Kubernetes Z2JH environment. If I run the container that has Elyra installed (the same one that's used in the k8s cluster) on its own, it works. I am seeing a 403 error in the browser console at the elyra/pipeline/schedule?1657211511533 endpoint.