TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

async_job_call failing on 500 #459

Closed iszulcdeepsense closed 1 month ago

iszulcdeepsense commented 1 month ago

The async_job_call function raises a runtime error on 500 responses, so calling a job that is temporarily down results in an error. This is not the behavior we want. The call should be retried instead, since the problem is usually temporary downtime and a repeated call will often succeed.
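To illustrate the requested behavior, here is a minimal sketch of retrying a call with exponential backoff. It is not Racetrack's implementation: `TransientJobError` and `call_with_retries` are hypothetical names invented for this example, and the attempt count and delays are arbitrary assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientJobError(RuntimeError):
    """Hypothetical error standing in for a 500 response from a job that is temporarily down."""


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 5,        # assumed value, not taken from Racetrack
    initial_delay: float = 0.5,   # seconds before the first retry (assumed)
    backoff: float = 2.0,         # delay multiplier between retries (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientJobError with exponential backoff."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientJobError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay *= backoff
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable only so the waiting can be skipped in tests. Any other exception type propagates immediately without a retry.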

Bunch of logs from Pub service:

[2024-04-30T08:01:33+0000] INFO Request: new Async Job Call method=POST path=/pub/async/new/job/*/latest/api/v1/perform requestId=*
[2024-04-30T08:01:34+0000] INFO Async Job Call task created caller="Job family *" jobName=* jobPath=/api/v1/perform jobVersion=1.3.15 requestId=* taskId=*
[2024-04-30T08:01:34+0000] EROR Async Job Call request error error="making request to a job: Post \"http://job-*-v-*.ikp-rt.svc:7000/pub/job/*/1.3.15/api/v1/perform\": dial tcp 10.233.38.141:7000: connect: connection refused" host=job-*-v-1-3-15.ikp-rt.svc:7000 jobName=* jobVersion=1.3.15 path=/pub/job/*/1.3.15/api/v1/perform requestId=4dad5fc5-370c-43d6-b500-6fe972e8adb8 taskId=3725cdeb-a496-414b-bdc1-e5985eae2c2d
[2024-04-30T08:01:34+0000] INFO Request: Poll async task method=GET path=/pub/async/task/3725cdeb-a496-414b-bdc1-e5985eae2c2d/poll requestId=29720a11-da1d-4242-a03a-6ddb396e56b7 taskId=3725cdeb-a496-414b-bdc1-e5985eae2c2d
[2024-04-30T08:01:34+0000] INFO Proxy request done caller="User *" jobName=* jobPath=/pub/job/*/1.2.22-alpha/api/v1/perform jobVersion=1.2.22-alpha requestId=4dad5fc5-370c-43d6-b500-6fe972e8adb8 status=500

It looks like the job was already down when the call was made, and that's why the call wasn't retried (it is retried when a pod dies during the request). This should be changed in Racetrack.
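One way to picture the proposed change is a single check that treats a refused connection (job not listening yet) the same as a connection lost mid-request. The sketch below is written in Python for illustration only; it does not reflect the actual Pub code, and `is_retriable` is a hypothetical helper name. A refused connection is a reasonable retry candidate because the request never reached the job.

```python
def is_retriable(error: BaseException) -> bool:
    """Return True if `error`, or any error it wraps, is a connection-level failure.

    ConnectionError is the built-in base class of ConnectionRefusedError
    (nothing is listening, as in the "connection refused" log above),
    ConnectionResetError, ConnectionAbortedError and BrokenPipeError.
    """
    seen = set()
    current = error
    # Walk the chain of wrapped exceptions, guarding against reference cycles.
    while current is not None and id(current) not in seen:
        if isinstance(current, ConnectionError):
            return True
        seen.add(id(current))
        current = current.__cause__ or current.__context__
    return False
```

Walking `__cause__` and `__context__` matters because HTTP client libraries usually wrap the low-level socket error in their own exception type.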

Logs from job-*-v-1-3-15-fd8b8bd67-wfvww pod:

{"time": "2024-04-30T08:01:18.934779Z", "level": "INFO", "message": "POST /pub/job/*/1.3.15/api/v1/perform 200", "name": "racetrack.racetrack_commons.api.asgi.access_log", "levelno": 20, "pathname": "/src/python_wrapper/racetrack_commons/api/asgi/access_log.py", "filename": "access_log.py", "module": "access_log", "lineno": 83, "funcName": "access_log", "msecs": 934.7789287567139, "relativeCreated": 15966.252565383911, "thread": 140555844549504, "threadName": "MainThread", "processName": "MainProcess", "process": 1, "tracing_id": "e9e58b7b-0a2e-499b-a5fd-3cd92135ab29", "*": "*", "job_version": "1.3.15"}

Then it restarted:

[2024-04-30 08:01:38] INFO  Running ASGI server on http://0.0.0.0:7000
[2024-04-30 08:01:38] DEBUG Activated Job's venv: /src/job-venv/lib/python3.8/site-packages
[2024-04-30 08:01:38] INFO  uvicorn.error: Started server process [1]
[2024-04-30 08:01:38] INFO  uvicorn.error: Waiting for application startup.
[2024-04-30 08:01:38] INFO  uvicorn.error: Application startup complete.
[2024-04-30 08:01:38] INFO  uvicorn.error: Uvicorn running on http://0.0.0.0:7000 (Press CTRL+C to quit)
[2024-04-30 08:01:40] INFO  loaded job class: *
/src/job-venv/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
{"time": "2024-04-30T08:01:42.383482Z", "level": "INFO", "message": "Server is ready", "name": "racetrack.racetrack_job_wrapper.server", "levelno": 20, "pathname": "/src/python_wrapper/racetrack_job_wrapper/server.py", "filename": "server.py", "module": "server", "lineno": 62, "funcName": "_late_init", "msecs": 383.4824562072754, "relativeCreated": 5048.620223999023, "thread": 139841959876288, "threadName": "Thread-1", "processName": "MainProcess", "process": 1, "*": "*", "job_version": "1.3.15"}