The async_job_call function raises a runtime error on 500 responses. As a result, calling a job that is temporarily down produces an error, which is not the behavior we are after. In that case the call should be retried, because it will often succeed on a subsequent attempt once the temporary downtime is over.
Bunch of logs from Pub service:
[2024-04-30T08:01:33+0000] INFO Request: new Async Job Call method=POST path=/pub/async/new/job/*/latest/api/v1/perform requestId=*
[2024-04-30T08:01:34+0000] INFO Async Job Call task created caller="Job family *" jobName=* jobPath=/api/v1/perform jobVersion=1.3.15 requestId=* taskId=*
[2024-04-30T08:01:34+0000] EROR Async Job Call request error error="making request to a job: Post \"http://job-*-v-*.ikp-rt.svc:7000/pub/job/*/1.3.15/api/v1/perform\": dial tcp 10.233.38.141:7000: connect: connection refused" host=job-*-v-1-3-15.ikp-rt.svc:7000 jobName=* jobVersion=1.3.15 path=/pub/job/*/1.3.15/api/v1/perform requestId=4dad5fc5-370c-43d6-b500-6fe972e8adb8 taskId=3725cdeb-a496-414b-bdc1-e5985eae2c2d
[2024-04-30T08:01:34+0000] INFO Request: Poll async task method=GET path=/pub/async/task/3725cdeb-a496-414b-bdc1-e5985eae2c2d/poll requestId=29720a11-da1d-4242-a03a-6ddb396e56b7 taskId=3725cdeb-a496-414b-bdc1-e5985eae2c2d
[2024-04-30T08:01:34+0000] INFO Proxy request done caller="User *" jobName=* jobPath=/pub/job/*/1.2.22-alpha/api/v1/perform jobVersion=1.2.22-alpha requestId=4dad5fc5-370c-43d6-b500-6fe972e8adb8 status=500
It looks like the job was dead at the moment the call was made, which is why the call wasn't retried (a retry currently happens only when a pod dies during an in-flight request). This should be changed in Racetrack.
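One way the retry could work, as a minimal sketch (the names `call_with_retry`, `TemporaryJobError`, and the attempt/backoff parameters are illustrative, not Racetrack's actual API): treat "connection refused" and 500 responses as temporary failures and retry with exponential backoff instead of raising immediately.

```python
import time


class TemporaryJobError(RuntimeError):
    """Hypothetical marker for retryable failures (500 response, connection refused)."""


def call_with_retry(call, attempts: int = 3, backoff_seconds: float = 1.0):
    """Invoke `call`, retrying temporary failures with exponential backoff.

    Raises the last TemporaryJobError if all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TemporaryJobError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # back off before retrying: 1s, 2s, 4s, ... (for backoff_seconds=1.0)
            time.sleep(backoff_seconds * 2 ** attempt)
```

With something like this, the scenario in the logs above (job pod restarting, coming back after a few seconds) would succeed on a later attempt instead of returning a 500 to the caller.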
[2024-04-30 08:01:38] INFO Running ASGI server on http://0.0.0.0:7000
[2024-04-30 08:01:38] DEBUG Activated Job's venv: /src/job-venv/lib/python3.8/site-packages
[2024-04-30 08:01:38] INFO uvicorn.error: Started server process [1]
[2024-04-30 08:01:38] INFO uvicorn.error: Waiting for application startup.
[2024-04-30 08:01:38] INFO uvicorn.error: Application startup complete.
[2024-04-30 08:01:38] INFO uvicorn.error: Uvicorn running on http://0.0.0.0:7000 (Press CTRL+C to quit)
[2024-04-30 08:01:40] INFO loaded job class: *
/src/job-venv/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
{"time": "2024-04-30T08:01:42.383482Z", "level": "INFO", "message": "Server is ready", "name": "racetrack.racetrack_job_wrapper.server", "levelno": 20, "pathname": "/src/python_wrapper/racetrack_job_wrapper/server.py", "filename": "server.py", "module": "server", "lineno": 62, "funcName": "_late_init", "msecs": 383.4824562072754, "relativeCreated": 5048.620223999023, "thread": 139841959876288, "threadName": "Thread-1", "processName": "MainProcess", "process": 1, "*": "*", "job_version": "1.3.15"}
The startup logs above are from the job-*-v-1-3-15-fd8b8bd67-wfvww pod after it restarted.