The async_job_call function raises a runtime error on 500 responses. As a result, calling a job that is temporarily down produces an error, which is not the behavior we are after. In that case the call should be retried, because it will often succeed on a subsequent attempt once the temporary downtime is over.
Bunch of logs from Pub service:
[2024-04-30T08:01:33+0000] INFO Request: new Async Job Call method=POST path=/pub/async/new/job/*/latest/api/v1/perform requestId=*
[2024-04-30T08:01:34+0000] INFO Async Job Call task created caller="Job family *" jobName=* jobPath=/api/v1/perform jobVersion=1.3.15 requestId=* taskId=*
[2024-04-30T08:01:34+0000] EROR Async Job Call request error error="making request to a job: Post \"http://job-*-v-*.ikp-rt.svc:7000/pub/job/*/1.3.15/api/v1/perform\": dial tcp 10.233.38.141:7000: connect: connection refused" host=job-*-v-1-3-15.ikp-rt.svc:7000 jobName=* jobVersion=1.3.15 path=/pub/job/*/1.3.15/api/v1/perform requestId=4dad5fc5-370c-43d6-b500-6fe972e8adb8 taskId=3725cdeb-a496-414b-bdc1-e5985eae2c2d
[2024-04-30T08:01:34+0000] INFO Request: Poll async task method=GET path=/pub/async/task/3725cdeb-a496-414b-bdc1-e5985eae2c2d/poll requestId=29720a11-da1d-4242-a03a-6ddb396e56b7 taskId=3725cdeb-a496-414b-bdc1-e5985eae2c2d
[2024-04-30T08:01:34+0000] INFO Proxy request done caller="User *" jobName=* jobPath=/pub/job/*/1.2.22-alpha/api/v1/perform jobVersion=1.2.22-alpha requestId=4dad5fc5-370c-43d6-b500-6fe972e8adb8 status=500
It looks like the job was dead at the moment the call was made, which is why the call wasn't retried (a retry currently happens only when a pod dies during an in-flight request). This should be changed in Racetrack.
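One way the retry could work, as a minimal sketch (the names `call_with_retry`, `TemporaryJobError`, and the attempt/backoff parameters are illustrative, not Racetrack's actual API): treat "connection refused" and 500 responses as temporary failures and retry with exponential backoff instead of raising immediately.

```python
import time


class TemporaryJobError(RuntimeError):
    """Hypothetical marker for retryable failures (500 response, connection refused)."""


def call_with_retry(call, attempts: int = 3, backoff_seconds: float = 1.0):
    """Invoke `call`, retrying temporary failures with exponential backoff.

    Raises the last TemporaryJobError if all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TemporaryJobError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # back off before retrying: 1s, 2s, 4s, ... (for backoff_seconds=1.0)
            time.sleep(backoff_seconds * 2 ** attempt)
```

With something like this, the scenario in the logs above (job pod restarting, coming back after a few seconds) would succeed on a later attempt instead of returning a 500 to the caller.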
[2024-04-30 08:01:38] INFO Running ASGI server on http://0.0.0.0:7000
[2024-04-30 08:01:38] DEBUG Activated Job's venv: /src/job-venv/lib/python3.8/site-packages
[2024-04-30 08:01:38] INFO uvicorn.error: Started server process [1]
[2024-04-30 08:01:38] INFO uvicorn.error: Waiting for application startup.
[2024-04-30 08:01:38] INFO uvicorn.error: Application startup complete.
[2024-04-30 08:01:38] INFO uvicorn.error: Uvicorn running on http://0.0.0.0:7000 (Press CTRL+C to quit)
[2024-04-30 08:01:40] INFO loaded job class: *
/src/job-venv/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
{"time": "2024-04-30T08:01:42.383482Z", "level": "INFO", "message": "Server is ready", "name": "racetrack.racetrack_job_wrapper.server", "levelno": 20, "pathname": "/src/python_wrapper/racetrack_job_wrapper/server.py", "filename": "server.py", "module": "server", "lineno": 62, "funcName": "_late_init", "msecs": 383.4824562072754, "relativeCreated": 5048.620223999023, "thread": 139841959876288, "threadName": "Thread-1", "processName": "MainProcess", "process": 1, "*": "*", "job_version": "1.3.15"}
The startup logs above are from the job-*-v-1-3-15-fd8b8bd67-wfvww pod after it restarted.