RWTH-ACS / cricket

cricket is a virtualization solution for GPUs
MIT License
150 stars · 39 forks

Pytorch not working with CUDA 11.2 and CUDA 11.7 #32

Open TC-MCZ opened 1 year ago

TC-MCZ commented 1 year ago
Hi, I have some problems running PyTorch with cricket. I pulled the latest code and built PyTorch locally with the changes the docs mention.

My CUDA is 11.2 and cuDNN is 8.9.2 on a Tesla P4, but I get this problem:

server:

```
+08:01:00.423212 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445168 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445403 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.447247 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.448076 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:07.164339 ERROR: cuda_device_prop_result size mismatch in cpu-server-runtime.c:367
+08:02:22.370950 INFO: RPC deinit requested.
+08:08:54.324012 INFO: have a nice day!
```

client:

```
+08:00:36.417392 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.418684 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.420058 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
call failed: RPC: Timed out
call failed: RPC: Timed out
call failed: RPC: Timed out
+08:02:01.851255 ERROR: something went wrong in cpu-client-runtime.c:444
Traceback (most recent call last):
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 39, in <module>
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 246, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error:

CUDA call was originally invoked at:

['  File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 31, in <module>\n    import torch\n',
 '  File "<frozen importlib._bootstrap>", line 991, in _find_and_load\n',
 '  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked\n',
 '  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked\n',
 '  File "<frozen importlib._bootstrap_external>", line 843, in exec_module\n',
 '  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n',
 '  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/__init__.py", line 798, in <module>\n    _C._initExtension(manager_path())\n',
 '  File "<frozen importlib._bootstrap>", line 991, in _find_and_load\n',
 '  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked\n',
 '  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked\n',
 '  File "<frozen importlib._bootstrap_external>", line 843, in exec_module\n',
 '  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n',
 '  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 179, in <module>\n    _lazy_call(_check_capability)\n',
 '  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 177, in _lazy_call\n    _queued_calls.append((callable, traceback.format_stack()))\n']
+08:02:27.007890 ERROR: call failed. in cpu-client.c:213
+08:02:27.012036 INFO: api-call-cnt: 14
+08:02:27.012051 INFO: memcpy-cnt: 0
```

Is my CUDA version wrong, or is there another reason?

Originally posted by @Tlhaoge in https://github.com/RWTH-ACS/cricket/issues/6#issuecomment-1654899151
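One plausible reading of the `cuda_device_prop_result size mismatch in cpu-server-runtime.c:367` error: `struct cudaDeviceProp` gains fields across CUDA releases, so when the two sides of the RPC (client and server) are built against different toolkit headers, they disagree on the payload size and the check fails. A toy ctypes sketch of that mechanism, with invented field layouts (these are NOT the real `cudaDeviceProp` definitions, just an illustration of why a struct-size check breaks across header versions):

```python
import ctypes

# Hypothetical "older toolkit" layout of a device-properties struct.
class DevicePropOld(ctypes.Structure):
    _fields_ = [("name", ctypes.c_char * 256),
                ("totalGlobalMem", ctypes.c_size_t)]

# Hypothetical "newer toolkit" layout: one added field is enough to
# change sizeof(), so an RPC size check between mismatched builds fails.
class DevicePropNew(ctypes.Structure):
    _fields_ = [("name", ctypes.c_char * 256),
                ("totalGlobalMem", ctypes.c_size_t),
                ("memoryPoolsSupported", ctypes.c_int)]

print(ctypes.sizeof(DevicePropOld), ctypes.sizeof(DevicePropNew))
```

If this is the cause, rebuilding cricket and the application against the same CUDA toolkit version on both ends should make the sizes agree.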

leonardosul commented 9 months ago

Encountering the same issue. Using CUDA 11.7 and CUDNN 8.7.0. Running on an AWS EC2 instance.

It would be really nice to have a GitHub workflow that builds and runs the RPC server and Docker container together, to ensure it works as described in the docs. Although this would require a GPU-enabled runner... probably not as easy as I imagined 🤔

n-eiling commented 9 months ago

There is a CI testing cricket with a GPU-enabled runner. There is no test for PyTorch yet, and yes, we should add one. However, I'm not surprised there are issues with PyTorch support. PyTorch is really complex and uses a lot of CUDA features in unusual ways, which makes testing pretty difficult.

leonardosul commented 9 months ago

@n-eiling Thanks for the reply! I can see that you use GitLab CI. I can have a look and see if I can write a workflow that tests PyTorch with cricket.

Outside of that, how would you recommend I go about mapping the unusual ways that PyTorch uses CUDA? That might be a good place to start, I guess.
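One way to start mapping those calls is to interpose on the API surface and log every entry point in the order PyTorch hits it. Here is a minimal Python sketch of the wrapping pattern; `FakeCudaRuntime` is a hypothetical stand-in, since in practice you would wrap a ctypes-loaded `libcudart` or, more robustly, use an `LD_PRELOAD` shim in C around the CUDA runtime/driver symbols:

```python
import functools

def trace_calls(api, log):
    """Return a proxy for `api` that records the name of every
    callable attribute invoked through it into `log`."""
    class Traced:
        def __getattr__(self, name):
            attr = getattr(api, name)
            if callable(attr):
                @functools.wraps(attr)
                def wrapper(*args, **kwargs):
                    log.append(name)          # record the call order
                    return attr(*args, **kwargs)
                return wrapper
            return attr
    return Traced()

# Demonstration on a made-up stand-in API (no GPU required):
class FakeCudaRuntime:
    def cudaMalloc(self, size):
        return 0
    def cudaGetDeviceProperties(self, dev):
        return {"name": "stub"}

log = []
rt = trace_calls(FakeCudaRuntime(), log)
rt.cudaMalloc(64)
rt.cudaGetDeviceProperties(0)
print(log)  # → ['cudaMalloc', 'cudaGetDeviceProperties']
```

Running a small PyTorch script behind such a shim would give a trace of which CUDA entry points it exercises and in what order, which could then be compared against what cricket currently forwards.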

leeyiding commented 5 months ago

Hello, I encountered the same problem when running pytorch_minimal.py on CUDA 11.8 and cuDNN 8.9. Does anyone have a solution yet?