bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat) #575

Closed. jmikedupont2 closed this issue 2 months ago.

jmikedupont2 commented 2 months ago

The seq_len argument is deprecated and unused. It will be removed in v4.39.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/cli/run_server.py", line 235, in <module>
    main()
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/cli/run_server.py", line 219, in main
    server = Server(
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/server/server.py", line 237, in __init__
    throughput_info = get_server_throughput(
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/server/throughput.py", line 83, in get_server_throughput
    cache[cache_key] = measure_throughput_info(
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/server/throughput.py", line 123, in measure_throughput_info
    "inference_rps": measure_compute_rps(
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/server/throughput.py", line 218, in measure_compute_rps
    cache = step(cache)
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/server/throughput.py", line 215, in step
    outputs = block.forward(dummy_input, use_cache=inference, layer_past=cache if inference else None)
  File "/mnt/data1/nix/time/2023/09/22/petals/.venv-omain/lib/python3.10/site-packages/tensor_parallel/tensor_parallel.py", line 99, in forward
    return [self.module_shards[0](*args, **kwargs)][self.output_device_index]
  File "/mnt/data1/nix/time/2023/09/22/petals/.venv-omain/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/data1/nix/time/2023/09/22/petals/.venv-omain/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/models/llama/block.py", line 264, in forward
    outputs = super().forward(
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/models/llama/block.py", line 193, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/mnt/data1/nix/time/2023/09/22/petals/.venv-omain/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/data1/nix/time/2023/09/22/petals/.venv-omain/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data1/nix/time/2023/09/22/petals/src/petals/models/llama/block.py", line 103, in forward
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)
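For context, torch.cat raises this error whenever its inputs live on different devices; in the traceback above, the cached past_key_value appears to be on the CPU while the freshly computed key_states is on cuda:0. Below is a minimal standalone sketch (plain PyTorch, not Petals code) that reproduces the error and shows the usual remedy of moving everything onto one device before concatenating:

```python
import torch

# Reproduce the error from the traceback: torch.cat requires all of its
# inputs to be on the same device.
past_key = torch.randn(1, 8, 4, 64)                  # lives on the CPU
new_key = torch.randn(1, 8, 1, 64, device="cuda:0")  # lives on the GPU

try:
    torch.cat([past_key, new_key], dim=2)
except RuntimeError as err:
    print(err)  # "Expected all tensors to be on the same device, ..."

# Usual remedy: move the cached tensor onto the device of the new one
# before concatenating along the sequence dimension.
merged = torch.cat([past_key.to(new_key.device), new_key], dim=2)
print(merged.shape)  # torch.Size([1, 8, 5, 64])
```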

igorrafael commented 2 months ago

I have also encountered this issue. I was using Docker Desktop (CE) on Windows 11. I validated my setup first by running ollama/ollama successfully.

jmikedupont2 commented 2 months ago

OK, I fixed this in my branch by rolling back the version: https://github.com/meta-introspector/petals/commit/64e1361fa36648b1412c4557ddc5af3d6879a007

The latest version of Petals does not work for me.
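If rolling back is not an option, one possible local workaround is to move the cached key/value tensors onto the device of the incoming states before the concatenation that fails in src/petals/models/llama/block.py. This is only a hedged sketch based on the line shown in the traceback, not the fix from the commit above, and the surrounding attribute names should be checked against your checkout:

```python
# Hypothetical patch sketch around the failing line
# (src/petals/models/llama/block.py, line 103 in the traceback).
if past_key_value is not None:
    # Make sure the cached tensors live on the same device as the new states.
    past_key_value = tuple(t.to(key_states.device) for t in past_key_value)
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
    value_states = torch.cat([past_key_value[1], value_states], dim=2)
```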