Qessia closed this issue 7 months ago.
Hello! Thank you for reporting this! We will resolve the issue quickly.
Hello!
I'm observing the same problem. I have tried to diagnose the issue a bit myself.
As far as I understand (in case you haven't found it already), the problem is in how the block is constructed when calculating its size and parameters: the layer_idx mentioned above is passed in load_pretrained_block, but it is not passed when calculating block_size or when measuring RPS in throughput.
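To make this concrete, here is a minimal sketch (plain transformers only, not Petals code; the tiny MixtralConfig values are arbitrary) showing that the decoder layer wrapped by WrappedMixtralBlock cannot be built without a layer_idx, which is why any code path that constructs a block for measurement also has to pass one:

```python
# Minimal sketch, plain transformers only (not Petals code); config values are arbitrary.
from transformers import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer

config = MixtralConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    num_local_experts=4,
)

# MixtralDecoderLayer(config)  # TypeError: missing 1 required positional argument: 'layer_idx'
block = MixtralDecoderLayer(config, layer_idx=0)  # constructing with layer_idx works
```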
Really looking forward to a fix.
We resolved this issue in a recent master update. Just pull the latest changes. Thank you for noticing the issue and waiting for the fix.
Thank you for your quick response!
Hi! The original error from this issue doesn't appear anymore, but I get another error when I try to launch a private swarm with Mixtral (on GPU; CPU is fine). It also doesn't appear when I do the same with StableBeluga2:
python3 -m petals.cli.run_server SanjiWatsuki/TinyMixtral-32x248M --new_swarm
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/qessia/.local/lib/python3.10/site-packages/petals/cli/run_server.py", line 235, in <module>
main()
File "/home/qessia/.local/lib/python3.10/site-packages/petals/cli/run_server.py", line 219, in main
server = Server(
File "/home/qessia/.local/lib/python3.10/site-packages/petals/server/server.py", line 237, in __init__
throughput_info = get_server_throughput(
File "/home/qessia/.local/lib/python3.10/site-packages/petals/server/throughput.py", line 83, in get_server_throughput
cache[cache_key] = measure_throughput_info(
File "/home/qessia/.local/lib/python3.10/site-packages/petals/server/throughput.py", line 123, in measure_throughput_info
"inference_rps": measure_compute_rps(
File "/home/qessia/.local/lib/python3.10/site-packages/petals/server/throughput.py", line 218, in measure_compute_rps
cache = step(cache)
File "/home/qessia/.local/lib/python3.10/site-packages/petals/server/throughput.py", line 215, in step
outputs = block.forward(dummy_input, use_cache=inference, layer_past=cache_ if inference else None)
File "/home/qessia/.local/lib/python3.10/site-packages/tensor_parallel/tensor_parallel.py", line 99, in forward
return [self.module_shards[0](*args, **kwargs)][self.output_device_index]
File "/home/qessia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/qessia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/qessia/.local/lib/python3.10/site-packages/petals/models/mixtral/block.py", line 74, in forward
outputs = super().forward(
File "/home/qessia/.local/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 934, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/qessia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/qessia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/qessia/.local/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 356, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
File "/home/qessia/.local/lib/python3.10/site-packages/transformers/cache_utils.py", line 131, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)
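For reference, here is a minimal sketch of the device mismatch behind this traceback (plain PyTorch, not Petals code; the tensor shapes are arbitrary): the KV cache used during the throughput measurement stays on the CPU while the block produces key states on cuda:0, so torch.cat refuses to concatenate them.

```python
# Minimal sketch of the device mismatch (plain PyTorch; shapes are arbitrary).
import torch

if torch.cuda.is_available():
    cpu_cache = torch.zeros(1, 8, 2, 64)                  # KV cache left on the CPU
    new_keys = torch.randn(1, 8, 4, 64, device="cuda:0")  # block output on the GPU
    try:
        torch.cat([cpu_cache, new_keys], dim=-2)
    except RuntimeError as e:
        print(e)  # "Expected all tensors to be on the same device ..."
    # The fix pattern: move the cache to the block's device before updating it.
    merged = torch.cat([cpu_cache.to(new_keys.device), new_keys], dim=-2)
```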
Hello! This is a strange error. Can you also provide your transformers version?
4.38.2
Thank you for the information. It seems the only change required is this one: https://github.com/bigscience-workshop/petals/pull/574. We will merge it into main soon.
Hi! How is work on the fix going, is everything alright? We are really looking forward to the merge.
I had the same error on master as well and have a ticket open for it: https://github.com/bigscience-workshop/petals/issues/575
Sorry for taking so long; the fix is merged into master.
I was able to get the branch mentioned above running and rebased my Docker work on it.
I now have TinyMixtral running locally on the GPU: https://github.com/meta-introspector/petals
Thank you for the fixes! It works.
Reproduce:
python3 -m petals.cli.run_server mistralai/Mixtral-8x7B-v0.1 --new_swarm
or python3 -m petals.cli.run_server SanjiWatsuki/TinyMixtral-32x248M --new_swarm
Got:
TypeError: WrappedMixtralBlock.__init__() missing 1 required positional argument: layer_idx
System: