dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

"Expected all tensors to be on the same device, but found at least two devices" when using different threads #679

Closed amit13k closed 9 months ago

amit13k commented 9 months ago

Describe the issue as clearly as possible:

I was attempting to expose outlines features with a Flask server and encountered the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Upon debugging, it seems the error arises when I create generator = outlines.generate.text(model) in a different thread from the one where the model was instantiated (which is what happens in Flask). When I run all the code in the same thread, it works. In Flask, I also acquire a lock before calling any of the outlines functions.

Additionally, the error doesn't seem to occur with smaller models that fit entirely on a single GPU, but I'd like to be able to work with larger models. Any suggestions for fixing this would be appreciated. Thanks.
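For context, the server setup is roughly the following (a sketch of what I described above, not the actual server code; the endpoint name and request shape are placeholders):

import threading

import outlines
from flask import Flask, request

app = Flask(__name__)
lock = threading.Lock()

# The model is instantiated once, in the main thread, at startup...
model = outlines.models.exl2(model_name="huggingface/MaziyarPanahi_miqu-1-70b-sf-GPTQ", model_kwargs={
    "num_experts_per_token": 1,
    "gpu_split": "18,24",
}, device="cuda")

@app.route("/generate", methods=["POST"])
def generate():
    # ...but the generator is created and called in a Flask worker thread,
    # which is where the device mismatch occurs.
    with lock:
        generator = outlines.generate.text(model)
        return {"output": generator(request.json["prompt"])}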

Steps/code to reproduce the bug:

import outlines
import threading

model_path = "huggingface/MaziyarPanahi_miqu-1-70b-sf-GPTQ" # https://huggingface.co/MaziyarPanahi/miqu-1-70b-sf-GPTQ

model = outlines.models.exl2(model_name=model_path, model_kwargs={
    "num_experts_per_token": 1,
    "gpu_split": "18,24",
}, device="cuda")

def task():
    generator = outlines.generate.text(model)
    output = generator("What is gravity?")
    print(output)

thread = threading.Thread(target=task)
thread.start()

Expected result:

The program should generate some output from the llm without crashing.

Error message:

Exception in thread Thread-1 (task):
Traceback (most recent call last):
  File "/home/amit/miniconda3/envs/outlines/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/home/amit/miniconda3/envs/outlines/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amit/repos/outlines/main.py", line 13, in task
    output = generator("What is gravity?")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/amit/miniconda3/envs/outlines/lib/python3.12/site-packages/outlines/generate/api.py", line 200, in __call__
    last_state = next(states)
                 ^^^^^^^^^^^^
  File "/home/amit/miniconda3/envs/outlines/lib/python3.12/site-packages/outlines/generate/generator.py", line 79, in sequence_generator
    next_token_ids, ancestors, sequence_weights = sampler(
                                                  ^^^^^^^^
  File "/home/amit/miniconda3/envs/outlines/lib/python3.12/site-packages/outlines/samplers.py", line 152, in __call__
    weights = sequence_weights + torch.gather(logprobs, 1, next_token_ids).squeeze()
              ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Outlines/Python version information:

Version information

``` 0.1.dev500+ge99d92d Python 3.12.1 | packaged by Anaconda, Inc. | (main, Jan 19 2024, 15:51:05) [GCC 11.2.0] accelerate==0.27.0 aiohttp==3.9.3 aiosignal==1.3.1 annotated-types==0.6.0 attrs==23.2.0 auto-gptq @ git+https://github.com/PanQiWei/AutoGPTQ.git@323950bcb14059d7154109ee8b189f16cfc925d3 blinker==1.7.0 certifi==2024.2.2 chardet==5.2.0 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==3.0.0 cramjam==2.8.1 datasets==2.17.0 dill==0.3.8 diskcache==5.6.3 einops==0.7.0 exllamav2 @ file:///home/amit/repos/outlines/exllamav2 expiring-dict==1.1.0 fastparquet==2024.2.0 filelock==3.13.1 flash-attn==2.5.3 Flask==3.0.2 Flask-Pydantic==0.12.0 frozenlist==1.4.1 fsspec==2024.2.0 gekko==1.0.6 huggingface-hub==0.20.3 idna==3.6 interegular==0.3.3 itsdangerous==2.1.2 Jinja2==3.1.3 joblib==1.3.2 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 lark==1.1.9 llvmlite==0.42.0 MarkupSafe==2.1.5 mkl-fft @ file:///work/perseverance-python-buildout/croot/mkl_fft_1698845673361/work mkl-random @ file:///work/perseverance-python-buildout/croot/mkl_random_1698845720894/work mkl-service==2.4.0 mpmath==1.3.0 multidict==6.0.5 multiprocess==0.70.16 nest-asyncio==1.6.0 networkx==3.2.1 ninja==1.11.1.1 numba==0.59.0 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.3.101 nvidia-nvtx-cu12==12.1.105 outlines @ git+https://github.com/lapp0/outlines@e99d92d024dbf6f6bff10a9c3954f326cf4a0cd3 packaging==23.2 pandas==2.2.0 peft==0.8.2 pillow @ file:///croot/pillow_1707233021655/work psutil==5.9.8 pyarrow==15.0.0 pyarrow-hotfix==0.6 pydantic==2.6.1 pydantic_core==2.16.2 Pygments==2.17.2 python-dateutil==2.8.2 pytz==2024.1 PyYAML==6.0.1 referencing==0.33.0 regex==2023.12.25 requests==2.31.0 rouge==1.0.1 rpds-py==0.18.0 safetensors==0.4.2 scipy==1.12.0 sentencepiece==0.1.99 setuptools==68.2.2 six==1.16.0 sortedcontainers==2.4.0 sympy==1.12 tokenizers==0.15.2 torch==2.2.0 torchaudio==2.2.0 torchvision==0.17.0 tqdm==4.66.2 transformers==4.37.2 triton==2.2.0 typing_extensions==4.9.0 tzdata==2023.4 urllib3==2.2.0 websockets==12.0 Werkzeug==3.0.1 wheel==0.41.2 xxhash==3.4.1 yarl==1.9.4 ```

Context for the issue:

No response

lapp0 commented 9 months ago

Related: https://github.com/outlines-dev/outlines/issues/656

Looks like you're on an outdated version of outlines: outlines@git+https://github.com/lapp0/outlines@e99d92d

Could you try pip install --upgrade outlines?

amit13k commented 9 months ago

> Related: #656
>
> Looks like you're on an outdated version of outlines: outlines@git+https://github.com/lapp0/outlines@e99d92d
>
> Could you try pip install --upgrade outlines?

Thanks for the reply. I uninstalled outlines and ran pip install --upgrade outlines (which installed version 0.0.32), and the issue still exists. Creating and using the generator in a different thread leads to the same error.

lapp0 commented 9 months ago

Thanks for checking, sorry it's still not working. I'll take a look soon.

lapp0 commented 9 months ago

Could you please try using a fresh venv? I cannot reproduce the issue, and it appears you're using a local exllamav2 build (exllamav2 @ file:///home/amit/repos/outlines/exllamav2).

If the issue persists in a fresh venv, please provide your new pip3 freeze output along with nvidia-smi.

Script:

import outlines
import threading

# git lfs install
# git clone https://huggingface.co/MaziyarPanahi/miqu-1-70b-sf-GPTQ models/llm/miqu

model = outlines.models.exl2(model_name="models/llm/miqu", model_kwargs={
    "num_experts_per_token": 1,
    "gpu_split": "18,24",
}, device="cuda")

def task():
    s_greedy = outlines.samplers.greedy()
    s_multinomial = outlines.samplers.multinomial()

    for sampler in [s_greedy, s_multinomial]:
        print("generating with", sampler)
        generator = outlines.generate.text(model, sampler)
        output = generator("What is gravity?")
        print(output)

thread = threading.Thread(target=task)
thread.start()

Output:

generating with <outlines.samplers.GreedySampler object at 0x7f289c11c640>

Gravity is a force that pulls two objects towards each other. It is a fundamental force of nature, which means that it is one of the basic forces that govern the behavior of matter and energy in the universe. Gravity is what keeps planets in orbit around stars, and it is what keeps objects on the surface of a planet.

Gravity is described by the theory of general relativity, which was developed by Albert Einstein in 1915. According to this theory, gravity is not a force that acts at a distance, but rather a curvature of space-time caused by the presence of mass or energy. This means that objects move in response to the curvature of space-time, rather than being pulled by a force.

The strength of gravity between two objects depends on their masses and the distance between them. The greater the masses of the objects, the stronger the gravitational force between them. The closer the objects are to each other, the stronger the gravitational force.

Gravity is a very weak force compared to other fundamental forces, such as the strong and weak nuclear forces that govern the behavior of subatomic particles. However, because of the large masses of objects in the universe, gravity has a very significant effect on the motion of celestial bodies and the structure of the universe as a whole.
generating with <outlines.samplers.MultinomialSampler object at 0x7f289c11cac0>
How does it affect motion in our solar system and beyond? Let's explore an intriguing relationship between mass, energy, and gravitation.

Gravity is a fundamental force of nature that describes the attractive interaction between mass or energy. It's responsible for holding planets in orbit around stars, maintaining structures like galaxies, and even keeping our feet on the ground.

The universal law of gravitation, proposed by Sir Isaac Newton in 1687, explains how the strength of gravitational attraction between two objects depends on their masses and the distance between them:

F = G * (m1 * m2) / r^2

Where F is the force of gravity, G is the gravitational constant, m1 and m2 are the masses of the two objects, and r is the distance between them.

In our solar system, this law explains why planets move in elliptical orbits around the Sun. The Sun, being much more massive than any planet, exerts a dominant gravitational pull that keeps the planets in their orbits. This idea was further refined by Albert Einstein's theory of General Relativity, which describes gravity as a curvature of spacetime caused by mass and energy.

Outside our solar system, gravity plays a crucial role in the formation and evolution of stars, galaxies, and even cosmic structures like dark matter halos. Stars, formed from collapsing clouds of gas and dust, burn through their nuclear fuel, eventually exploding as supernovae, releasing vast amounts of energy and creating new elements. These stars then form binary systems, units of two stars orbiting each other, or become part of larger systems where their gravity influences the motion of planets and other celestial bodies.

Galaxies, which contain millions or billions of stars, are held together by their mutual gravitational attraction. The presence of dark matter, an unknown substance that does not emit light but has gravitational effects, is believed to provide the necessary mass to keep galaxies stable against the outwards forces of their own stars.

Gravity's influence extends far beyond our solar system, playing a role in the large-scale structure of the universe. Studies suggest that more than 95% of the universe's matter is composed of dark matter and dark energy. While the nature of these mysterious substances remains unknown, our understanding of gravity continues to be a cornerstone in explaining their effects on the cosmic scale.

Gravity, an intriguing force that has shaped our understanding of the universe, continues to be a topic of fascination and research for scientists, astronomers, and curious individuals alike. As we continue to probe deeper into the secrets of the universe, our knowledge of gravity will undoubtedly evolve, providing new insights into the fundamental structure of reality.
amit13k commented 9 months ago

Thanks for trying to reproduce the issue. I tried logging the device info just before the line in samplers.py where the error occurs.

print(f"sequence_weights.device: {sequence_weights.device}, logprobs.device: {logprobs.device}, next_token_ids.device: {logprobs.device}")
weights = sequence_weights + torch.gather(logprobs, 1, next_token_ids).squeeze()

When running the generation in a new thread, the log indicated that sequence_weights was on cuda:0 while the other tensors were on cuda:1:

sequence_weights.device: cuda:0, logprobs.device: cuda:1, next_token_ids.device: cuda:1

When running in the main thread, or when using device="cuda:1", everything was on cuda:1:

sequence_weights.device: cuda:1, logprobs.device: cuda:1, next_token_ids.device: cuda:1

I'm not sure why creating a new thread changes the device of sequence_weights, but I was able to fix the issue by changing device="cuda" to device="cuda:1" when creating the model.
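For reference, the workaround only changes the device argument in the model call from the reproduction script above:

# Workaround: pin the model to an explicit device instead of the generic "cuda",
# so sequence_weights ends up on the same device as logprobs and next_token_ids.
model = outlines.models.exl2(model_name=model_path, model_kwargs={
    "num_experts_per_token": 1,
    "gpu_split": "18,24",
}, device="cuda:1")  # was device="cuda"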

pip3 freeze ``` annotated-types==0.6.0 attrs==23.2.0 Brotli @ file:///work/perseverance-python-buildout/croot/brotli-split_1698805593785/work certifi @ file:///croot/certifi_1707229174982/work/certifi chardet==5.2.0 charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work cloudpickle==3.0.0 cramjam==2.8.1 diskcache==5.6.3 einops==0.7.0 exllamav2 @ git+https://github.com/turboderp/exllamav2@825929af7d9091983ad8524a9a7b522a8c620473 fastparquet==2024.2.0 filelock==3.13.1 flash-attn==2.5.3 fsspec==2024.2.0 huggingface-hub==0.20.3 idna==3.6 interegular==0.3.3 Jinja2==3.1.3 joblib==1.3.2 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 lark==1.1.9 llvmlite==0.42.0 MarkupSafe==2.1.5 mkl-fft @ file:///work/perseverance-python-buildout/croot/mkl_fft_1698845673361/work mkl-random @ file:///work/perseverance-python-buildout/croot/mkl_random_1698845720894/work mkl-service==2.4.0 mpmath==1.3.0 nest-asyncio==1.6.0 networkx==3.2.1 ninja==1.11.1.1 numba==0.59.0 numpy @ file:///croot/numpy_and_numpy_base_1704311704800/work/dist/numpy-1.26.3-cp312-cp312-linux_x86_64.whl#sha256=71892d12f82a9c47262bffc99a1edd8ebc0b3d1e033366094bd45a8b4c7c2d43 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.3.101 nvidia-nvtx-cu12==12.1.105 outlines==0.0.32 packaging==23.2 pandas==2.2.0 pillow @ file:///croot/pillow_1707233021655/work pydantic==2.6.1 pydantic_core==2.16.2 Pygments==2.17.2 PySocks @ file:///work/perseverance-python-buildout/croot/pysocks_1698845478203/work python-dateutil==2.8.2 pytz==2024.1 PyYAML @ file:///work/perseverance-python-buildout/croot/pyyaml_1698849903511/work referencing==0.33.0 regex==2023.12.25 requests @ file:///croot/requests_1707355572290/work rpds-py==0.18.0 safetensors==0.4.2 scipy==1.12.0 sentencepiece==0.2.0 setuptools==68.2.2 six==1.16.0 sympy==1.12 tokenizers==0.15.2 torch==2.2.0 torchaudio==2.2.0 torchvision==0.17.0 tqdm==4.66.2 transformers==4.37.2 triton==2.2.0 typing_extensions==4.9.0 tzdata==2024.1 urllib3==2.2.1 websockets==12.0 wheel==0.41.2 ```
nvidia-smi

```
Tue Feb 20 19:21:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:10:00.0 Off |                  N/A |
| 78%   64C    P2            137W / 370W  |    405MiB / 24576MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:25:00.0 Off |                  Off |
| 31%   44C    P8             26W / 450W  |     16MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage       |
|=======================================================================================|
|    0   N/A  N/A      1809      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A      1989    C+G   ...libexec/gnome-remote-desktop-daemon      384MiB |
|    1   N/A  N/A      1809      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
```
rlouf commented 9 months ago

Glad this workaround made it work! Would you mind printing the devices of weights, prompt_token_ids, and attention_masks here? I suspect torch assigns devices more or less at random when only cuda is specified. In that case, we will need to update the initialization code to make sure attention_masks and sequence_weights are on the same device as prompt_token_ids.
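Roughly, the change would look something like the following sketch (not an actual diff; the exact place where these tensors are created during generator initialization is an assumption):

import torch

# Sketch only: build the initial attention mask and sequence weights on the
# same device as the prompt token ids, rather than on the default CUDA device.
device = prompt_token_ids.device
attention_masks = torch.ones_like(prompt_token_ids, device=device)
sequence_weights = torch.zeros(prompt_token_ids.shape[0], dtype=torch.float, device=device)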

This is a small change, happy to review a PR if you feel like contributing.