Describe the issue as clearly as possible:
I've encountered an issue when using the outlines library with a model downloaded from the Hugging Face Hub (turboderp/Llama-3-8B-Instruct-exl2) while specifying which CUDA device to use. Despite loading the model with device=1, some operations still create or access tensors on device 0, which leads to a RuntimeError during generation.
Steps/code to reproduce the bug:
import outlines
from huggingface_hub import snapshot_download

model_name = "turboderp/Llama-3-8B-Instruct-exl2"
revision = "3.0bpw"
model_directory = snapshot_download(repo_id=model_name, revision=revision, local_dir="llama3")

# Load the model on GPU 1 instead of the default GPU 0
model = outlines.models.exl2(model_directory, device=1)
react_prompt = """
Question: How do you cook a sunny side egg?
FORMAT:
Strictly use the following format:
Thought: [insert thought]
Action: [Steps to follow]"""
generator = outlines.generate.text(model)
output = generator(react_prompt, stop_at="Action: ")
print(output)
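For context, the failure boils down to an elementwise op mixing two CUDA devices. This is a minimal plain-PyTorch sketch of the same class of error (my illustration, not outlines code; it assumes a machine with at least two CUDA GPUs):

import torch

# Tensors placed on different GPUs, mirroring what the traceback below suggests
logprobs = torch.zeros(1, 4, device="cuda:1")                           # on GPU 1
next_token_ids = torch.zeros(1, 1, dtype=torch.long, device="cuda:1")  # on GPU 1
sequence_weights = torch.zeros(1, device="cuda:0")                      # on GPU 0
# Raises: RuntimeError: Expected all tensors to be on the same device ...
weights = sequence_weights + torch.gather(logprobs, 1, next_token_ids).squeeze()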
Expected result:
Generation stops at "Action: " without raising a RuntimeError.
Error message:
RuntimeError Traceback (most recent call last)
Cell In[6], line 17
10 react_prompt = """
11 Question: How do you cook a sunny side egg?
12 FORMAT:
13 Strictly use the following format:
14 Thought: [insert thought]
15 Action: [Steps to follow]"""
16 generator = outlines.generate.text(model)
---> 17 output = generator(react_prompt, stop_at="Action: ")
18 print(output)
File ~/anaconda3/envs/ve-m/lib/python3.10/site-packages/outlines/generate/api.py:207, in SequenceGenerator.__call__(self, prompts, max_tokens, stop_at, rng)
205 while True:
206 try:
--> 207 last_state = next(states)
208 if max_tokens or stop_sequences:
209 token_ids = last_state.token_ids
File ~/anaconda3/envs/ve-m/lib/python3.10/site-packages/outlines/generate/generator.py:82, in sequence_generator(model, sampler, fsms, token_ids, sequence_weights, attention_masks, fsm_states, rng)
80 allowed_tokens = get_allowed_tokens(fsms, fsm_states)
81 biased_logits = bias_logits(logits, allowed_tokens)
---> 82 next_token_ids, ancestors, sequence_weights = sampler(
83 biased_logits, sequence_weights, rng
84 )
86 token_ids = update_token_ids(token_ids, next_token_ids, ancestors)
87 attention_masks = update_attention_masks(attention_masks, ancestors)
File ~/anaconda3/envs/ve-m/lib/python3.10/site-packages/outlines/samplers.py:160, in MultinomialSampler.__call__(self, next_token_logits, sequence_weights, rng)
156 logprobs = torch.nn.functional.log_softmax(altered_next_token_logits, dim=-1)
157 ancestors = torch.arange(
158 altered_next_token_logits.shape[0], device=next_token_logits.device
159 )
--> 160 weights = sequence_weights + torch.gather(logprobs, 1, next_token_ids).squeeze()
162 return next_token_ids, ancestors, weights
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
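From the traceback, sequence_weights appears to live on cuda:0 (the default device) while the biased logits, and hence logprobs and next_token_ids, are on cuda:1. A plausible one-line patch in outlines/samplers.py, offered as a sketch rather than a verified fix, would be to move the weights onto the logits' device before the addition:

# MultinomialSampler.__call__ (sketch of a possible fix, not verified):
weights = sequence_weights.to(next_token_logits.device) + torch.gather(
    logprobs, 1, next_token_ids
).squeeze()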
Outlines/Python version information:
python -c "from outlines import _version; print(_version.version)"
0.0.43.dev11+g78852b0
python -c "import sys; print('Python', sys.version)"
Python 3.10.13 (main, Sep 11 2023, 13:21:10) [GCC 11.2.0]
Context for the issue:
I want to be able to specify which GPU the model runs on, so I can use GPUs other than GPU 0 on a shared server.
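In the meantime, a workaround that sidesteps the mismatch is to hide all other GPUs from the process via CUDA_VISIBLE_DEVICES, so the target GPU is remapped to cuda:0 and every tensor lands on the same device. A sketch (it reuses the "llama3" local_dir from the repro above, and the variable must be set before torch initializes CUDA):

import os

# Expose only physical GPU 1 to this process; it then appears as cuda:0,
# so the default device refers to the intended GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # set before importing torch/outlines

import outlines

model = outlines.models.exl2("llama3", device=0)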