dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Accelerate not installed in Docker Image #1106

Open aowen14 opened 3 months ago

aowen14 commented 3 months ago

Describe the issue as clearly as possible:

Not sure if this is a bug or a feature request, but accelerate apparently isn't installed in the Docker image. This means one can either use transformers with no GPU acceleration, or use vLLM, which currently doesn't have feature parity with transformers as far as I can tell (e.g. `generate.json()`).

Running the code outside of the image with the library plus accelerate works. Running `pip install accelerate` in the container also solves the issue, and the marginal download appears to be very small.

Steps/code to reproduce the bug:

# Within an outlines Docker container

from outlines import models

model = models.transformers("microsoft/Phi-3-mini-128k-instruct", device="cuda:0")

Expected result:

The model should load and then be usable with the Outlines SDK, with transformers sending the model to a GPU such as `"cuda:0"`.

Error message:

Traceback (most recent call last):
  File "/outlines/Performance-Benchmarking/outlines_local_examples.py", line 81, in <module>
    model = models.transformers("microsoft/Phi-3-mini-128k-instruct", device="cuda:0")
  File "/usr/local/lib/python3.10/site-packages/outlines/models/transformers.py", line 430, in transformers
    model = model_class.from_pretrained(model_name, **model_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3296, in from_pretrained
    raise ImportError(
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
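
The guard that raises this error can be approximated as follows (a sketch, not the actual transformers source; `require_accelerate` is a hypothetical helper name):

```python
import importlib.util

def require_accelerate() -> None:
    # Sketch of the transformers guard: using `device_map` or
    # `low_cpu_mem_usage=True` needs the accelerate package for weight dispatch.
    if importlib.util.find_spec("accelerate") is None:
        raise ImportError(
            "Using `low_cpu_mem_usage=True` or a `device_map` requires "
            "Accelerate: `pip install accelerate`"
        )

# Inside the Docker image this raises; after `pip install accelerate` it passes.
```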

Outlines/Python version information:

Docker Image Version Hash: 98c8512bd46f

Version information

```
0.1.dev1+g8e94488.d20240816
Python 3.10.14 (main, Aug 13 2024, 02:10:16) [GCC 12.2.0]
aiohappyeyeballs==2.3.6
aiohttp==3.10.3
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
async-timeout==4.0.3
attrs==24.2.0
certifi==2024.7.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cmake==3.30.2
datasets==2.21.0
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
exceptiongroup==1.2.2
fastapi==0.112.1
filelock==3.15.4
frozenlist==1.4.1
fsspec==2024.6.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.24.5
idna==3.7
interegular==0.3.3
Jinja2==3.1.4
jiter==0.5.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
lark==1.2.2
llvmlite==0.43.0
lm-format-enforcer==0.10.1
MarkupSafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
openai==1.41.0
outlines @ file:///outlines
packaging==24.1
pandas==2.2.2
pillow==10.4.0
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
protobuf==5.27.3
psutil==6.0.0
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==17.0.0
pycountry==24.6.1
pydantic==2.8.2
pydantic_core==2.20.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.2
ray==2.34.0
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
rpds-py==0.20.0
safetensors==0.4.4
sentencepiece==0.2.0
six==1.16.0
sniffio==1.3.1
starlette==0.38.2
sympy==1.13.2
tiktoken==0.7.0
tokenizers==0.19.1
torch==2.3.0
torchvision==0.18.0
tqdm==4.66.5
transformers==4.44.0
triton==2.3.0
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
uvicorn==0.30.6
uvloop==0.20.0
vllm==0.5.1
vllm-flash-attn==2.5.9
watchfiles==0.23.0
websockets==12.0
xformers==0.0.26.post1
xxhash==3.4.1
yarl==1.9.4
```

Context for the issue:

I'm trying to write a post on using Outlines with Vast, and Vast needs everything to run from a Docker container. It would be great if users could start their workloads in the container without needing to install accelerate first.

rlouf commented 3 months ago

Thank you, happy to review a PR!

lapp0 commented 3 months ago

`outlines.serve` should support JSON: https://outlines-dev.github.io/outlines/reference/serve/vllm/#querying-endpoint

Additionally, outlines.models.vllm supports json as well. Could you please clarify the issue you ran into when trying this?

aowen14 commented 3 months ago

@rlouf I'd be happy to create a PR for the Docker setup, but first I want to fully answer @lapp0's question, since it bears on why I'd like to use accelerate. I would prefer to use vLLM.

I created a simple Pydantic use case for vLLM, transformers, and serve; below is the code for each and its output. Since running these, I added `SamplingParams` to the vLLM example and it started returning valid JSON, but I expected it to work out of the box, since vLLM has default parameters and Outlines should be constraining generation to the schema(?)

Server Call Code:

import requests
from pydantic import BaseModel

# Define the Book model
class Book(BaseModel):
    title: str
    author: str
    year: int

# Define the request parameters
ip_address = "localhost"
port = "8000"
prompt = "Create a book entry with the fields title, author, and year"
schema = Book.model_json_schema()

# Create the request body
outlines_request = {
    "prompt": prompt,
    "schema": schema
}

print("Prompt: ", prompt)
# Make the API call
response = requests.post(f"http://{ip_address}:{port}/generate/", json=outlines_request)

# Check if the request was successful
if response.status_code == 200:
    result = response.json()
    print("Result:", result["text"])
else:
    print(f"Error: {response.status_code}, {response.text}")
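
For reference, the `schema` field sent in the request body above is the standard Pydantic v2 JSON Schema for `Book`; a quick way to inspect it:

```python
import json

from pydantic import BaseModel

class Book(BaseModel):
    title: str
    author: str
    year: int

# Print the JSON Schema that the request above sends to the /generate/ endpoint.
print(json.dumps(Book.model_json_schema(), indent=2))
```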

Server command: python -m outlines.serve.serve --model="microsoft/Phi-3-mini-128k-instruct" --max-model-len 5000

Output:

Prompt:  Create a book entry with the fields title, author, and year
Result: ['Create a book entry with the fields title, author, and year{ "title": "The Great Gatsby", "author": "F']

VLLM Code:

from outlines import models, generate
from pydantic import BaseModel
from vllm import SamplingParams

class Book(BaseModel):
    title: str
    author: str
    year: int

print("\n\npydantic_vllm_example\n\n")

model = models.vllm("microsoft/Phi-3-mini-128k-instruct", max_model_len=25000)
params = SamplingParams(temperature=0, top_k=-1)

generator = generate.json(model, Book)
prompt = "Create a book entry with the fields title, author, and year"
result = generator(prompt, sampling_params=params)
print("Prompt:",prompt)
print("Result:",result)

Output:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/site-packages/pydantic/main.py", line 1160, in parse_raw
[rank0]:     obj = parse.load_str_bytes(
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/site-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
[rank0]:     return json_loads(b)  # type: ignore
[rank0]:            ^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/json/__init__.py", line 346, in loads
[rank0]:     return _default_decoder.decode(s)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/json/decoder.py", line 337, in decode
[rank0]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/json/decoder.py", line 353, in raw_decode
[rank0]:     obj, end = self.scan_once(s, idx)
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 42 (char 41)

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/lambda1/AlexCode/Performance-Benchmarking/outlines_local_vllm.py", line 27, in <module>
[rank0]:     result = generator(prompt, sampling_params=params)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/site-packages/outlines/generate/api.py", line 511, in __call__
[rank0]:     return format(completions)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/site-packages/outlines/generate/api.py", line 497, in format
[rank0]:     return self.format_sequence(sequences)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/site-packages/outlines/generate/json.py", line 50, in <lambda>
[rank0]:     generator.format_sequence = lambda x: schema_object.parse_raw(x)
[rank0]:                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/lambda1/miniconda3/envs/outlines/lib/python3.11/site-packages/pydantic/main.py", line 1187, in parse_raw
[rank0]:     raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
[rank0]: pydantic_core._pydantic_core.ValidationError: 1 validation error for Book
[rank0]: __root__
[rank0]:   Unterminated string starting at: line 1 column 42 (char 41) [type=value_error.jsondecode, input_value='{ "title": "The Great Gatsby", "author": "F', input_type=str]
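
The underlying failure is plain JSON truncation: generation stopped mid-string, so the text Outlines hands to Pydantic is not parseable. A minimal reproduction with the exact `input_value` string from the traceback:

```python
import json

# The "author" string was never closed, so json.loads fails at the same
# position reported in the traceback above (char 41).
truncated = '{ "title": "The Great Gatsby", "author": "F'
try:
    json.loads(truncated)
except json.JSONDecodeError as exc:
    print(exc)  # Unterminated string starting at: line 1 column 42 (char 41)
```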

Transformers Code:

from outlines import models, generate
from pydantic import BaseModel

class Book(BaseModel):
    title: str
    author: str
    year: int

model = models.transformers("microsoft/Phi-3-mini-128k-instruct", device="cuda:0")
print("\n\npydantic_transformers_example\n\n")
generator = generate.json(model, Book)
prompt = "Create a book entry with the fields title, author, and year"
result = generator(prompt)
print("Prompt:",prompt)
print("Result:",result)

Output:

Prompt: Create a book entry with the fields title, author, and year
Result: title='Invisible Cities' author='Italo Calvino' year=1974
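
Note the repr above: with transformers, `generate.json` returns a parsed `Book` instance (see the `parse_raw` call in the earlier traceback), so the fields are directly usable. For example:

```python
from pydantic import BaseModel

class Book(BaseModel):
    title: str
    author: str
    year: int

# Reconstructing the result shown above; a real run would come from generate.json.
book = Book(title="Invisible Cities", author="Italo Calvino", year=1974)
print(book.title, book.year)   # Invisible Cities 1974
print(book.model_dump_json())
```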

lapp0 commented 3 months ago

I'll look into the bug with json handling in vLLM.

aowen14 commented 1 month ago

Hi, just checking in here. Any updates on when this or the relevant PRs might be finished? I'm mainly asking because it affects a content schedule in which we'd be talking about Outlines. Thanks!

rlouf commented 1 month ago

Have you tried using vLLM's structured output feature in their OpenAI-compatible API? They use outlines under the hood.

aowen14 commented 1 month ago

I plan on getting there at some point soon but was waiting on this. I don't view using Outlines and Outlines via vLLM as mutually exclusive for our purposes, as we were looking to make pieces about both :). I was thinking the original Outlines post would be a good intro to both of them.

Also, I saw the release of Outlines-core, which could be another cool thing to put into the post as well.

I'm happy to go down the path of vLLM for this in the meantime!

rlouf commented 1 month ago

Happy to review a PR that adds accelerate to the image!
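
For anyone picking this up, the change is likely a one-liner; a sketch, assuming the image's Dockerfile installs dependencies with pip (the exact placement in the Dockerfile is hypothetical):

```dockerfile
# Dockerfile sketch: install accelerate alongside the existing dependencies
# so transformers can honor `device` / `device_map` on GPU.
RUN pip install --no-cache-dir accelerate
```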