Outlines slows down inference of AQLM models #748

remiconnesson commented 6 months ago

Describe the issue as clearly as possible:

The reproduction script loads Mixtral quantized with AQLM. It runs on a T4 on Colab here:

When using Outlines to force structured output, generation is noticeably slower than without it.

The JSON example at the end never returns.

Steps/code to reproduce the bug:


# !pip install aqlm[gpu,cpu]
# !pip install git+
# !pip install outlines
# !pip install datasets

from transformers import (

import outlines

MODEL = "ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf"

to_kwargs = lambda **kwargs: kwargs

if "model" not in globals():
    model = outlines.models.transformers(
        MODEL,
        model_kwargs=to_kwargs(trust_remote_code=True, torch_dtype="auto", device_map="cuda"),
    )

prompt = "What is the IP address of the Google DNS servers? "

generator = outlines.generate.text(model)
unstructured = generator(prompt, max_tokens=30)

generator = outlines.generate.regex(
structured = generator(prompt, max_tokens=30)

prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "
answer = outlines.generate.format(model, int)(prompt)

prompt = "sqrt(2)="
generator = outlines.generate.format(model, float)
answer = generator(prompt, max_tokens=10)

import outlines

schema = '''{
    "title": "Character",
    "type": "object",
    "properties": {
        "name": {
            "title": "Name",
            "maxLength": 10,
            "type": "string"
        },
        "age": {
            "title": "Age",
            "type": "integer"
        },
        "armor": {"$ref": "#/definitions/Armor"},
        "weapon": {"$ref": "#/definitions/Weapon"},
        "strength": {
            "title": "Strength",
            "type": "integer"
        }
    },
    "required": ["name", "age", "armor", "weapon", "strength"],
    "definitions": {
        "Armor": {
            "title": "Armor",
            "description": "An enumeration.",
            "enum": ["leather", "chainmail", "plate"],
            "type": "string"
        },
        "Weapon": {
            "title": "Weapon",
            "description": "An enumeration.",
            "enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],
            "type": "string"
        }
    }
}'''

generator = outlines.generate.json(model, schema)
character = generator("Give me a character description")

Expected result:

I've read that
> "Outlines does not slow down inference, but you can incur a small compilation cost at the beginning"
so this is most likely a bug.

The expected result is the same inference time with and without structured generation.
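To make the comparison concrete, here is a minimal timing sketch that separates the first call from later calls, so a one-time startup cost can be told apart from a per-call slowdown. The helper `time_calls` and the stand-in `fake_generator` are hypothetical names introduced for illustration and are not part of Outlines; the stand-in only exists so the sketch runs without a GPU or a model.

```python
import time
from statistics import mean


def time_calls(fn, *args, repeats=3, **kwargs):
    """Call fn `repeats` times; return (first_call_s, mean_later_calls_s)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    first, rest = durations[0], durations[1:]
    # With a single repeat there are no later calls, so report the first one twice.
    return first, (mean(rest) if rest else first)


# Hypothetical stand-in for a generator; replace it with the real ones.
def fake_generator(prompt, max_tokens=30):
    time.sleep(0.01)
    return "x" * max_tokens


first, warm = time_calls(fake_generator, "What is the IP address?", max_tokens=30)
print(f"first call: {first:.3f}s, later calls (mean): {warm:.3f}s")
```

In the reproduction above, the same helper could be called once with the `outlines.generate.text(model)` generator and once with the `outlines.generate.regex(...)` generator, using the same prompt and `max_tokens=30`. If the later-call means still differ widely, the slowdown is not explained by a one-time cost on the first call. Depending on the Outlines version, the compilation cost may be paid when the generator is constructed rather than on the first call, so timing the constructor separately may also be informative.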

Error message:

Outlines/Python version information:

Context for the issue:

AQLM makes it possible to run Mixtral on a 16GB GPU (e.g. free Colab). Being able to force structured output out of Mixtral would be very helpful for individuals and organizations with limited GPU resources.

remiconnesson commented 6 months ago

AQLM support in vLLM is close to ready.

Once it is merged, I'll check whether Mixtral (AQLM) + vLLM + Outlines is fast enough, and if it is, I'll close this issue :)