Coming from a plain `pip install outlines` (it didn't prompt me to install anything else), it took over an hour to generate 512 tokens constrained to the regex `r"```latex(.*?|\n)```"`. The FSM compiled to 100% fairly fast, though. This was a 2B 4-bit model on a 3090, and all 24GB of VRAM were filled during generation (my prompt is about 20 tokens).
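For reference, here is a small sanity check of what that regex accepts. It uses only Python's standard `re` module, not Outlines' FSM, so it only approximates the actual constraint; whether Outlines' regex engine treats `.` and newlines the same way is an assumption. The helper name `accepts` and the sample strings are made up for illustration.

```python
import re

# Build the three-backtick fence programmatically so this snippet can
# itself sit inside a fenced code block.
FENCE = "`" * 3

# Same pattern as in the post: the body is either a run of non-newline
# characters (`.*?`) or exactly one newline (`\n`).
PATTERN = FENCE + r"latex(.*?|\n)" + FENCE


def accepts(text: str) -> bool:
    """Return True if the whole string matches the pattern under `re`."""
    return re.fullmatch(PATTERN, text) is not None


single_line = FENCE + "latex x^2 + y^2 " + FENCE
multi_line = FENCE + "latex\nx^2 + y^2\n" + FENCE
only_newline = FENCE + "latex\n" + FENCE

print(accepts(single_line))   # True
print(accepts(multi_line))    # False: `.` does not match "\n" by default
print(accepts(only_newline))  # True: matched by the `\n` alternative
```

Under `re`, at least, the pattern never accepts a multi-line LaTeX body, which may or may not be what was intended; this does not by itself explain the generation time.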
I had a similar experience in the past, which I "solved" by using HuggingFace's TGI for structured generation. It was a lot faster, which is weird because I thought they used Outlines under the hood.
Should I have gone with an inference engine instead, e.g. `pip install outlines[vllm]`?
Originally posted by @ahmed-moubtahij in https://github.com/dottxt-ai/outlines/issues/1149#issuecomment-2354062820