I'm a heavy user of `outlines.models.Transformers` and use the `stream` function after converting to a regex generator via `outlines.generate.regex`. However, when testing 0.1.0 for the speed improvements, I noticed that `.stream` now runs the whole generation in one go and only then yields the tokens, all at essentially the same instant.
Steps/code to reproduce the bug:
import time
import outlines
import torch
from outlines.samplers import MultinomialSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
pretrained_ckpt = "meta-llama/Llama-3.2-3B"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
outlines_model = outlines.models.Transformers(model=model, tokenizer=tokenizer)
generator = outlines.generate.regex(outlines_model, "(accepted|rejected) the invitation by saying: \"[A-Za-z,.!?\- ']+\"")
streamer = generator.stream("Micheal ", max_tokens=100)
resp = ""
time_start = time.time()
for token in streamer:
    resp += token
    print(token)
    print(time.time() - time_start)
print(resp)
Expected result:
On outlines==0.0.46 we get the following output, with each token printed as soon as it is generated (the number after each token is the elapsed time in seconds).
accepted
0.3291473388671875
the
0.3473653793334961
0.3576169013977051
invitation
0.3677070140838623
by
0.3798065185546875
saying
0.3898754119873047
:
0.4002041816711426
"
0.41028761863708496
Perhaps
0.4254317283630371
it
0.4405210018157959
was
0.45534610748291016
D
0.4698038101196289
olf
0.4842183589935303
who
0.4987914562225342
happened
0.5133378505706787
to
0.5278193950653076
get
0.542212724685669
over
0.5566153526306152
there
0.5710551738739014
,
0.5878012180328369
so
0.6036765575408936
it
0.6190624237060547
may
0.6345548629760742
be
0.6498222351074219
coming
0.6652572154998779
from
0.6807975769042969
K
0.6953606605529785
la
0.709963321685791
as
0.7246041297912598
."
0.7391231060028076
accepted the invitation by saying: "Perhaps it was Dolf who happened to get over there, so it may be coming from Klaas."
Error message:
With outlines version 0.1.0 we get the following output, with all tokens arriving at essentially the same time.
a
0.8557868003845215
cc
0.8558142185211182
ept
0.8558330535888672
ed
0.855849027633667
the
0.8558666706085205
invitation
0.8558826446533203
by
0.8558969497680664
saying
0.855921745300293
:
0.8559391498565674
"
0.8559558391571045
I
0.8559701442718506
'll
0.8559842109680176
bring
0.8559982776641846
some
0.8560121059417725
of
0.8560261726379395
is
0.856043815612793
unique
0.8560581207275391
music
0.856072187423706
style
0.856086015701294
to
0.8560996055603027
the
0.8561127185821533
festival
0.8561265468597412
and
0.8561403751373291
will
0.8561539649963379
be
0.8561675548553467
spending
0.8561809062957764
some
0.8561944961547852
quality
0.8562085628509521
time
0.856226921081543
with
0.8562402725219727
the
0.8562536239624023
R
0.8562667369842529
aja
0.8562805652618408
Indian
0.8562946319580078
Office
0.8563082218170166
members
0.8563218116760254
and
0.856334924697876
introduce
0.8563485145568848
him
0.8563623428344727
to
0.8563754558563232
the
0.8563880920410156
new
0.8564021587371826
culture
0.8564155101776123
and
0.8564286231994629
music
0.8564414978027344
sounds
0.8564550876617432
of
0.8564684391021729
the
0.8564815521240234
Punjab
0.856494665145874
."
0.8565073013305664
0.8565213680267334
accepted the invitation by saying: "I'll bring some of is unique music style to the festival and will be spending some quality time with the Raja Indian Office members and introduce him to the new culture and music sounds of the Punjab."
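To make the difference between the two traces easier to check without reading timestamps by eye, here is a minimal sketch that records when each chunk arrives and flags a stream whose first chunk only shows up near the very end of the run. The helper names `arrival_profile` and `looks_buffered` and the `ratio=0.9` threshold are my own assumptions for illustration; they are not part of Outlines.

```python
import time
from typing import Iterable, List, Tuple


def arrival_profile(chunks: Iterable[str]) -> List[Tuple[str, float]]:
    """Consume an iterator of text chunks and record the seconds elapsed at each arrival."""
    start = time.perf_counter()
    return [(chunk, time.perf_counter() - start) for chunk in chunks]


def looks_buffered(profile: List[Tuple[str, float]], ratio: float = 0.9) -> bool:
    """Heuristic (hypothetical threshold): True if the first chunk arrived after
    at least `ratio` of the total elapsed time, i.e. nothing was yielded until
    generation was nearly finished."""
    if len(profile) < 2:
        return False  # not enough chunks to compare
    first_arrival = profile[0][1]
    last_arrival = profile[-1][1]
    return last_arrival > 0 and first_arrival / last_arrival >= ratio
```

With the reproduction script above, this would be called as `looks_buffered(arrival_profile(generator.stream("Micheal ", max_tokens=100)))`. Using the timings in this report, the first token arrives at roughly 45% of the total time on 0.0.46 and at over 99% on 0.1.0.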
### Outlines/Python version information:
Outlines versions 0.1.0 and 0.0.46
Python version 3.11.9
### Context for the issue:
I understand that the current behaviour is a stand-in while waiting for https://github.com/huggingface/transformers/issues/30810, but as there does not seem to be a timeline for a resolution on that side, is there a way to work around it? Also, is there a reason streaming depends on that issue being resolved? It's not particularly clear.