viktor-ferenczi opened 8 months ago
@lapp0 Frozen FSM, maybe low-hanging fruit. But I don't know the outlines code well enough to see the problem right away. The regex seems to be correct, just a bit long. Still, it should not cause a complete freeze.
Making long expressions efficient is a work in progress. `interegular` isn't quite there for complex expressions.
Related: https://github.com/outlines-dev/outlines/issues/658
Your expression results in a very large FSM. Before we even begin crawling the FSM to create a token index, `interegular.parse_pattern().to_fsm()` takes over 10 minutes to compile, and `FSM.reduce()` takes over 30 minutes. I don't think it's getting into an infinite loop; this is just a very expensive operation.
Detailed profile for the code below:

```python
import interegular

def profile_fsm_construction(pattern_str):
    pat = interegular.parse_pattern(pattern_str)
    fsm = pat.to_fsm()
    reduced = fsm.reduce()

pattern = """(Path: `Shop\.Service/OrderService\.cs`\n\n\n(\n|[^`].*?\n)*\n\n)?(Path: `Shop\.Web/Pages/Order/Archive\.cshtml\.cs`\n\n```cs\n(\n|[^`].*?\n)*```\n\n)?(Path: `Shop\.Web/Controllers/OrderController\.cs`\n\n```cs\n(\n|[^`].*?\n)*```\n\n)?(Path: `Shop\.Web/Controllers/AccountController\.cs`\n\n```cs\n(\n|[^`].*?\n)*```\n\n)?(Path: `Shop\.Data/Enums/OrderBy\.cs`\n\n```cs\n(\n|[^`].*?\n)*```\n\n)?(Path: `Shop\.Data/IOrder\.cs`\n\n```cs\n(\n|[^`].*?\n)*```\n\n)?(New: `(.*?)`\n\n```([a-z]+)\n(\n|[^`].*?\n)*```\n\n)?(New: `(.*?)`\n\n```([a-z]+)\n(\n|[^`].*?\n)*```\n\n)?(New: `(.*?)`\n\n```([a-z]+)\n(\n|[^`].*?\n)*```\n\n)?END\n"""

if __name__ == "__main__":
    import cProfile
    import pstats

    cProfile.run('profile_fsm_construction(pattern)', 'profile_stats')
    p = pstats.Stats('profile_stats')
    p.sort_stats('cumtime').print_stats()
```
For now I recommend trying to simplify the expression. However, in the long term this should be addressed; this isn't the first time it has come up. I'll ponder how the `crawl` function (and the `follow` functions passed to it) could be made more efficient.
This is a blocker for me, so I gave the optimization a try.
Thank you for the test case, that was a highly useful start.
In `fsm.py` of `interegular`:
```python
def final(state):
    """If you're in a final state of the final FSM, it's final"""
    for (i, substate) in state:
        if i == last_index and substate in last.finals:
            return True
    return False
```
Since both `state` and `last.finals` are `frozenset`s, it can be optimized as:
```python
def final(state):
    """If you're in a final state of the final FSM, it's final"""
    if len(state) < len(last.finals):
        for (i, substate) in state:
            if i == last_index and substate in last.finals:
                return True
    else:
        for final_substate in last.finals:
            if (last_index, final_substate) in state:
                return True
    return False
```
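A quick property check confirms the two variants agree. These are hypothetical standalone versions (in interegular, `last_index` and `last.finals` are closure variables, so the signatures below are made up for the test):

```python
import random

def final_loop(state, last_index, last_finals):
    # original version: scan the (index, substate) pairs in `state`
    for i, substate in state:
        if i == last_index and substate in last_finals:
            return True
    return False

def final_sized(state, last_index, last_finals):
    # optimized version: iterate whichever collection is smaller
    if len(state) < len(last_finals):
        return final_loop(state, last_index, last_finals)
    return any((last_index, f) in state for f in last_finals)

random.seed(0)
for _ in range(1000):
    state = frozenset(
        (random.randint(0, 3), random.randint(0, 9))
        for _ in range(random.randint(0, 6))
    )
    last_finals = frozenset(random.sample(range(10), random.randint(0, 10)))
    assert final_loop(state, 2, last_finals) == final_sized(state, 2, last_finals)
print("equivalent")
```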
But this barely makes a difference (<2%). That's because this code scales much worse:
In the `crawl` function there is this line:

```python
j = states.index(next)
```
The number of items in `states` goes way above 1000 even if I simplify my prompt down to 5 concatenated segments. Since `list.index` scans linearly, the loop has O(N²) time complexity, which is clearly not optimal for these kinds of patterns. Also, the code is using exception handling (`ValueError`) for logic, so it constructs a useless traceback whenever an index lookup fails. 🤦‍♂️ Also, `states` is not used at the end, only its length.
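To illustrate the asymptotic difference, here is a synthetic micro-benchmark (not outlines or interegular code; the size is made up):

```python
import timeit

n = 5000
states = list(range(n))
state_idx = {s: i for i, s in enumerate(states)}  # hypothetical value -> index map

# list.index scans from the front: O(N) per lookup, O(N^2) across the crawl
t_list = timeit.timeit(lambda: states.index(n - 1), number=200)

# a dict lookup is O(1) amortized
t_dict = timeit.timeit(lambda: state_idx[n - 1], number=200)

print(t_list > t_dict)
```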
I attempted to optimize the `crawl` function by mapping state values to their indices, but the states mix data types, specifically `set`, `frozenset` and `dict`.
At this point I gave up.
I could use a Lark grammar instead for my purposes, because `interegular` is badly written, but that's not available via the vLLM REST API yet (I looked into `outlines.serve.serve.generate`).
I'm blocked.
There is also this expensive double reversal in `interegular.fsm`:
```python
def reduce(self):
    """
    A result by Brzozowski (1963) shows that a minimal finite state machine
    equivalent to the original can be obtained by reversing the original
    twice.
    """
    return self.reversed().reversed()
```
Do we really need a minimal finite state machine that badly for our purposes?
However, it remains super slow even if I remove the above, so it would not help anyway.
I made `crawl` about 20x faster with:
```python
def crawl(alphabet, initial, final, follow):
    """
    Given the above conditions and instructions, crawl a new unknown FSM,
    mapping its states, final states and transitions. Return the new FSM.

    This is a pretty powerful procedure which could potentially go on
    forever if you supply an evil version of follow().
    """
    def get_hash(obj):
        if isinstance(obj, set):
            return hash(frozenset(obj))
        elif isinstance(obj, dict):
            return hash(tuple(sorted(obj.items())))
        return hash(obj)

    states = [initial]
    state_idx = {get_hash(initial): 0}
    finals = set()
    map = {}

    # iterate over a growing list
    i = 0
    while i < len(states):
        state = states[i]

        # add to finals
        if final(state):
            finals.add(i)

        # compute map for this state
        map[i] = {}
        for transition in alphabet.by_transition:
            try:
                next = follow(state, transition)
                next_hash = get_hash(next)
            except OblivionError:
                # Reached an oblivion state. Don't list it.
                continue
            else:
                try:
                    j = state_idx[next_hash]
                except KeyError:
                    j = len(states)
                    states.append(next)
                    if next_hash not in state_idx:
                        state_idx[next_hash] = j
                map[i][transition] = j

        i += 1

    return FSM(
        alphabet=alphabet,
        states=range(len(states)),
        initial=0,
        finals=finals,
        map=map,
        __no_validation__=True,
    )
```
The bottleneck is the `reduce()` function you referenced. If you implement Hopcroft's algorithm instead, there might be two orders of magnitude of improvement.
I can look into optimizing the subsequent steps as well.
Maybe we could use Python's compiled regex representation to come up with a better FSM ourselves, without having to use `interegular`. Even if it were limited, it might work more stably and efficiently.
Python's `re` compiler parses the regex into a kind of byte-code (see `OPCODES`), which the matching engine then executes.
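For example, the intermediate tree that the compiler lowers to that byte-code can be inspected with `sre_parse` (newer CPython versions expose it as `re._parser` and flag the old module name as deprecated):

```python
try:
    from re import _parser as sre_parse  # CPython 3.11+
except ImportError:
    import sre_parse  # older CPython versions

# Parse a pattern into the intermediate tree the compiler lowers to byte-code.
tree = sre_parse.parse(r"(a|b)+c")
for opcode, args in tree:
    print(opcode)  # MAX_REPEAT, then LITERAL
```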
I think a better approach would be to get the Lark grammar working with the vLLM endpoint (`outlines.serve.serve`), so we can use that for the more complex cases instead of regex.
Please do not use exception handling for logic. Thanks.
```python
j = state_idx.get(next_hash)
if j is None:
    j = len(states)
    states.append(next)
    if next_hash not in state_idx:
        state_idx[next_hash] = j
```
Thank you for the `crawl` optimization. I've just tried it, but it's not nearly enough to get my query working. I need to look for alternatives; best would be to finish adding support for the Lark grammar, so I can skip `interegular` for these kinds of queries.
The `crawl` optimization mainly improves the performance of `to_fsm()`. For `reduce()` we need Hopcroft, I think.
Hoping to get CFG support for vLLM soon: https://github.com/outlines-dev/outlines/pull/676
I'm trying to use this simpler regex in the meantime, but it allows the model to produce wrong output, which is not ideal:
```
(Path: `(.*?)`\n\n`{3}([a-z]+)\n(\n|[^`].*?\n)*`{3}\n\n)+(New: `(.*?)`\n\n`{3}([a-z]+)\n(\n|[^`].*?\n)*`{3}\n\n)*
```
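For example, the relaxed pattern accepts any path at all, since the whitelist of concrete file paths from the original pattern is gone (plain `re` and a made-up sample here, just to show the weaker guarantee):

```python
import re

relaxed = re.compile(
    r"(Path: `(.*?)`\n\n`{3}([a-z]+)\n(\n|[^`].*?\n)*`{3}\n\n)+"
    r"(New: `(.*?)`\n\n`{3}([a-z]+)\n(\n|[^`].*?\n)*`{3}\n\n)*"
)

# A path that the original, stricter pattern would have rejected:
sample = "Path: `Not/A/Real/File.cs`\n\n```cs\nint x = 1;\n```\n\n"
print(bool(relaxed.fullmatch(sample)))  # True: the relaxed pattern accepts it
```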
Describe the issue as clearly as possible:
I observed that the vLLM server gets stuck indefinitely. Running it in the debugger, I could stop where it was frozen. It is running this `while` loop infinitely, because it always appends a new item to `states`, so it can never finish the loop (the list just keeps getting longer):

```python
def crawl(alphabet, initial, final, follow):
    """
    Given the above conditions and instructions, crawl a new unknown FSM,
    mapping its states, final states and transitions. Return the new FSM.

    This is a pretty powerful procedure which could potentially go on
    forever if you supply an evil version of follow().
    """
```
Context for the issue:
I'm just running some heavy code understanding and generation workload through outlines.