fgmacedo / python-statemachine

Python Finite State Machines made easy.
MIT License
890 stars 85 forks source link

State Recursion #484

Open sarah-seidman opened 4 days ago

sarah-seidman commented 4 days ago

Description

I've created a state machine to control an application that runs on an embedded Linux system and advances between states based on time or input from external sensors. It starts in a "startup" state, checks a few conditions, and if all is well, asks the user to complete a test. Based on the result of that action, we either ask for it to be repeated or advance to a different stage, where we wait for something else to happen, etc. This is a great package and has substantially simplified our logic. However, I am having some issues.

This is an excerpt from my code.

    start = (
        # If no adverse conditions are true and it's the right time, go straight to the test
        | startup.to(breath_test, cond="time_for_test", after="do_test")
        | startup.to.itself(internal=True, after="start")
    )

    do_test = (
        | breath_test.to(engine_start, cond="test_passed", after="do_engine_start")
        # If we failed but are not over the test limit, just try again
        | breath_test.to.itself(cond=["test_failed", "time_for_test"], after="do_test")
        | breath_test.to(sleep, cond="accessory_lost")
        # If the result isn't ready, or there was an error, stay here
        | breath_test.to.itself(internal=True, after="do_test")
    )

The code works as-is. That is, it takes the user through all the states as desired. However, my team recently noticed that after about an hour of running the state machine, our embedded device crashes. It seems that the state machine is leaking memory. I tested this by running top and watching the memory used by this process steadily increase over an hour. I also used tracemalloc to check what objects are consuming so much memory and saw that it was from line 34 of events.py in this package. It seems like each time there's a recursion in my code, it adds another Event to the internal queue.

I noticed today that there's a "hint" section in your documentation about recursion, exactly like what's in my code. However, I never saw a RecursionError (although, maybe one did happen and was written to the journal, but I couldn't read it since the device crashed...). The code runs fine, until (I assume based on testing), the device runs out of memory and crashes.

It’s also possible to use an event name as action to chain transitions.

Be careful to not introduce recursion errors, like loop = initial.to.itself(after="loop"), that will raise RecursionError exception.

If not by recursing like this, how should I deal with this case in my code? In the above example, I want to stay in the "breath_test" state until one of the exit conditions is true. I changed my code to this in order to avoid recursion

    do_test = (
        | breath_test.to(engine_start, cond="test_passed", after="do_engine_start")
        | breath_test.to(sleep, cond="accessory_lost")
        # If the result isn't ready, or there was an error, stay here
        | breath_test.to.itself(internal=True)
    )

However, when the logic cascades to the internal transition on the last line, the program just exits.

I see that in the code examples you provide, the state machines generally advance to the next state because an event is called by code external to the state machine. Is that the only correct way to use this library? I run the state machine by instantiating the object and calling the start function. Using recursion (incorrectly, I guess?), the state machine executes indefinitely until the final state is entered.

sm = MyStateMachine()
sm.start()

It is definitely desirable to have the state machine control itself once it starts, and trigger all its own events. Obviously, this causing some problems as I have currently implemented it. Maybe it's not really possible to use the library this way? Is there any way to implement this kind of logic without dangerous recursion?

fgmacedo commented 4 days ago

Hi @sarah-seidman , how are you?

Thanks for reporting this. There's a huge diff in code since I wrote that statement into the docs. And if it's still valid, should be only for non-RTC mode (don't uses a queue internally), that is not the default.

I think that this kind of construct should be supported by the library. But I think that there are few users or not long running provesses, so the problem didn't come up until now. This behavior has low automated test coverage, we can also improve on this aspect.

Can you please provide a snippet that reproduces the error so I can quickly add to the tests and jump in?

sarah-seidman commented 2 days ago

Hi @fgmacedo, thanks for your reply! Here is an example that reproduces the problem.

import tracemalloc
import time
from statemachine import StateMachine, State

tracemalloc.start()

class MyStateMachine(StateMachine):
    startup = State(initial=True)
    test = State()

    before_state = after_state = None

    do_startup = (
        startup.to(test, after="do_test")
    )
    do_test = (
        test.to(startup, after="do_startup")
    )

    def on_enter_state(self):
        self.before_state = tracemalloc.take_snapshot()

    def on_exit_state(self):
        time.sleep(1)
        self.after_state = tracemalloc.take_snapshot()
        diff = self.after_state.compare_to(self.before_state, "lineno")
        for line in diff[:10]:
            if "event" in str(line):
                print(line)

sm = MyStateMachine()
sm.do_startup()

And some sample output running for a few seconds. This is similar to the output I observed from tracing memory usage in my actual code.

sarah@ubuntu:~$ python3 test.py 
/home/sarah/.local/lib/python3.8/site-packages/statemachine/events.py:31: size=840 B (+840 B), count=2 (+2), average=420 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=536 B (+536 B), count=2 (+2), average=268 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=1464 B (+928 B), count=6 (+4), average=244 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event_data.py:69: size=720 B (+592 B), count=4 (+2), average=180 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event_data.py:75: size=0 B (-592 B), count=0 (-2)
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=1872 B (+408 B), count=9 (+3), average=208 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=2232 B (+408 B), count=11 (+3), average=203 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=2592 B (+408 B), count=13 (+3), average=199 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=2952 B (+408 B), count=15 (+3), average=197 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=3264 B (+360 B), count=16 (+2), average=204 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=3672 B (+408 B), count=19 (+3), average=193 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=4032 B (+408 B), count=21 (+3), average=192 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=4344 B (+360 B), count=22 (+2), average=197 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=4752 B (+408 B), count=25 (+3), average=190 B
/home/sarah/.local/lib/python3.8/site-packages/statemachine/event.py:34: size=5112 B (+408 B), count=27 (+3), average=189 B

For the time being, I found a workaround to be able to run my code without too much modification. I removed most of the "afters" from my state transitions and added this code to main. Not the most elegant solution, maybe, but it is working :)

    while True:
        for event in state_machine.events:
            try:
                state_machine.send(event.name)
            except statemachine.exceptions.TransitionNotAllowed:
                pass
fgmacedo commented 2 days ago

Hi @sarah-seidman, just an update here... I was able to reproduce the issue.

I'm researching and trying some approaches but still working on this. It's a memory leak of some objects.

Here's a report of guppy3, so the lib on the "recursion" leaks ~3656 bytes per loop:

Partition of a set of 20 objects. Total size = 3656 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      7  35     1696  46      1696  46 dict (no owner)
     1      7  35     1496  41      3192  87 types.FrameType
     2      1   5      136   4      3328  91 dict of statemachine.event_data.EventData
     3      1   5      112   3      3440  94 dict of statemachine.event_data.TriggerData
     4      1   5       64   2      3504  96 inspect.BoundArguments
     5      1   5       56   2      3560  97 list
     6      1   5       48   1      3608  99 statemachine.event_data.EventData
     7      1   5       48   1      3656 100 statemachine.event_data.TriggerData