DLR-RM / RAFCON

RAFCON (RMC advanced flow control) uses hierarchical state machines, featuring concurrent state execution, to represent robot programs. It ships with a graphical user interface that supports the creation of state machines and provides IDE-like debugging mechanisms. Alternatively, state machines can be generated programmatically using RAFCON's API.
https://dlr-rm.github.io/RAFCON/
Eclipse Public License 1.0

High memory usage / low loading performance #884

Open JohannesErnst opened 7 months ago

JohannesErnst commented 7 months ago

With larger and larger state machines (10k to 1 million states) we are running into notable problems regarding loading-time performance and, more importantly, memory usage. Since the number of states can be expected to grow further for future autonomous tasks, this is important to tackle.

As the RAM of the robots is limited on most systems, consuming multiple GB of memory just for loading a large state machine is problematic. The real issue is that the memory consumption is 100-1000 times higher than the inherent information contained in the loaded .json files. To my current understanding, this is due to the object structure that is set up in Python variables/references at runtime.
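To make this ratio concrete, here is a rough measurement sketch (the ToyState class and its attributes are purely illustrative, not RAFCON's actual state classes) that compares the serialized JSON size with the deep in-memory size of an equivalent Python object graph via pympler:

```python
import json

from pympler import asizeof  # third-party: pip install pympler


class ToyState:
    """Illustrative stand-in for a state object, not RAFCON's actual State class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # back-reference to the parent state
        self.children = []        # nested child states
        self.outcomes = {0: "success", -1: "aborted", -2: "preempted"}
        self.input_data_ports = {}
        self.output_data_ports = {}


root = ToyState("root")
for i in range(1000):
    child = ToyState("state_%d" % i, parent=root)
    root.children.append(child)

# Size of roughly the same information when serialized as JSON ...
as_json = json.dumps({c.name: list(c.outcomes.values()) for c in root.children})
json_bytes = len(as_json.encode("utf-8"))

# ... versus the deep in-memory size of the Python object graph.
object_bytes = asizeof.asizeof(root)

print("JSON size:   %d bytes" % json_bytes)
print("object size: %d bytes  (~%.0fx larger)" % (object_bytes, object_bytes / json_bytes))
```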

General Questions:

Connected to this issue might be the following topics:

We already had the following findings:

JohannesErnst commented 7 months ago

Maybe @franzlst @sebastian-brunner @Rbelder have some useful input on this?

Rbelder commented 7 months ago

I don't have a lot of time right now, but we already dug into this topic about 6 years ago. There should be open and closed issues surrounding it; please search and look through them.

We made observations similar to, and in parts different from, the ones you are making right now, but the code has not changed a lot. E.g. using your local SSD or the network (for core-only usage), or loading only the core versus loading with the GUI, made a big difference even after we had improved the situation. That is why we introduced the depth limit for MVC model generation: the MVC pattern library we use generates a lot of links and list entries, which eats time. For that reason, you generally cannot look inside a library state (i.e. see its inner child states) unless you enable a feature for that. So maybe you ran your test with a state machine that has library states all over its root state.

The memory debris/garbage is something we tackled with some unit tests using gc. Yes, there are some parts that stay around, but of your 2 GB everything beyond roughly 1 MB should be gone. Try to fit your example into the respective unit test and check which objects remain. Anyway, it is of interest whether you can still run all of the tests; maybe a refactoring introduced a bug.
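A minimal sketch of what such a gc-based check could look like, assuming a placeholder callable that loads and destroys a big state machine (the watched class names are only examples, not a fixed list from the test suite):

```python
import gc


def live_instance_counts():
    """Count live objects per class name after a full garbage collection."""
    gc.collect()
    counts = {}
    for obj in gc.get_objects():
        name = type(obj).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


def report_leaked_objects(open_and_destroy_state_machine,
                          watched=("State", "ExecutionState", "HierarchyState", "Transition")):
    """Run one load/destroy cycle and report watched classes that gained instances.

    `open_and_destroy_state_machine` is a placeholder callable that loads a big
    state machine and destroys it again (e.g. via RAFCON's core API); it is not
    defined here.
    """
    before = live_instance_counts()
    open_and_destroy_state_machine()
    after = live_instance_counts()
    for cls in watched:
        delta = after.get(cls, 0) - before.get(cls, 0)
        if delta > 0:
            print("%s: %d instance(s) still alive after destruction" % (cls, delta))
```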

In general, we observed in the past that using 100k or more states in a state machine will not work without some kind of dynamic loading of parts of the state machine. We decided that this is too hot a feature (we made some trials) and, at some point, also bad practice in the use of RAFCON.

Regarding the feature of loading a flat state machine from a single file, I'd refer you to Sebastian (@sebastian-brunner). I think they internally used a kind of plugin, or at least were heading for this capability at Agile. Or it may even have been used at the robotics institute some 4-5 years ago.

There is more to say from past findings ...

JohannesErnst commented 7 months ago

Thank you for the quick response! I have already searched through issues in this repo concerning the problem and inspected the ones listed above. They will definitely give some useful hints on where to look.

> We made observations similar to, and in parts different from, the ones you are making right now, but the code has not changed a lot. E.g. using your local SSD or the network (for core-only usage), or loading only the core versus loading with the GUI, made a big difference even after we had improved the situation.

For me the changes in these configurations were not that significant, although they definitely have some impact. However, as mentioned above, we are not looking for a factor of 10-50% slower/faster, but rather trying to find the source of the significant discrepancy between the actual data stored in the .json files and the Python objects of the states, if that is possible.

> The memory debris/garbage is something we tackled with some unit tests using gc. Yes, there are some parts that stay around, but of your 2 GB everything beyond roughly 1 MB should be gone. Try to fit your example into the respective unit test and check which objects remain. Anyway, it is of interest whether you can still run all of the tests; maybe a refactoring introduced a bug.

I definitely should check this in more detail; after all, the data should be released when the state machine is closed. Regarding the tests, I actually ran all the unit tests (and fixed a lot of deprecations and bugs that appeared during the refactoring) before publishing the new release (2.1.1). I also stumbled across test_memory.py, which ran into some problems (it used more time than the specified timeout) but ultimately passed. Therefore, I don't think the current version is bugged due to refactoring.

> In general, we observed in the past that using 100k or more states in a state machine will not work without some kind of dynamic loading of parts of the state machine. We decided that this is too hot a feature (we made some trials) and, at some point, also bad practice in the use of RAFCON.

This is the real scope of the problem we are discussing at the institute right now. We will probably soon hit the first state machines with 100k states, which underlines the problem again. But it is very valuable insight for me that you came to this conclusion in the past, so thanks again!

> Regarding the feature of loading a flat state machine from a single file, I'd refer you to Sebastian (@sebastian-brunner). I think they internally used a kind of plugin, or at least were heading for this capability at Agile. Or it may even have been used at the robotics institute some 4-5 years ago.

I couldn't find any trace of this in the current version or in any pull request from Agile (as commented above in the respective pull request). Unfortunately, I don't have any more information on this right now. While this would most likely bring some improvement, I don't think it would significantly change anything regarding the memory problem.

Rbelder commented 7 months ago

Somehow I mixed up my memories from the past. We already tackled the loading from file (i.e. SSD versus network) with the issue "Preload libraries and use copy method to provide LibraryStates and respective models". Anyway, this may still create lag if the hash function is used too often. So, as long as you have working garbage tests, maybe the remaining memory usage comes from the feature mentioned above. If you have a big config.yaml that preloads a lot of state machines and holds on to them, RAFCON will run into persistently high memory usage. But I have not dug into the code regarding this right now, or checked how persistently this feature stays active.

You could check this by monitoring your memory consumption over multiple re-openings and destructions of a big state machine within a single RAFCON instance.

Anyway, I would recommend writing a performance test that does these checks automatically. Otherwise, you will not have clear measurements and will waste a lot of time doing them manually.

I know it takes time to write a test, but without one you will not get lasting improvements and will put a lot of time into manual measurements. Most likely you have already started this with your comments above.
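A minimal sketch of such an automated check, with hypothetical open/destroy helpers standing in for the actual RAFCON calls:

```python
import gc
import os

import psutil  # third-party: pip install psutil


def rss_mb():
    """Resident set size of this process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024.0 * 1024.0)


def repeated_open_close(open_big_state_machine, destroy_state_machine, cycles=5):
    """Open and destroy the same big state machine several times and log the RSS.

    The two callables are placeholders for the actual RAFCON calls used by the
    performance test; they are not defined here. A roughly constant "after"
    value across cycles suggests no leak, a steady climb points to retained
    objects.
    """
    baseline = rss_mb()
    print("baseline: %.1f MB" % baseline)
    for i in range(cycles):
        sm = open_big_state_machine()
        peak = rss_mb()
        destroy_state_machine(sm)
        gc.collect()
        after = rss_mb()
        print("cycle %d: loaded %.1f MB, after destruction %.1f MB (+%.1f MB vs baseline)"
              % (i + 1, peak, after, after - baseline))
```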

Rbelder commented 7 months ago
  • state_machine_execution_engine: 761.98 MB (sketchy, why exactly the same? --> Down the nesting it's referring to the same objects)

  • state_machine_manager: 761.98 MB

If you open only one state machine, the state machine manager as well as the execution engine (ExEngine) holds a handle on that one state machine. If you open 2 or 3 state machines, the ExEngine still holds a handle on only one of them, but the state machine manager holds handles on all of them and will therefore appear bigger than the ExEngine. Anyway, if you use this kind of measurement you have to be aware that every state holds a handle to its parent, and following these links, pympler may label an object as bigger than it really is.
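A small pympler sketch of this effect, using toy classes rather than RAFCON's: because each child keeps a back-reference to its parent, asizeof of a single child reaches the whole tree through that reference and reports far more than the child's own data:

```python
from pympler import asizeof  # third-party: pip install pympler


class Node:
    """Toy stand-in for a state that keeps a back-reference to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


root = Node("root")
for i in range(1000):
    child = Node("child_%d" % i, parent=root)
    root.children.append(child)

# asizeof follows all references, so a single child "contains" the whole tree
# through its parent pointer and is reported almost as big as the tree itself.
print("whole tree:          %d bytes" % asizeof.asizeof(root))
print("one child w/ parent: %d bytes" % asizeof.asizeof(root.children[0]))
print("one detached node:   %d bytes" % asizeof.asizeof(Node("detached")))
```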

JohannesErnst commented 6 months ago

> So, as long as you have working garbage tests, maybe the remaining memory usage comes from the feature mentioned above. [...] You could check this by monitoring your memory consumption over multiple re-openings and destructions of a big state machine within a single RAFCON instance.

Thanks, I will definitely look into that.

> I know it takes time to write a test, but without one you will not get lasting improvements and will put a lot of time into manual measurements. Most likely you have already started this with your comments above.

Yes, I agree that a proper test is needed when working on this problem. For now, I was still trying to figure out whether the problem is some kind of bug, and was mostly looking through the code for something that might hint at it. It is still strange to me that, although the state machine info in the .json files is very small, the Python objects are so large (by orders of magnitude!). Is this expected for you, or do you also think the variables holding the state machines should be smaller? Like you, I am fairly sure it's not just some simple bug when setting up the state machine. But yes, I should introduce a test as a baseline before continuing.

> Anyway, if you use this kind of measurement you have to be aware that every state holds a handle to its parent, and following these links, pympler may label an object as bigger than it really is.

This experiment was more about figuring out whether I could find a single variable that holds significant memory. To my current understanding, however, it is rather that a single state machine (i.e. just an execution state) is already far too big when loaded as a Python object. This then leads to the high memory usage when 100k+ states are used (as it scales roughly linearly with the number of states).

Personally, I would first try to tackle this memory issue and work on the loading times afterwards (as they might be connected anyway). In any case, high loading times are not as big a problem for us right now as the general memory consumption.

sebastian-brunner commented 6 months ago

Thx for the investigation @JohannesErnst

Concerning loading times:

Concerning memory consumption during execution: if you disable the execution history, the memory won't build up anymore in the latest versions. You can also opt to only write the history to a file (FILE_SYSTEM_EXECUTION_HISTORY_ENABLE: false).

JohannesErnst commented 6 months ago

@sebastian-brunner thanks for the reply and infos!

Regarding the configuration, I think we already use almost optimal settings (except for NO_PROGRAMMATIC_CHANGE_OF_LIBRARY_STATES_PERFORMED, as we use template state machines).

Maybe we will decide to pursue the idea of dynamic loading, as it is the most promising approach for fundamental improvements right now. But it will also consume quite some time, so we will have to decide how urgent it is. Anyway, I will consider all three of your suggestions, thanks again!