DLR-RM / RAFCON

RAFCON (RMC advanced flow control) uses hierarchical state machines, featuring concurrent state execution, to represent robot programs. It ships with a graphical user interface that supports the creation of state machines and provides IDE-like debugging mechanisms. Alternatively, state machines can be generated programmatically using RAFCON's API.
https://dlr-rm.github.io/RAFCON/
Eclipse Public License 1.0

High memory usage / low loading performance #884

Open JohannesErnst opened 7 months ago

JohannesErnst commented 7 months ago

With larger and larger state machines (10k to 1 million states) we are running into notable problems regarding loading-time performance and, more importantly, memory usage. Since the number of states can be expected to grow further for future autonomous tasks, this is important to tackle.

As the RAM of the robots is limited on most systems, consuming multiple GB of memory just for loading a large state machine is problematic. The real issue is that the memory consumption is 100-1000 times higher than the inherent information contained in the loaded .json files. To my current understanding, this is due to the object structure that is set up in Python variables/references at runtime.
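To make this ratio concrete, here is a rough measurement sketch (the ToyState class and its attributes are purely illustrative, not RAFCON's actual state classes) that compares the serialized JSON size with the deep in-memory size of an equivalent Python object graph via pympler:

```python
import json

from pympler import asizeof  # third-party: pip install pympler


class ToyState:
    """Illustrative stand-in for a state object, not RAFCON's actual State class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # back-reference to the parent state
        self.children = []        # nested child states
        self.outcomes = {0: "success", -1: "aborted", -2: "preempted"}
        self.input_data_ports = {}
        self.output_data_ports = {}


root = ToyState("root")
for i in range(1000):
    child = ToyState("state_%d" % i, parent=root)
    root.children.append(child)

# Size of roughly the same information when serialized as JSON ...
as_json = json.dumps({c.name: list(c.outcomes.values()) for c in root.children})
json_bytes = len(as_json.encode("utf-8"))

# ... versus the deep in-memory size of the Python object graph.
object_bytes = asizeof.asizeof(root)

print("JSON size:   %d bytes" % json_bytes)
print("object size: %d bytes  (~%.0fx larger)" % (object_bytes, object_bytes / json_bytes))
```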

General Questions:

Connected to this issue might be the following topics:

We already had the following findings:

JohannesErnst commented 7 months ago

Maybe @franzlst @sebastian-brunner @Rbelder have some useful input on this?

Rbelder commented 7 months ago

I don't have a lot of time right now, but we already dug into this topic about 6 years ago. There should be open and closed issues surrounding it; please search and look through them.

We made observations similar to, and in parts different from, the ones you are making right now, but the code has not changed a lot. E.g. using your local SSD or the network (for core-only usage), or loading only the core versus loading with the GUI, made a big difference even after we had improved the situation. That is why we introduced the depth limit for MVC model generation: the MVC pattern library we use generates a lot of links and list entries, which eats time. For that reason, you generally cannot look inside a library state (i.e. see its inner child states) unless you enable a feature for that. So maybe you ran your test with a state machine that has library states all over its root state.

The memory debris/garbage is something we tackled with some unit tests using gc. Yes, there are some parts that stay around, but of your 2 GB everything beyond roughly 1 MB should be gone. Try to fit your example into the respective unit test and check which objects remain. Anyway, it is of interest whether you can still run all of the tests; maybe a refactoring introduced a bug.
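A minimal sketch of what such a gc-based check could look like, assuming a placeholder callable that loads and destroys a big state machine (the watched class names are only examples, not a fixed list from the test suite):

```python
import gc


def live_instance_counts():
    """Count live objects per class name after a full garbage collection."""
    gc.collect()
    counts = {}
    for obj in gc.get_objects():
        name = type(obj).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


def report_leaked_objects(open_and_destroy_state_machine,
                          watched=("State", "ExecutionState", "HierarchyState", "Transition")):
    """Run one load/destroy cycle and report watched classes that gained instances.

    `open_and_destroy_state_machine` is a placeholder callable that loads a big
    state machine and destroys it again (e.g. via RAFCON's core API); it is not
    defined here.
    """
    before = live_instance_counts()
    open_and_destroy_state_machine()
    after = live_instance_counts()
    for cls in watched:
        delta = after.get(cls, 0) - before.get(cls, 0)
        if delta > 0:
            print("%s: %d instance(s) still alive after destruction" % (cls, delta))
```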

In general, we observed in the past that using 100k or more states in a state machine will not work without some kind of dynamic loading of parts of the state machine. We decided that this is too hot a feature (we made some trials) and, at some point, also bad practice in the use of RAFCON.

Regarding the feature of loading a flat state machine from a single file, I'd refer you to Sebastian (@sebastian-brunner). I think they internally used a kind of plugin, or at least were heading for this capability at Agile. Or it may even have been used at the robotics institute some 4-5 years ago.

There is more to say from past findings ...

JohannesErnst commented 7 months ago

Thank you for the quick response! I have already searched through issues in this repo concerning the problem and inspected the ones listed above. They will definitely give some useful hints on where to look.

> We made observations similar to, and in parts different from, the ones you are making right now, but the code has not changed a lot. E.g. using your local SSD or the network (for core-only usage), or loading only the core versus loading with the GUI, made a big difference even after we had improved the situation.

For me the changes in these configurations were not that significant, although they definitely have some impact. However, as mentioned above, we are not looking for a factor of 10-50% slower/faster, but rather trying to find the source of the significant discrepancy between the actual data stored in the .json files and the Python objects of the states, if that is possible.

> The memory debris/garbage is something we tackled with some unit tests using gc. Yes, there are some parts that stay around, but of your 2 GB everything beyond roughly 1 MB should be gone. Try to fit your example into the respective unit test and check which objects remain. Anyway, it is of interest whether you can still run all of the tests; maybe a refactoring introduced a bug.

I definitely should check this in more detail; after all, the data should be released when the state machine is closed. Regarding the tests, I actually ran all the unit tests (and fixed a lot of deprecations and bugs that appeared during the refactoring) before publishing the new release (2.1.1). I also stumbled across test_memory.py, which ran into some problems (it used more time than the specified timeout) but ultimately passed. Therefore, I don't think the current version is bugged due to refactoring.

> In general, we observed in the past that using 100k or more states in a state machine will not work without some kind of dynamic loading of parts of the state machine. We decided that this is too hot a feature (we made some trials) and, at some point, also bad practice in the use of RAFCON.

This is the real scope of the problem we are discussing at the institute right now. We will probably soon hit the first state machines with 100k states, which underlines the problem again. But it is very valuable insight for me that you came to this conclusion in the past, so thanks again!

> Regarding the feature of loading a flat state machine from a single file, I'd refer you to Sebastian (@sebastian-brunner). I think they internally used a kind of plugin, or at least were heading for this capability at Agile. Or it may even have been used at the robotics institute some 4-5 years ago.

I couldn't find any trace of this in the current version or in any pull request from Agile (as commented above in the respective pull request). Unfortunately, I don't have any more information on this right now. While this would most likely bring some improvement, I don't think it would significantly change anything regarding the memory problem.

Rbelder commented 7 months ago

Somehow I mixed up my memories from the past. We already tackled the loading from file (i.e. SSD versus network) with the issue "Preload libraries and use copy method to provide LibraryStates and respective models". Anyway, this may still create lag if the hash function is used too often. So, as long as you have working garbage tests, maybe the remaining memory usage comes from the feature mentioned above. If you have a big config.yaml that preloads a lot of state machines and holds on to them, RAFCON will run into persistently high memory usage. But I have not dug into the code regarding this right now, or checked how persistently this feature stays active.

You could check this by monitoring your memory consumption over multiple re-openings and destructions of a big state machine within a single RAFCON instance.

Anyway, I would recommend writing a performance test that does these checks automatically. Otherwise, you will not have clear measurements and will waste a lot of time doing them manually.

I know it takes time to write a test, but without one you will not get lasting improvements and will put a lot of time into manual measurements. Most likely you have already started this with your comments above.
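A minimal sketch of such an automated check, with hypothetical open/destroy helpers standing in for the actual RAFCON calls:

```python
import gc
import os

import psutil  # third-party: pip install psutil


def rss_mb():
    """Resident set size of this process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024.0 * 1024.0)


def repeated_open_close(open_big_state_machine, destroy_state_machine, cycles=5):
    """Open and destroy the same big state machine several times and log the RSS.

    The two callables are placeholders for the actual RAFCON calls used by the
    performance test; they are not defined here. A roughly constant "after"
    value across cycles suggests no leak, a steady climb points to retained
    objects.
    """
    baseline = rss_mb()
    print("baseline: %.1f MB" % baseline)
    for i in range(cycles):
        sm = open_big_state_machine()
        peak = rss_mb()
        destroy_state_machine(sm)
        gc.collect()
        after = rss_mb()
        print("cycle %d: loaded %.1f MB, after destruction %.1f MB (+%.1f MB vs baseline)"
              % (i + 1, peak, after, after - baseline))
```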

Rbelder commented 7 months ago
  • state_machine_execution_engine: 761.98 MB (sketchy, why exactly the same? --> Down the nesting it's referring to the same objects)

  • state_machine_manager: 761.98 MB

If you open only one state machine, the state machine manager as well as the execution engine (ExEngine) holds a handle on that one state machine. If you open 2 or 3 state machines, the ExEngine still holds a handle on only one of them, but the state machine manager holds handles on all of them and will therefore appear bigger than the ExEngine. Anyway, if you use this kind of measurement you have to be aware that every state holds a handle to its parent, and following these links, pympler may label an object as bigger than it really is.
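A small pympler sketch of this effect, using toy classes rather than RAFCON's: because each child keeps a back-reference to its parent, asizeof of a single child reaches the whole tree through that reference and reports far more than the child's own data:

```python
from pympler import asizeof  # third-party: pip install pympler


class Node:
    """Toy stand-in for a state that keeps a back-reference to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


root = Node("root")
for i in range(1000):
    child = Node("child_%d" % i, parent=root)
    root.children.append(child)

# asizeof follows all references, so a single child "contains" the whole tree
# through its parent pointer and is reported almost as big as the tree itself.
print("whole tree:          %d bytes" % asizeof.asizeof(root))
print("one child w/ parent: %d bytes" % asizeof.asizeof(root.children[0]))
print("one detached node:   %d bytes" % asizeof.asizeof(Node("detached")))
```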

JohannesErnst commented 6 months ago

> So, as long as you have working garbage tests, maybe the remaining memory usage comes from the feature mentioned above. [...] You could check this by monitoring your memory consumption over multiple re-openings and destructions of a big state machine within a single RAFCON instance.

Thanks, I will definitely look into that.

> I know it takes time to write a test, but without one you will not get lasting improvements and will put a lot of time into manual measurements. Most likely you have already started this with your comments above.

Yes, I agree that a proper test is needed when working on this problem. For now, I was still trying to figure out whether the problem is some kind of bug, and was mostly looking through the code for something that might hint at it. It is still strange to me that, although the state machine info in the .json files is very small, the Python objects are so large (by orders of magnitude!). Is this expected for you, or do you also think the variables holding the state machines should be smaller? Like you, I am fairly sure it's not just some simple bug when setting up the state machine. But yes, I should introduce a test as a baseline before continuing.

> Anyway, if you use this kind of measurement you have to be aware that every state holds a handle to its parent, and following these links, pympler may label an object as bigger than it really is.

This experiment was more about figuring out whether I could find a single variable that holds significant memory. To my current understanding, however, it is rather that a single state machine (i.e. just an execution state) is already far too big when loaded as a Python object. This then leads to the high memory usage when 100k+ states are used (as it scales roughly linearly with the number of states).

Personally, I would first try to tackle this memory issue and work on the loading times afterwards (as they might be connected anyway). In any case, high loading times are not as big a problem for us right now as the general memory consumption.

sebastian-brunner commented 6 months ago

Thx for the investigation @JohannesErnst

Concerning loading times:

Concerning memory consumption during execution: if you disable the execution history, the memory won't build up anymore in the latest versions. You can also opt to only write the history to a file (FILE_SYSTEM_EXECUTION_HISTORY_ENABLE: false).

JohannesErnst commented 6 months ago

@sebastian-brunner thanks for the reply and infos!

Regarding the configuration, I think we already use almost optimal settings (except for NO_PROGRAMMATIC_CHANGE_OF_LIBRARY_STATES_PERFORMED, as we use template state machines).

Maybe we will decide to pursue the idea of dynamic loading, as it is the most promising approach for fundamental improvements right now. But it will also consume quite some time, so we will have to decide how urgent it is. Anyway, I will consider all three of your suggestions, thanks again!