Zannick / logic-graph

Tools for video game logic representation and analysis, particularly routing and beatability checks for speedruns and randomizers.
MIT License

Address memory usage problems (again) #96

Closed Zannick closed 5 months ago

Zannick commented 11 months ago

Once again we seem to be running out of memory. After reducing the size of the statedb caches from 10+2 GiB to 5+1 GiB (the committed values), the OOM seems to happen much faster.

Likely culprit is rocksdb as usual.

Zannick commented 11 months ago

It is somewhat hard to tell what the culprit is; running with the bytehound allocator doesn't result in the excessive memory usage.

Possible mitigations:

Zannick commented 11 months ago

I've managed to see excessive memory usage with bytehound, but I have yet to see it happen with bytehound producing a file I can actually examine or strip in whole.

Zannick commented 11 months ago

I found a problematic allocation in get_history_raw as called by recreate_store; it appears that a thread got stuck in a loop pushing items into a vector, eventually allocating 6 GiB in one go.

[Screenshot 2023-07-13 204435]

If this is true, then somewhere we have two states pointing to each other. I imagine this might also be behind #95, as this could be the greedy thread itself getting stuck immediately, never returning from extract_solutions in order to increment the counter. However, I am pretty sure the initial state is getting recorded by virtue of the initial state being pushed into the queue.
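
To illustrate the suspected failure mode, here is a minimal sketch of a history walk that follows predecessor links and reports a cycle instead of growing a vector without bound. It is not the actual get_history_raw code: `StateId`, `walk_history`, and the `prev` map are hypothetical stand-ins for whatever the statedb stores.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a state key in the statedb.
type StateId = u32;

/// Walks predecessor links from `start` back toward the initial state.
/// Returns Ok(path) in initial-to-start order if the chain terminates,
/// or Err(id) naming the first state reached a second time, which means
/// the links form a cycle (e.g. two states pointing to each other).
fn walk_history(
    prev: &HashMap<StateId, StateId>,
    start: StateId,
) -> Result<Vec<StateId>, StateId> {
    let mut seen = HashSet::new();
    let mut path = Vec::new();
    let mut cur = start;
    loop {
        // `insert` returns false if `cur` was already visited.
        if !seen.insert(cur) {
            return Err(cur);
        }
        path.push(cur);
        match prev.get(&cur) {
            Some(&p) => cur = p,
            None => break, // no predecessor: assumed to be the initial state
        }
    }
    path.reverse();
    Ok(path)
}
```

With links 2 -> 1 -> 0 the walk returns the path `[0, 1, 2]`; if state 0 also points back to 1, the walk stops and returns `Err(1)` rather than looping. The visited set costs memory proportional to the history length, so a plain step cap may be the cheaper guard in practice.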

Zannick commented 11 months ago

I have not seen anything recently; my last run of the program reached a runtime of over a week before I restarted it with more recent changes.

Zannick commented 8 months ago

This has occurred again, but thanks to #100 the program immediately exited so I still have the error and stack trace:

Eliding library internals:

Raw history found in statedb way too long, possible loop. Last 24:
[[A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)]]
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: analyzer::db::HeapDB<W,T>::get_history_raw
   3: analyzer::db::HeapDB<W,T>::get_history
             at /home/bswolf/logic-graph/analyzer/src/db.rs:1210:9
   4: analyzer::algo::Search<W,T>::handle_solution
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:396:39
   5: analyzer::algo::Search<W,T>::extract_solutions::{{closure}}
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:466:21
[...]
  13: analyzer::algo::Search<W,T>::extract_solutions
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:474:14
  14: analyzer::algo::Search<W,T>::recreate_store
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:547:20
  15: analyzer::algo::Search<W,T>::handle_solution
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:448:13
  16: analyzer::algo::Search<W,T>::extract_solutions::{{closure}}
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:466:21
[...]
  24: analyzer::algo::Search<W,T>::extract_solutions
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:474:14
  25: analyzer::algo::Search<W,T>::search::{{closure}}::{{closure}}
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:716:50
[...]
  29: analyzer::heap::RocksBackedQueue<W,T>::extend_groups
             at /home/bswolf/logic-graph/analyzer/src/heap.rs:1064:26
  30: analyzer::algo::Search<W,T>::search::{{closure}}
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:715:45
  31: analyzer::algo::Search<W,T>::search::{{closure}}::{{closure}}::{{closure}}
             at /home/bswolf/logic-graph/analyzer/src/algo.rs:761:40

This assert means that no single history entry went over, but we collected more than 1024 recorded steps, each of which was one of the two reversible global actions.
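
The kind of guard described here can be sketched as follows. This is an assumption-laden illustration, not the code in db.rs: `collect_bounded`, `MAX_STEPS`, and `TAIL` are hypothetical names, with the values 1024 and 24 taken from the comment and error message above, and steps are modeled as plain strings.

```rust
/// Assumed cap on recorded steps, from the "more than 1024" figure above.
const MAX_STEPS: usize = 1024;
/// Number of trailing steps to report, matching "Last 24" in the error.
const TAIL: usize = 24;

/// Collects history steps, giving up once more than MAX_STEPS have been
/// gathered. On failure, returns the last TAIL steps so the caller can
/// print them (or panic with them) to show the suspected loop.
fn collect_bounded<I>(steps: I) -> Result<Vec<String>, Vec<String>>
where
    I: Iterator<Item = String>,
{
    let mut out = Vec::new();
    for step in steps {
        out.push(step);
        if out.len() > MAX_STEPS {
            // Keep only the tail for the error report.
            let tail = out.split_off(out.len() - TAIL);
            return Err(tail);
        }
    }
    Ok(out)
}
```

Fed an endless alternation of the two drone actions, this returns an error carrying 24 steps that alternate in the same way as the log above, instead of allocating until the process runs out of memory.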

Zannick commented 5 months ago

Closing as it hasn't happened in a while and we've addressed other memory troubles.