Closed: Zannick closed this 5 months ago
It is somewhat hard to tell what the culprit is: running with the bytehound allocator doesn't result in the excessive memory usage.
Possible mitigations:
- `max_open_files`
- (and unsetting `cache_index_and_filter_blocks`?)

I've managed to see excessive memory usage with bytehound, but I have yet to see it happen with bytehound producing a file I can actually examine or strip in whole.
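For reference, a minimal sketch of how those two options could be set with the Rust `rocksdb` crate. The option names are real crate APIs; the values, path, and surrounding setup are assumptions for illustration, not the project's actual configuration:

```rust
use rocksdb::{BlockBasedOptions, Options, DB};

// Hypothetical helper; values are placeholders, not the project's settings.
fn open_statedb(path: &str) -> Result<DB, rocksdb::Error> {
    let mut block_opts = BlockBasedOptions::default();
    // Keep index/filter blocks out of the block cache, so their memory is
    // bounded by the number of open files instead of the cache budget.
    block_opts.set_cache_index_and_filter_blocks(false);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Cap the number of open SST files; -1 (the default) means unlimited.
    opts.set_max_open_files(256);
    opts.set_block_based_table_factory(&block_opts);

    DB::open(&opts, path)
}
```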
I found a problematic allocation in `get_history_raw` as called by `recreate_store`; it appears that a thread has gotten stuck in a loop pushing items into a vector, eventually allocating 6 GiB in one go.
If this is true, then somewhere we have two states pointing to each other. I imagine this might also be behind #95, as this could be the greedy thread itself getting stuck immediately, never returning from `extract_solutions` in order to increment the counter. However, I am pretty sure the initial state is getting recorded, by virtue of the initial state being `push`ed into the queue.
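The two-states-pointing-at-each-other failure mode can be sketched as follows. This is hypothetical code, not the project's actual `get_history_raw`: it walks assumed parent pointers and bails out when a state repeats, instead of pushing into the vector forever.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical sketch: walk a chain of (state, parent) records the way a
// history reconstruction might, detecting a loop (e.g. two states that
// point at each other) via a visited set.
fn walk_history(parents: &[(u32, u32)], mut state: u32) -> Result<Vec<u32>, String> {
    let map: HashMap<u32, u32> = parents.iter().copied().collect();
    let mut seen = HashSet::new();
    let mut path = Vec::new();
    while let Some(&prev) = map.get(&state) {
        // A revisited state means the history chain loops.
        if !seen.insert(state) {
            return Err(format!("cycle detected at state {state}"));
        }
        path.push(state);
        state = prev;
    }
    path.push(state); // the root has no recorded parent
    Ok(path)
}
```

With `parents = [(2, 1), (1, 2)]` the walk would otherwise bounce between the two states indefinitely, which matches the Deploy/Recall pattern in the dump below.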
I have not seen anything recently; my last run of the program reached a runtime of over a week before I restarted it with more recent changes.
This has occurred again, but thanks to #100 the program immediately exited so I still have the error and stack trace:
Eliding library internals:
Raw history found in statedb way too long, possible loop. Last 24:
[[A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)], [A(Global__Deploy_Drone)], [A(Global__Recall_Drone)]]
stack backtrace:
0: rust_begin_unwind
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
1: core::panicking::panic_fmt
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
2: analyzer::db::HeapDB<W,T>::get_history_raw
3: analyzer::db::HeapDB<W,T>::get_history
at /home/bswolf/logic-graph/analyzer/src/db.rs:1210:9
4: analyzer::algo::Search<W,T>::handle_solution
at /home/bswolf/logic-graph/analyzer/src/algo.rs:396:39
5: analyzer::algo::Search<W,T>::extract_solutions::{{closure}}
at /home/bswolf/logic-graph/analyzer/src/algo.rs:466:21
[...]
13: analyzer::algo::Search<W,T>::extract_solutions
at /home/bswolf/logic-graph/analyzer/src/algo.rs:474:14
14: analyzer::algo::Search<W,T>::recreate_store
at /home/bswolf/logic-graph/analyzer/src/algo.rs:547:20
15: analyzer::algo::Search<W,T>::handle_solution
at /home/bswolf/logic-graph/analyzer/src/algo.rs:448:13
16: analyzer::algo::Search<W,T>::extract_solutions::{{closure}}
at /home/bswolf/logic-graph/analyzer/src/algo.rs:466:21
[...]
24: analyzer::algo::Search<W,T>::extract_solutions
at /home/bswolf/logic-graph/analyzer/src/algo.rs:474:14
25: analyzer::algo::Search<W,T>::search::{{closure}}::{{closure}}
at /home/bswolf/logic-graph/analyzer/src/algo.rs:716:50
[...]
29: analyzer::heap::RocksBackedQueue<W,T>::extend_groups
at /home/bswolf/logic-graph/analyzer/src/heap.rs:1064:26
30: analyzer::algo::Search<W,T>::search::{{closure}}
at /home/bswolf/logic-graph/analyzer/src/algo.rs:715:45
31: analyzer::algo::Search<W,T>::search::{{closure}}::{{closure}}::{{closure}}
at /home/bswolf/logic-graph/analyzer/src/algo.rs:761:40
This assert means that no single history entry went over, but we collected more than 1024 recorded steps in total, each of which was one of the two reversible global actions.
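A hypothetical reconstruction of that guard, assuming the 1024-step cap mentioned above (the constant, function name, and message formatting are guesses, not the project's actual code):

```rust
// Assumed cap: no single entry is oversized, but the total number of
// collected steps can still exceed it, producing the panic shown above.
const MAX_HISTORY: usize = 1024;

fn check_history_len(steps: &[&str]) {
    assert!(
        steps.len() <= MAX_HISTORY,
        "Raw history found in statedb way too long, possible loop. Last 24: {:?}",
        &steps[steps.len().saturating_sub(24)..]
    );
}
```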
Closing as it hasn't happened in a while and we've addressed other memory troubles.
Once again we seem to be running out of memory, and after reducing the size of the statedb caches from 10+2 GiB to 5+1 GiB (the committed values), the OOM seems to happen much faster.
The likely culprit is rocksdb, as usual.
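For comparison, a sketch of how cache sizes like the 5+1 GiB split above might be expressed with the Rust `rocksdb` crate. The assignment of 5 GiB to the block cache and 1 GiB to write buffers is an assumption on my part, as is the helper itself:

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

// Hypothetical sizing helper; the split between block cache and write
// buffers is assumed, not taken from the project's committed config.
fn statedb_options(block_cache_bytes: usize, write_buffer_bytes: usize) -> Options {
    let mut block_opts = BlockBasedOptions::default();
    // A shared LRU cache caps block-cache memory at the given size.
    let cache = Cache::new_lru_cache(block_cache_bytes);
    block_opts.set_block_cache(&cache);

    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    // Memtable (write buffer) budget per column family.
    opts.set_write_buffer_size(write_buffer_bytes);
    opts
}
```

Note that even with these caps, rocksdb's total footprint also includes index/filter blocks and per-file overhead, so the process can still exceed the configured cache sizes.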