dusk-network / rusk

The reference Dusk platform implementation and tools
Mozilla Public License 2.0

Memory increases unexpectedly #588

Closed: herr-seppia closed this issue 2 years ago

herr-seppia commented 2 years ago

Describe the bug Memory usage during transaction creation grows with block height. This results in an OOM (initially thought to occur only at high block height, but later found to be present even at lower block heights).

To Reproduce Run a local cluster with no rate limiter and broadcast txs

Logs/Screenshot

2022-02-21T09:38:34.517933Z  INFO rusk::services::state: Received GetNotesOwnedBy request
(after this the memory is dropped correctly)

2022-02-21T09:38:59.102467Z  INFO rusk::services::state: Received FindExistingNullifiers request
2022-02-21T09:38:59.145421Z  INFO rusk::services::state: Received GetOpening request
2022-02-21T09:38:59.187720Z  INFO rusk::services::state: Received GetAnchor request
2022-02-21T09:38:59.228901Z  INFO rusk::services::prover: Received prove_execute request
(after this process has been killed)

Platform Running both rusk and dusk-blockchain with branch wip-vm11

Additional context Block height was 26761. Rusk was killed by the OOM killer on a VPS with 1GB of memory. Calling getBalance multiple times at that block height doesn't kill the process (the memory grows and comes back to the initial level after the operation has been performed).

Update from "OOM when sending tx at high block height" [TBD]

ureeves commented 2 years ago

You're hinting at the fact that the wallet is not the cause; could you hint at what you do think is the cause? From the information I have here it looks like the prove_execute RPC (we should really change the text there) tries to load something into memory - perhaps a prover key - that is too large for the environment. The largest prover keys are around 0.52GB, so loading one of them would consume around 50% of the memory available on the given VPS. Maybe the "solution" here is to just give it more memory? What do you think @herr-seppia?

herr-seppia commented 2 years ago

The problem seems not to be related to the prover key itself: the prover key is the cause of the OOM, but not the cause of the leak.

Just launched a cluster (still with wip-vm11) with no rate limit at all, and the memory is continually increasing even with no txs sent.

Seems to me that the cause is related to microkelvin

herr-seppia commented 2 years ago

I launched a cluster with 10 nodes, and I'm starting to think it's something related to gRPC.

Node 0 is the BlockGenerator (so dusk-blockchain calls EST multiple times, as it's the only node in the consensus).
Nodes 1-7 are passive nodes (they only receive accepted blocks and call no EST at all).
Node 8 has no dusk-blockchain connected (no gRPC is served at all; it acts just as a network router).
Node 9 is contacted by the Explorer and the Wallet (same as nodes 1-7, but with additional clients).

Node                 %MEM
wip20220224-node-0    9.9
wip20220224-node-1    7.4
wip20220224-node-2    7.4
wip20220224-node-3    7.4
wip20220224-node-4    7.7
wip20220224-node-5    7.3
wip20220224-node-6    7.4
wip20220224-node-7    7.4
wip20220224-node-8    2.5
wip20220224-node-9   36.0

After restarting and resyncing node-9, its memory usage is the following:

Node                 %MEM
wip20220224-node-9    8.0
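
For reference, one way to track this kind of growth over time on Linux (independent of rusk; rss_kb is a hypothetical helper, not part of the codebase) is to have the process sample its own VmRSS from /proc/self/status:

use std::fs;
use std::thread;
use std::time::Duration;

/// Resident set size (VmRSS) of the current process in kB,
/// as reported by /proc/self/status. Linux-only.
fn rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|l| l.starts_with("VmRSS:"))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse().ok())
}

fn main() {
    // Log RSS once per minute; a steady upward trend with no traffic
    // points at an internal leak rather than request handling.
    loop {
        if let Some(kb) = rss_kb() {
            println!("rss_kb={}", kb);
        }
        thread::sleep(Duration::from_secs(60));
    }
}
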
goshawk-3 commented 2 years ago

As per the memory profiler, a leak was detected coming from STATIC_MAP.

From github.com-1ecc6299db9ec823/canonical-0.6.6/src/id.rs

impl Id {
    /// Creates a new Id from a type
    pub fn new<T>(t: &T) -> Self
    where
        T: Canon,
    {
        let len = t.encoded_len();
        let payload = if len > PAYLOAD_BYTES {
            Store::put(&t.encode_to_vec())
        } else {
          ...
        };

Additionally, for Store::put(&t.encode_to_vec()) we've got

pub(crate) fn put(bytes: &[u8]) -> IdHash {
    // If length is less than that of a hash, this should have been inlined.
    debug_assert!(bytes.len() > core::mem::size_of::<IdHash>());
    let hash = Self::hash(bytes);
    STATIC_MAP.write().insert(hash, Vec::from(bytes));
    hash
}

NB: only take_bytes removes items from STATIC_MAP.

If pub(crate) fn take_bytes(id: &Id) -> Result<Vec<u8>, CanonError> is not called for every item that was put, we can experience OOM due to the constantly growing STATIC_MAP.
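
To illustrate the failure mode, here is a minimal, dependency-free sketch (not canonical's actual implementation: a plain HashMap stands in for STATIC_MAP and a non-cryptographic hash stands in for IdHash):

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Mutex, OnceLock};

// Process-wide store, analogous to canonical's STATIC_MAP.
static STATIC_MAP: OnceLock<Mutex<HashMap<u64, Vec<u8>>>> = OnceLock::new();

fn map() -> &'static Mutex<HashMap<u64, Vec<u8>>> {
    STATIC_MAP.get_or_init(|| Mutex::new(HashMap::new()))
}

// Placeholder hash standing in for IdHash.
fn hash(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

// Analogue of Store::put: every call inserts, nothing is ever evicted.
fn put(bytes: &[u8]) -> u64 {
    let id = hash(bytes);
    map().lock().unwrap().insert(id, bytes.to_vec());
    id
}

// Analogue of take_bytes: the only path that removes an entry.
#[allow(dead_code)]
fn take_bytes(id: u64) -> Option<Vec<u8>> {
    map().lock().unwrap().remove(&id)
}

fn main() {
    // Values that are put but never taken stay resident for the whole
    // process lifetime, so the map grows with every stored encoding.
    for i in 0u64..1_000 {
        let _id = put(&i.to_le_bytes());
        // take_bytes(_id) is never called here, mirroring the leak.
    }
    println!("entries retained: {}", map().lock().unwrap().len());
}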

The issue is allegedly fixed in canonical 0.7.0

autholykos commented 2 years ago

See dusk-network/rusk#609
See dusk-network/rusk#606

herr-seppia commented 2 years ago

Memory keeps increasing, slowly but steadily:

Node                  %MEM    RSS
wip-20220228-node-0   13.9 138536
wip-20220228-node-1    7.5  75156
wip-20220228-node-2    7.6  76548
wip-20220228-node-4    7.5  74704
wip-20220228-node-5    7.5  75156
wip-20220228-node-6    7.3  73288
wip-20220228-node-7    7.5  74648
wip-20220228-node-8    6.3  63656
wip-20220228-node-9   26.0 258908

autholykos commented 2 years ago

Upon running a cluster with no rate limit at all and 1GB of memory per VPS, after an initial memory increase up to block height ~15,000, memory stabilizes at a constant ~350MB (at the time of writing, the cluster is at 70,000 blocks).

Closing in favor of dusk-network/dusk-blockchain#1317