code-423n4 / 2024-03-phala-network-findings

Local cache will enter into thrashing under heavy workloads #27

c4-bot-3 closed this issue 3 months ago

c4-bot-3 commented 3 months ago

Lines of code

https://github.com/code-423n4/2024-03-phala-network/blob/a01ffbe992560d8d0f17deadfb9b9a2bed38377e/phala-blockchain/crates/pink/chain-extension/src/local_cache.rs#L176-L183

Vulnerability details

Impact

Using a FIFO for the design of the local cache will lead to thrashing under heavy workloads, as the garbage collector blindly treats all items in the cache the same way regardless of their "usage" by the runtime. For example, it takes into account neither the number of previous accesses to a given item nor whether it was accessed recently (which makes the odds of accessing it again higher than for the other items). Because of that, it will lead to a bottleneck where the runtime spends too much time "cleaning" the local cache rather than doing actual computations.

Description

Instead of showing a standalone code snippet and leaving it at that, I feel the best way to explain this flaw is first with an equivalent simplified example. Say we have a local cache implemented as a vector whose maximum size is 3 (your impl is a BTreeMap, so the analogy is valid). Now, the runtime will be calculating

$$\sum_{i=1}^{n}2i$$

with the counter $i$, the term $2i$, the running sum, and the bound $n$ being cached in the local cache. As we have $4$ "variables", we can't fit them all in the local cache at a time, so the garbage collector will be called by the runtime pretty often to remove the "old" ones. For example, let's run our example for $n=3$:

  1. Load i = 1 in the first cell
  2. Load 2i = 2 in the second cell
  3. Load n = 3 in the third cell, check the ending condition and continue
  4. GC and remove the first cell (i = 1)
  5. Load i = 2 in the first cell
  6. Compute 2i + second cell = 2 * 2 + 2 = 6 and call GC to remove the second cell
  7. Load 6 in the second cell
  8. As n is already in the third cell from step 3., just check the ending condition and continue
  9. GC and remove the THIRD cell, as it is the oldest one (remember, it came from step 3., and the other cells came from 5. and 6. respectively)
  10. Load i = 3 in the third cell
  11. GC and remove the first cell (i = 2)
  12. Load 2i = 6 in the first cell
  13. As the accumulated value is already in the second cell from step 6., compute 2i + second cell = 2 * 3 + 6 = 12 and call GC to remove the second cell
  14. Load 12 in the second cell
  15. GC and remove the third cell
  16. Load n in the third cell
  17. GC and remove the first cell
  18. Load i = 4 in the first cell, check the ending condition and break

The illustration is the following one:

        --------------------------------------------------------------
cell 1 | i = 1  | i = 1  | i = 1  |   GC   | i = 2  | i = 2  | i = 2  |
        --------------------------------------------------------------
cell 2 |        | 2i = 2 | 2i = 2 | 2i = 2 | 2i = 2 |   GC   | 2i = 6 |
        --------------------------------------------------------------
cell 3 |        |        | n = 3  | n = 3  | n = 3  | n = 3  | n = 3  |
        --------------------------------------------------------------
                1.       2.       3.       4.       5.       6.       7.

        ---------------------------------------------------------------
cell 1 | i = 2  | i = 2  | i = 2  |   GC   | 2i = 6 | 2i = 6 | 2i = 6  |
        ---------------------------------------------------------------
cell 2 | 2i = 6 | 2i = 6 | 2i = 6 | 2i = 6 | 2i = 6 |   GC   | 2i = 12 |
        ---------------------------------------------------------------
cell 3 | n = 3  |   GC   | i = 3  | i = 3  | i = 3  | i = 3  | i = 3   |
        ---------------------------------------------------------------
                8.       9.      10.      11.      12.      13.       14.

        ---------------------------------------
cell 1 | 2i = 6  | 2i = 6  |   GC    |  i = 4  |
        ---------------------------------------
cell 2 | 2i = 12 | 2i = 12 | 2i = 12 | 2i = 12 |
        ---------------------------------------
cell 3 |   GC    |  n = 3  |  n = 3  |  n = 3  |
        ---------------------------------------
                15.       16.       17.       18.

For our example, the runtime spent 7 "ticks" doing garbage collection routines out of 18, that is, 7 / 18 = 0.3888... or roughly 39% of the time. The example has been deliberately designed to show the poor performance of the algorithm on a seemingly innocuous workload. As the runtime will be executing arbitrary code much more complex than this, the performance could be far worse, which is not desirable in a production-ready environment.
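To make the thrashing concrete, below is a minimal, self-contained Rust sketch (my own illustration, not the audited code) that replays a cyclic access pattern over our 4 "variables" against a capacity-3 FIFO. The working set barely exceeds the capacity, yet every single access misses, so the cache does pure eviction work:

    use std::collections::VecDeque;

    fn main() {
        // Capacity-3 FIFO cache, 4 keys accessed cyclically (i, 2i, acc, n).
        let capacity = 3;
        let keys = ["i", "2i", "acc", "n"];
        let mut fifo: VecDeque<&str> = VecDeque::new();
        let (mut hits, mut misses) = (0u32, 0u32);

        for step in 0..24 {
            let key = keys[step % keys.len()];
            if fifo.contains(&key) {
                hits += 1;
            } else {
                misses += 1;
                if fifo.len() == capacity {
                    fifo.pop_front(); // evict the oldest insertion, FIFO-style
                }
                fifo.push_back(key);
            }
        }
        // Prints "hits: 0, misses: 24": FIFO turns every access into a miss.
        println!("hits: {hits}, misses: {misses}");
    }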

Proof of Concept

In local_cache.rs, a routine called fit_size is defined which "removes the items closest to the expiration date until the size can be fit into max_size". Add to this the fact that, when adding an item to the cache, we assign it a default lifetime in

local_cache::LocalCache impl, line 200

    pub fn set(
        &mut self,
        id: Cow<[u8]>,
        key: Cow<[u8]>,
        value: Cow<[u8]>,
    ) -> Result<(), StorageQuotaExceeded> {
        self.maybe_clear_expired();
        self.storages
            .get_mut(id.as_ref())
            .ok_or(StorageQuotaExceeded)?
            .set(key, value, self.default_value_lifetime)
    }

local_cache::Storage impl, line 120

    fn set(
        &mut self,
        key: Cow<[u8]>,
        value: Cow<[u8]>,
        lifetime: u64,
    ) -> Result<(), StorageQuotaExceeded> {

        ...

        self.kvs.insert(
            key.into_owned(),
            StorageValue {
                expire_at: now().saturating_add(lifetime),
                value: value.into_owned(),
            },
        );
        Ok(())
    }

and those additions are atomic, then we have in place a vanilla FIFO implementation, with the thrashing and poor performance mentioned above. As I explained in the previous section, I will not dwell much more on that. The flaw is in the use of a FIFO algorithm for the management of the local cache: if it were a good choice, then modern OSes and VMs would rely on it for their memory or task management routines, which has not been the case for a few decades.
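To make the FIFO equivalence explicit, here is a hypothetical, heavily simplified re-implementation of a fit_size-style routine (my own sketch, not the audited code). Since every entry gets expire_at = insert_time + lifetime with a constant lifetime, "closest to expiration" and "inserted earliest" pick the very same entry:

    use std::collections::BTreeMap;

    struct StorageValue {
        expire_at: u64,
        value: Vec<u8>,
    }

    // Sketch: evict the entries closest to expiration until the total
    // size fits. With a constant lifetime, expire_at grows monotonically
    // with insertion time, so this is plain FIFO eviction.
    fn fit_size(kvs: &mut BTreeMap<Vec<u8>, StorageValue>, size: &mut usize, max_size: usize) {
        while *size > max_size {
            let oldest = kvs
                .iter()
                .min_by_key(|(_, v)| v.expire_at)
                .map(|(k, _)| k.clone());
            let Some(key) = oldest else { break };
            if let Some(v) = kvs.remove(&key) {
                *size -= key.len() + v.value.len(); // assumed size accounting
            }
        }
    }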

Recommended Mitigation Steps

The obvious solution would be increasing the size of the local cache, but the needed threshold would be arbitrary and you would have dangling items staying idle in the local cache for a long time. Instead, I would recommend implementing a variation of the algorithm in place, that is: when the runtime accesses a certain item in the local cache, increase its expiration by a certain amount. That way, under frequent accesses like in loops (or, in our example above, where we needed n in 4 situations), n would have had a later expiration than the rest of the variables, so the runtime would have been able to query it directly instead of bringing it back into the local cache all the time (the garbage collector would have removed the other items instead of n). For example, a fix would be as simple as:

local_cache::LocalCache impl, line 181

+   const BONUS: u64 = 10; // whatever 

-   pub fn get(&self, id: &[u8], key: &[u8]) -> Option<Vec<u8>> {
-       let entry = self.storages.get(id)?.kvs.get(key)?;
+   pub fn get(&mut self, id: &[u8], key: &[u8]) -> Option<Vec<u8>> {
+       let entry = self.storages.get_mut(id)?.kvs.get_mut(key)?;

        if entry.expire_at <= now() {
            None
        } else {
+           entry.expire_at = entry.expire_at.saturating_add(BONUS); // saturating, matching set()'s style
            Some(entry.value.to_owned())
        }
    }

and a test that verifies the correct behavior of these modifications:

    pub fn get_test(&mut self, id: &[u8], key: &[u8]) -> Option<u64> {
        let entry = self.storages.get_mut(id)?.kvs.get_mut(key)?;
        Some(entry.expire_at)
    }

    #[test]
    fn new_variation_fifo() {
        let mut cache = test_cache();
        cache.apply_quotas([(&b"id"[..], 1000)]);
        let _ = cache.set(cow(b"id"), cow(b"foo"), cow(b"value"));
        let _ = cache.get(&cow(b"id"), &cow(b"foo"));
        let exp = cache.get_test(&cow(b"id"), &cow(b"foo")).unwrap_or(0xAAA);

        assert_eq!(exp, now() + cache.default_value_lifetime + BONUS);     
    }

Assessed type

Other

c4-pre-sort commented 3 months ago

141345 marked the issue as primary issue

c4-pre-sort commented 3 months ago

141345 marked the issue as sufficient quality report

141345 commented 3 months ago

memory and cache management issue

c4-sponsor commented 3 months ago

kvinwang (sponsor) confirmed

OpenCoreCH commented 3 months ago

The warden has demonstrated nicely why the chosen cache algorithm can lead to poor performance, therefore affecting the runtime. While the chosen PoC is only a toy example, this is to be expected in this case, because evaluating cache algorithms and the particular choice of one is an active area of research and very involved.

c4-judge commented 3 months ago

OpenCoreCH marked the issue as selected for report

MarioPoneder commented 3 months ago

Adding a further perspective:

The method fit_size (private), which implicitly creates the FIFO behaviour, must be explicitly invoked by changing the cache quotas (setting a new max_size) through apply_quotas (public).

The set method only invokes clear_expired and fails with StorageQuotaExceeded if not enough storage is available after clearing expired entries.
Therefore, under heavy-load get and set operations, without continuously resizing the cache (apply_quotas -> fit_size), there is no FIFO behaviour but rather a DoS scenario (StorageQuotaExceeded) once the cache is full. See also default_value_lifetime, which is 1 week.
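A minimal sketch of that failure mode, reusing the test helpers from the report above (hypothetical: it assumes set charges an entry's key and value bytes against the quota):

    let mut cache = test_cache();
    cache.apply_quotas([(&b"id"[..], 10)]); // deliberately tiny quota

    // The first entry fits within the quota.
    assert!(cache.set(cow(b"id"), cow(b"k1"), cow(b"0123")).is_ok());

    // Nothing has expired yet (default_value_lifetime is 1 week), so the
    // next set exceeds the quota and errors instead of evicting anything.
    assert!(cache.set(cow(b"id"), cow(b"k2"), cow(b"0123")).is_err());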

Edit (based on another comment below): It's correct that no panic will occur, just an error when the cache is full.

nethoxa commented 3 months ago

Hmmm, very interesting @MarioPoneder. I overlooked the fit_size function definition and thought it was public (my bad). It is as you say: there is no way to call it unless you call apply_quotas. That means under heavy operations the local cache would be DoSed, that's right.

Doing a quick ctrl-f over the in-scope repo, it's never called from anything that's not a test, so, if possible, I would like to hear from the sponsors @c4-sponsor what they think. If apply_quotas, and so fit_size by extension, is never called within other parts of the system that are not in scope for this contest, it would be more dangerous than we initially thought.

DadeKuma commented 3 months ago

My take on this:

I agree with @MarioPoneder minus the DoS part. Note that the function will not panic even if the cache is full (and thus it won't crash the worker); it will return a StorageQuotaExceeded error instead, which can be consumed by the caller.

Supposing a full cache, the alternative to returning an error would be to overwrite a non-expired entry instead, which might be considered equally bad, or worse.

I think this is a design decision as the implementer can choose what to do. If the cache is full, an error will be returned, and the caller may decide to call remove to reduce the cache size, and then retry again.
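That caller-side recovery could look roughly like the following sketch (hypothetical: remove's exact signature is assumed, and the key to drop is the caller's choice):

    // On StorageQuotaExceeded, free some space explicitly, then retry once.
    if cache.set(cow(b"id"), cow(b"key"), cow(b"value")).is_err() {
        cache.remove(b"id", b"stale-key"); // assumed signature: remove(id, key)
        let _ = cache.set(cow(b"id"), cow(b"key"), cow(b"value"));
    }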

nethoxa commented 3 months ago

Yeah, thanks for your words @DadeKuma. That's why I asked the sponsor to shed some light on this, as the caller you mention is not within the code in scope for this contest. Moreover, calling apply_quotas every time the cache is full under heavy workloads will surely slow down the performance of the cache, which directly impacts the parts of the system that make use of it. This seems to be a grey area in the C4 rules we are used to, in my opinion.

kvinwang commented 3 months ago

Thank you, @MarioPoneder @DadeKuma @nethoxa, for reviewing this finding. I apologize for assuming that fit_size would be invoked in the set function during confirmation. However, this is not the case. The apply_quotas function is called within the worker when there's a change in on-chain deposits to a contract. It is not triggered by off-chain workload. Only functions affected by the chain extension defined here should be considered as driven by off-chain workload.

c4-sponsor commented 3 months ago

kvinwang (sponsor) disputed

nethoxa commented 3 months ago

Hey @kvinwang, thanks so much for the clarification. Isn't it worse, then? If the off-chain workload is executed and fills the cache, the only way to "clean" it is via a deposit to a contract. For the sake of the example, let's say the cache has a size of 3 and there are 4 tasks and one deposit to be done in the order TTTTD, and say each task consumes one slot of the cache. Then, as you only call apply_quotas -> fit_size on the deposit D, once the runtime gets past the third task, set will fail as the cache is full and you can NOT "clean" it unless a deposit is done (based on the initial assumption of heavy workloads, so that the items in the cache are not expired yet).

Nevertheless, the initial flaw I mentioned in my submission remains, that is, a FIFO-like algorithm should not be used for the design of a production-ready cache (under the hood that is what it is: you are removing the oldest ones first via either set -> clear_expired(now()) or via apply_quotas -> fit_size)

Edit: for reference, the place where apply_quotas is called is here

kvinwang commented 3 months ago

Based on the initial assumption of heavy workloads, there is no longer thrashing, but rather returning an error on set. This aligns with the design decision, as pointed out by @DadeKuma.

In such a situation, the contract developer would have two options:

OpenCoreCH commented 3 months ago

Thanks all for the further input! Based on the discussion, thrashing won't occur, but it will rather return an error, which is the desired (and probably best) behaviour in such situations.

c4-judge commented 3 months ago

OpenCoreCH marked the issue as nullified