axone-protocol / axoned

⛓️ Axone blockchain πŸ’«
https://axone.xyz

πŸ›‘οΈ High Memory Consumption leading to Network Outage #618

Closed ccamel closed 4 months ago

ccamel commented 6 months ago

> [!NOTE]
> Severity: Critical
> Target: v7.1.0 - Commit: 3c854270b006db30aa3894da2cdba10cc31b8c5f
> Ref: OKP4 Blockchain Audit Report v1.0 - 02-05-2024 - BlockApex

Description

While examining the Prolog code executed on the okp4 blockchain, we used a black-box approach in which we tried the various Prolog predicates natively allowed on the chain. We found that generating a large list and then mapping over it causes high memory usage, which may lead to the termination of the blockchain instance executing the query. This issue was also verified on ichiban/prolog's scripting engine: if a very large length such as N = 10000000000 is passed, memory usage grows until the node process is forced to terminate. Interestingly, for smaller values such as N = 10000, the blockchain handles the situation gracefully and returns a gas error.

Further testing showed that this issue applies to any predicate capable of high memory consumption.

This finding is not isolated to our custom testing environments but was also corroborated using the ichiban/prolog scripting engine, highlighting a fundamental issue within the underlying computational logic that poses a serious risk to the blockchain's stability and operational integrity.
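For reference, the behaviour reproduces outside the chain as well. Below is a minimal sketch against the ichiban/prolog interpreter API (the constructor and query call are the same ones used in the analysis further down; on a typical host this run exhausts memory instead of returning):

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/ichiban/prolog"
)

func main() {
    p := prolog.New(nil, nil)

    // Even with a deadline, the engine's allocation loop is not
    // interruptible, so memory is exhausted long before the context expires.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    sol := p.QuerySolutionContext(ctx, `length(List, 10000000000), maplist(=(1), List).`)
    fmt.Println(sol.Err())
}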

Impact

The ramifications of this vulnerability are twofold:

Recommendation

TBC

bdeneux commented 6 months ago

Below are my analysis and the steps to reproduce the critical issue at hand. Let's open a discussion about it and about potential solutions.

πŸ“ length/2 allocation issue

Here is the simplified Prolog query that causes high memory consumption:

length(List, 10000000000), maplist(=(1), List).

Initially, I thought the issue was due to the recursive execution of all clauses through maplist/2. However, recursion is not the problem: the logic module requests Prolog execution through a context that enforces the gas limit. This context is passed to the Promise handling each predicate, which stops immediately once the gas limit is reached.
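To make that distinction concrete, here is a rough sketch (not the engine's actual code) of where cancellation is observed: between promise steps, never inside a single long-running step, which is exactly how one huge allocation loop escapes it:

// Rough sketch, not the engine's actual code: the driver forces the promise
// one step at a time, and the context (gas limit, deadline) is consulted
// only between steps.
func force(ctx context.Context, steps []func() error) error {
    for _, step := range steps {
        if err := ctx.Err(); err != nil {
            return err // gas limit / deadline is caught here...
        }
        if err := step(); err != nil {
            return err // ...but a single step can still allocate unbounded memory
        }
    }
    return nil
}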

When length(List, 4) is called, Prolog allocates four new Variables and unifies them into List:

https://github.com/axone-protocol/prolog/blob/d6ea3496c94deb1dbb755c4e2c78914507c67d0b/engine/builtin.go#L2926-L2935

func lengthRundown(vm *VM, list Variable, n Integer, k Cont, env *Env) *Promise {
    elems, err := makeSlice(int(n))
    if err != nil {
        return Error(resourceError(resourceMemory, env))
    }
    for i := range elems {
        elems[i] = NewVariable()
    }
    return Unify(vm, list, List(elems...), k, env)
}

The only limit checked here is the memory limit, which depends on the node's host and is caught by makeSlice(). For very large numbers, however, allocating NewVariable() through the loop takes a long time and is not covered by the context limitation. This means that when we call this predicate through a context with a deadline:

func TestInterpreter_performance(t *testing.T) {
    var out bytes.Buffer
    p := New(nil, &out)

    tests := []struct {
        name     string
        query    string
        output   string
        outputFn func(t *testing.T, output string)
        err      error
        waits    bool
    }{
        {name: "1", query: `length(List, 10000000), maplist(=(1), List).`, err: context.DeadlineExceeded, waits: true},
        {name: "2", query: `length(List, 10000000000), maplist(=(1), List).`, err: context.DeadlineExceeded, waits: true},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            ctx, cancel := context.WithTimeout(context.Background(), 10000*time.Millisecond)
            defer cancel()

            assert.Equal(t, tt.err, p.QuerySolutionContext(ctx, tt.query).Err())
            assert.Equal(t, tt.waits, errors.Is(ctx.Err(), context.DeadlineExceeded))
        })
    }
}

The first test case passes since the deadline is handled by maplist/2, but the second one fails: the allocation loop is not interrupted by context cancellation, so the run only stops when the test timeout is hit.

Workaround

As a small workaround for testing purposes, I added the context to the VM and used it to check that the context is not done before continuing:

func lengthRundown(vm *VM, list Variable, n Integer, k Cont, env *Env) *Promise {
    elems, err := makeSlice(int(n))
    if err != nil {
        return Error(resourceError(resourceMemory, env))
    }
    for i := range elems {
        select {
        case <-vm.Ctx.Done():
            fmt.Println("TIME OUT")
            return Error(fmt.Errorf("time out"))
        default:
            elems[i] = NewVariable()
        }
    }
    return Unify(vm, list, List(elems...), k, env)
}
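One note on this workaround: a select on every iteration adds per-element overhead. A hypothetical refinement is to poll the context only every few thousand allocations, which amortizes the check while keeping cancellation latency low:

// Hypothetical refinement of the workaround above: poll the context every
// 4096 allocations instead of on every iteration.
func lengthRundown(vm *VM, list Variable, n Integer, k Cont, env *Env) *Promise {
    const pollEvery = 4096

    elems, err := makeSlice(int(n))
    if err != nil {
        return Error(resourceError(resourceMemory, env))
    }
    for i := range elems {
        if i%pollEvery == 0 {
            select {
            case <-vm.Ctx.Done():
                return Error(vm.Ctx.Err()) // propagate the cancellation cause
            default:
            }
        }
        elems[i] = NewVariable()
    }
    return Unify(vm, list, List(elems...), k, env)
}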

πŸ’Ύ Memory limit

Here is the snippet of code used to make a new slice while controlling available memory:

https://github.com/axone-protocol/prolog/blob/d6ea3496c94deb1dbb755c4e2c78914507c67d0b/engine/malloc.go#L14-L43

The memory-limit calculation used when making a new slice is not safe and depends essentially on the host where the node runs. It is racy: between the time the available memory is checked (free := memFree()) and the time the slice is allocated (make([]Term, n)), another part of the program can allocate memory, causing actual memory usage to exceed the limit. (Under real blockchain node execution conditions, this happened multiple times during my tests.) Moreover, we cannot catch memory overflows: https://github.com/ichiban/prolog/issues/226.
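One possible direction to remove this race: stop sampling the host's free memory and instead reserve against a fixed, deterministic budget before allocating. Here is a minimal sketch of the idea with hypothetical types (this is not the current malloc.go):

import (
    "errors"
    "sync/atomic"
)

var errBudgetExceeded = errors.New("allocation budget exceeded")

// memBudget is a hypothetical fixed allocation budget, counted in Terms.
// The reservation happens atomically *before* the allocation, so another
// goroutine cannot slip an allocation in between the check and make().
type memBudget struct {
    used int64
    max  int64
}

func (b *memBudget) makeSlice(n int) ([]Term, error) {
    if atomic.AddInt64(&b.used, int64(n)) > b.max {
        atomic.AddInt64(&b.used, -int64(n)) // roll the reservation back
        return nil, errBudgetExceeded
    }
    return make([]Term, n), nil
}

A deterministic budget would also sidestep the consensus concern raised below: every validator enforces the same limit regardless of the host's available memory.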


πŸ“ Next steps

Let's discuss possible fixes we can implement πŸ˜…. @ccamel @amimart

ccamel commented 5 months ago

Thanks @bdeneux for your detailed analysis of the nature of the security problem revealed by the audit.

Regarding the length/2 allocation issue, we can't use the deadline pattern to solve it because it introduces unpredictability into the execution of requests. Depending on the complexity of the query and the current load on the system, the same query may sometimes succeed and sometimes fail, compromising the ability of the validation nodes to reach consensus.

As a preliminary consideration, I believe the best approach would be to implement controls on the allocators (memory, variables, etc.) that:

  1. Ensure that the amount of allocated resources (memory, variables, etc.) cannot exceed a maximum, which assures validators that allocated resources remain constrained at all times. This serves as the absolute limit.
  2. Account for each allocation with a proportional consumption of gas. This acts as the utilitarian limit (see the sketch below).
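To illustrate, here is a rough sketch of the two limits combined. This is hypothetical code: allocator, GasMeter, and the error name are assumptions standing in for whatever gas accounting the interpreter's context exposes.

import "errors"

var errResourceExhausted = errors.New("resource limit exceeded")

// GasMeter stands in for the chain's gas accounting (hypothetical).
type GasMeter interface {
    Consume(amount uint64) error // out-of-gas error when exhausted
}

// allocator enforces both limits described above.
type allocator struct {
    allocated  uint64   // resources allocated so far (variables, terms, ...)
    maxAlloc   uint64   // 1. absolute limit: identical on every validator
    gasPerUnit uint64   // gas charged per allocated unit
    gas        GasMeter // 2. utilitarian limit
}

func (a *allocator) alloc(n uint64) error {
    if n > a.maxAlloc-a.allocated {
        return errResourceExhausted // hard, deterministic cap
    }
    if err := a.gas.Consume(n * a.gasPerUnit); err != nil {
        return err // proportional gas charge
    }
    a.allocated += n
    return nil
}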

That said, it's not easy and is quite tricky (as you mentioned, https://github.com/ichiban/prolog/issues/226 is a good read), and it has to be applied to all the code, mainly in the ichiban/prolog library, but also to our native predicates where necessary.

But I'd need a bit more time to get a better grasp of the issue and to see whether other approaches might be more effective and safer.

amimart commented 5 months ago

@ccamel @bdeneux This is really tricky. As a first, far-from-ideal fix, I would propose prohibiting the use of length/2 with an uninstantiated list variable; the same approach could be used for the functor/3 predicate. This way we can still get the length of a list, but not allocate one (roughly as sketched below).
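For illustration, the guard could look roughly like this around the engine's Length builtin. This is a hypothetical sketch against the types quoted earlier, not a tested patch, and the instantiationError helper name is an assumption:

// Hypothetical guard: refuse to *build* a list from an unbound variable,
// while still allowing length/2 to measure an existing list.
func lengthGuarded(vm *VM, list, length Term, k Cont, env *Env) *Promise {
    if _, unbound := env.Resolve(list).(Variable); unbound {
        if _, bound := env.Resolve(length).(Integer); bound {
            // length(List, N) with List unbound would allocate N fresh
            // variables: reject it instead.
            return Error(instantiationError(env))
        }
    }
    return Length(vm, list, length, k, env) // delegate to the stock builtin
}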

What do you think?

bdeneux commented 4 months ago

@amimart Yes, that seems a good temporary solution, but after working on https://github.com/axone-protocol/prolog/pull/3, I think it would be better (also as a temporary solution) to instead add a maximum number of possible NewVariable allocations. It would still allow unifying length/2 with an uninstantiated variable, but within a defined limit, as @ccamel said in his first comment about allocated resources. This is only a temporary solution since it acts on the variable resource alone and not on global memory, but it's a good starting point.
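For illustration, here is a rough sketch of such a cap. It is hypothetical code: since NewVariable is a package-level constructor, a real fix would need the counter scoped to the VM or to a single query's execution:

import "errors"

var errTooManyVariables = errors.New("maximum number of variables reached")

// varBudget is a hypothetical per-query cap on variable creation.
type varBudget struct {
    created int64
    max     int64
}

func (b *varBudget) newVariable() (Variable, error) {
    if b.created >= b.max {
        return 0, errTooManyVariables // the defined limit discussed above
    }
    b.created++
    return NewVariable(), nil
}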

amimart commented 4 months ago

> @amimart Yes, that seems a good temporary solution, but after working on axone-protocol/prolog#3, I think it would be better (also as a temporary solution) to instead add a maximum number of possible NewVariable allocations. It would still allow unifying length/2 with an uninstantiated variable, but within a defined limit, as @ccamel said in his first comment about allocated resources. This is only a temporary solution since it acts on the variable resource alone and not on global memory, but it's a good starting point.

Totally agree with you that's way better πŸ‘, let's go this way