awnumar / memguard

Secure software enclave for storage of sensitive information in memory.
Apache License 2.0

Open enclave deadlock #132

Closed wodadehencou closed 1 year ago

wodadehencou commented 4 years ago

Describe the bug Opening an enclave may deadlock on an RWMutex, caused by recursively taking the read lock (RLock) on the same RWMutex, which is not permitted by Go.

From the Go standard library, in src/sync/rwmutex.go:

// If a goroutine holds a RWMutex for reading and another goroutine might
// call Lock, no goroutine should expect to be able to acquire a read lock
// until the initial read lock is released. In particular, this prohibits
// recursive read locking. This is to ensure that the lock eventually becomes
// available; a blocked Lock call excludes new readers from acquiring the
lock.
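
For reference, a minimal standalone sketch (not memguard code) of the rule quoted above: once a writer is blocked in Lock(), new readers are held back, so a goroutine that already holds a read lock and calls RLock() again deadlocks behind the queued writer.

package main

import (
    "sync"
    "time"
)

func main() {
    var mu sync.RWMutex

    mu.RLock() // this goroutine holds the first read lock

    go func() {
        mu.Lock() // a writer queues up behind the read lock and blocks
        mu.Unlock()
    }()

    time.Sleep(100 * time.Millisecond) // give the writer time to block inside Lock()

    mu.RLock() // recursive read lock: held back behind the pending writer -> deadlock
    mu.RUnlock()
    mu.RUnlock()
}

Run with go run, this ends with the runtime's fatal error: all goroutines are asleep - deadlock! — the same pattern as the hang at the RLock() in Coffer.Destroyed() described below.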

To Reproduce Steps to reproduce the behaviour:

  1. create a test case that opens the enclaves in parallel, e.g. with 20 goroutines.
  2. the process locks up within about 5 minutes, at the RLock() in func (s *Coffer) Destroyed() bool {
  3. this problem occurs on macOS and Linux

Expected behaviour All goroutines keep running normally.

Additional context The test case below reproduces the issue. CPU usage starts at about 100%, then drops to 0%; at that point the process is deadlocked.

package main

import (
    "bytes"
    "context"
    "crypto/rand"
    "errors"
    "fmt"
    "io"
    "os"
    "os/signal"
    "syscall"
    "time"

    "github.com/awnumar/memguard"
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    go func() {
        select {
        case <-sigs:
            cancel()
        }
    }()
    OpenEnclave(ctx)
}

func OpenEnclave(ctx context.Context) {
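    // Create n enclaves holding random 32-byte secrets, then spawn worker
    // goroutines that repeatedly open and verify them until ctx is cancelled.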
    n := 10
    data := make([][]byte, n)
    enclaves := make([]*memguard.Enclave, n)
    for i := range data {
        data[i] = make([]byte, 32)
        buf := make([]byte, 32)
        io.ReadFull(rand.Reader, buf)
        copy(data[i], buf)
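        // NewEnclave wipes buf after sealing it, so data[i] keeps a plaintext copy for verification.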
        enclaves[i] = memguard.NewEnclave(buf)
    }

    threads := 20
    for i := 0; i < threads; i++ {
        j := 0
        go func(ctx context.Context) {
            for {
                select {
                case <-ctx.Done():
                    return
                default:
                    {
                        // fmt.Printf("open enclave %d \n", j)
                        immediateOpen(ctx, enclaves[j], data[j])
                        j = (j + 1) % n
                    }
                }
            }
        }(ctx)
    }
    <-ctx.Done()
    time.Sleep(time.Second)
}

func openVerify(lock *memguard.Enclave, exp []byte) error {
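    // Open the enclave and check its contents against the expected plaintext.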
    lb, err := lock.Open()
    if err != nil {
        return err
    }
    defer lb.Destroy()
    if !bytes.Equal(lb.Bytes(), exp) {
        return errors.New("open verify fail")
    }
    return nil
}

func immediateOpen(ctx context.Context, lock *memguard.Enclave, exp []byte) {
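    // Run openVerify in a separate goroutine and report a timeout if it does
    // not return before ctx is cancelled.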
    start := time.Now()
    c1 := make(chan error, 1)
    go func() {
        err := openVerify(lock, exp)
        c1 <- err
    }()
    var dur time.Duration
    select {
    case err := <-c1:
        {
            dur = time.Since(start)
            if err != nil {
                fmt.Printf("### open fail: %s \n", err)
            }
        }
    case <-ctx.Done():
        {
            dur = time.Since(start)
            fmt.Printf("### timeout \n")
        }
    }
    // fmt.Printf("%d, %d \n", start.UnixNano(), dur.Nanoseconds())
    _ = dur
}
awnumar commented 4 years ago

Thanks for reporting. 👍

awnumar commented 4 years ago

After applying the patch the behaviour seems identical: a bunch of timeouts after a while:

...
1587048652522845416, 36642766 
1587048652555571711, 3972773 
1587048652544642401, 14974528 
1587048652555173078, 4526586 
^C### timeout 
1587048652559703689, 27796593682 
### timeout 
1587048652559624289, 27796714889 
### timeout 
1587048652559552655, 27796801999 
...
wodadehencou commented 4 years ago

I cannot reproduce the problem on my platform (go1.14, macOS). Could you @awnumar please add two lines of code at the end of OpenEnclave to print out the stack information?

    // ...
    <-ctx.Done()
    time.Sleep(time.Second)

    // dump all goroutine stacks (requires importing "runtime")
    buf := make([]byte, 1<<20)
    fmt.Println(string(buf[:runtime.Stack(buf, true)]))
}

I have read through all of the memguard code; there are some other places with the same bug. I will make an MR as soon as possible.

awnumar commented 4 years ago

Here's the output: trace.txt

go version go1.14.2 linux/amd64

wodadehencou commented 4 years ago

This is because of the Purge() function. The operation flow is enclave.Open -> Key.View --(acquires RLock)--> NewBuffer -> memcall.Alloc [fails] -> panic -> Purge -> key.Lock (requests the write lock).

This is another deadlock case. But I don't know why memcall.Alloc failed on your system, @awnumar.
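
Below is a simplified, hypothetical sketch of that flow (the names key, view and purge are placeholders, not the actual memguard source): a read lock is held while opening, the allocation panics, and the recovery path calls a purge routine that requests the write lock on the same RWMutex. Lock() then waits for a read lock that the same call path still holds.

package main

import (
    "errors"
    "sync"
)

type key struct {
    mu sync.RWMutex
}

func (k *key) purge() {
    k.mu.Lock() // blocks forever: the caller's read lock is still held
    defer k.mu.Unlock()
    // ... wipe and rekey state ...
}

func (k *key) view() {
    k.mu.RLock()
    defer k.mu.RUnlock()

    defer func() {
        if recover() != nil {
            k.purge() // deferred functions run LIFO, so this runs before the RUnlock above
        }
    }()

    panic(errors.New("memcall.Alloc failed")) // stands in for the failing allocation
}

func main() {
    (&key{}).view()
}

Releasing the read lock before entering any code path that may need the write lock avoids this, at the cost of taking the lock more than once per logical operation, which is the trade-off described in the next paragraph.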

A simple solution for all of these deadlock cases is to arrange all of the Lock() and Unlock() calls carefully. But it will be less efficient: there will be more than one lock acquisition across a few consecutive operations.

awnumar commented 4 years ago

Thanks for looking into this and providing a patch. I have merged your PR and added you to the authors file.

I can no longer reproduce the deadlock on my system. Your PoC does however produce some panics that are very interesting. I have added it to the examples submodule to investigate when I have more time.

Thanks again, and keep safe :)