cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

perf investigation: unaligned load/store on arm64 #109110

Open srosenberg opened 1 year ago

srosenberg commented 1 year ago

Unaligned load/store on arm64 can lead to lower memory bandwidth and higher latency; e.g., see Go's memmove benchmarks before and after [1]. Correctness is likely not an issue unless performing 64-bit atomic operations on values which Go's compiler doesn't guarantee to be 64-bit aligned [2], [3]. The reason is rather subtle; it deserves a more detailed explanation.

Correctness Explanation

Atomic load/store on arm64 must be aligned (otherwise, SIGBUS is raised; see the writeup below). E.g., according to [4], ldaddal (an atomic add with acquire and release semantics) will fault on unaligned access:

For Load-Acquire, Load-AcquirePC, and Store-Release instructions, the address of the data object that is supplied must be aligned to the size of the data element that is being accessed. Otherwise, the access generates an Alignment fault.

Non-atomic load/store can be unaligned. Go's compiler guarantees basic alignment on struct fields and array elements [5]. E.g., if a struct has a field of type int64, then on arm64 that field is 64-bit aligned (via padding), and so is the struct itself. On arm32, however, an int64 field is only guaranteed 32-bit alignment, hence the "bug" note in [3]:

The first word in an allocated struct, array, or slice; in a global variable; or in a local variable (because the subject of all atomic operations will escape to the heap) can be relied upon to be 64-bit aligned.

In summary, 64-bit atomics on arm32 are without faults iff they follow the above guidelines, e.g., only reference the first field of a struct. Otherwise, you risk getting shot down by an unaligned load/store. On arm64, a fault could only happen due to uses of unsafe, since type-checked accesses are 64-bit aligned.

[1] https://github.com/golang/go/issues/40324
[2] https://github.com/golang/go/issues/23345
[3] https://pkg.go.dev/sync/atomic#pkg-note-BUG
[4] https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
[5] https://go.dev/ref/spec#Size_and_alignment_guarantees

Jira issue: CRDB-30784

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/test-eng

srosenberg commented 1 year ago

Preliminary Investigation

Graviton2 (but not yet Graviton3) exposes PMU counters for unaligned load/store, namely unaligned_ld_spec and unaligned_st_spec. We run a TPC-C workload to get a count of unaligned accesses relative to all memory accesses, namely mem_access_rd and mem_access_wr.

Provision

roachprod create -n1 --clouds aws --aws-machine-type c6g.12xlarge --local-ssd=false stan-test
roachprod stage stan-test release v22.2.13
roachprod stage stan-test workload --arch arm64 --os linux

Run 1-node Cluster

perf stat -e mem_access_rd,mem_access_wr,unaligned_ld_spec,unaligned_st_spec ./cockroach start-single-node --insecure

 Performance counter stats for './cockroach start-single-node --insecure':

     1314797387351      mem_access_rd                                                 (49.95%)
      829265356569      mem_access_wr                                                 (49.99%)
       23037710432      unaligned_ld_spec                                             (50.06%)
       16752809887      unaligned_st_spec                                             (50.01%)

     779.429239436 seconds time elapsed

Load Data

perf stat -e mem_access_rd,mem_access_wr,unaligned_ld_spec,unaligned_st_spec ./workload_before fixtures import tpcc  --warehouses=100  'postgres://root@localhost:26257?sslmode=disable'

Performance counter stats for './workload_before fixtures import tpcc --warehouses=100 postgres://root@localhost:26257?sslmode=disable':

         219406219      mem_access_rd                                                 (56.73%)
         138251994      mem_access_wr                                                 (50.49%)
           1305497      unaligned_ld_spec                                             (50.73%)
            809740      unaligned_st_spec                                             (51.35%)

     107.547993203 seconds time elapsed

Run TPC-C

perf stat -e mem_access_rd,mem_access_wr,unaligned_ld_spec,unaligned_st_spec ./workload run tpcc --warehouses=100 --ramp=1m --duration=5m

 Performance counter stats for './workload run tpcc --warehouses=100 --ramp=1m --duration=5m':

        5535386284      mem_access_rd                                                 (49.72%)
        3023814042      mem_access_wr                                                 (50.35%)
          20462041      unaligned_ld_spec                                             (50.41%)
          15644317      unaligned_st_spec                                             (49.76%)

     360.307078331 seconds time elapsed

Summary

For the database, we see ~1.75% of loads and ~2% of stores are unaligned. In the case of the workload, the numbers fall under 1%. (Intuitively, workload doesn't move as much data in memory as does the database.)
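The percentages follow from dividing each unaligned counter by the matching memory-access counter. As a sketch of that arithmetic (the sample type and pct helper are hypothetical names; the numbers are copied from the perf stat output above), note that the counters were multiplexed, as the roughly 50% figures in parentheses indicate, so the ratios are estimates:

```go
package main

import "fmt"

// sample holds the four counters reported by one perf stat run.
type sample struct {
	name           string
	rd, wr         uint64 // mem_access_rd, mem_access_wr
	unalLd, unalSt uint64 // unaligned_ld_spec, unaligned_st_spec
}

// pct returns part as a percentage of total.
func pct(part, total uint64) float64 {
	return 100 * float64(part) / float64(total)
}

// samples are the counter values from the three runs above.
var samples = []sample{
	{"cockroach start-single-node", 1314797387351, 829265356569, 23037710432, 16752809887},
	{"workload fixtures import", 219406219, 138251994, 1305497, 809740},
	{"workload run tpcc", 5535386284, 3023814042, 20462041, 15644317},
}

func main() {
	for _, s := range samples {
		fmt.Printf("%-28s loads %.2f%%  stores %.2f%%\n",
			s.name, pct(s.unalLd, s.rd), pct(s.unalSt, s.wr))
	}
}
```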

We also repeated the above workload steps with the workload binary prior to the change in [1]. There was no (statistical) difference in unaligned_ld_spec or unaligned_st_spec.

[1] https://github.com/cockroachdb/cockroach/pull/108400

srosenberg commented 1 year ago

To illustrate what happens when an atomic store faults on graviton2, consider the following example using unsafe. The 10th element (index 9) is unaligned. Since we know that Go's compiler guarantees the first element to be 64-bit aligned, the element at index i is 64-bit aligned iff i = 0 mod 8. Executing the code below on graviton2 will fault with SIGBUS.

package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
	"unsafe"
)

func main() {
	buf := [100000]byte{}

	for i := 1; i < len(buf); i++ {
		// N.B. &buf[9] is not 8-byte aligned, so this atomic add causes SIGBUS.
		atomic.AddInt64((*int64)(unsafe.Pointer(&buf[9])), int64(rand.Intn(100)))
		// N.B. use the aligned address below instead to prevent SIGBUS.
		// atomic.AddInt64((*int64)(unsafe.Pointer(&buf[8])), int64(rand.Intn(100)))
	}
	fmt.Println("after:", buf[9])
}

Executing the above under gdb, we can see the faulting instruction:

[image: sigbus_ldaddal]