WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

prefetch instruction #1364

Open fbarchard opened 4 years ago

fbarchard commented 4 years ago

Could you add a prefetch instruction for WASM? The syntax would be similar to a load.

The load instruction uses a local.get for the base pointer and an immediate offset, and pushes the result onto the stack:

  local.get    1        # src
  f32.load     0        # load float from src
  local.set    4

A prefetch has a base and an offset, but no result:

  local.get        1        # src
  i32.prefetch   448     # prefetch src with offset of 448 bytes

On arm64 this is implemented as prfm pldl1keep, [src, 448]; on x64, prefetcht0 448(src).

The performance improvement varies by CPU and function. On Cortex-A53 a typical speedup is 20%. On Atom Cedarview, an SSSE3 function that converts RGB to grayscale speeds up by 9%.
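For comparison, the same pattern can already be expressed in C with the GCC/Clang builtin, which compilers lower to the native prefetch instructions mentioned above (prfm on arm64, prefetcht0 on x64). A minimal sketch, assuming the proposal's 448-byte offset; the function name and the trivial scale loop are illustrative only:

```c
#include <stddef.h>

/* Process a float array while prefetching 448 bytes (7 cache lines)
   ahead of the current read position. Arguments to __builtin_prefetch:
   address, rw (0 = read), locality (3 = keep in L1).
   Prefetches past the end of the array are harmless: prefetch
   instructions are non-faulting on arm64 and x64. */
static void scale_floats(const float *src, float *dst, size_t n) {
  for (size_t i = 0; i < n; i++) {
    __builtin_prefetch((const char *)(src + i) + 448, 0, 3);
    dst[i] = src[i] * 2.0f;
  }
}
```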

rrwinterton commented 4 years ago

I wonder if this instruction could be similar to what we talked about with the instrumentation instruction. Maybe a class of specialty instructions. I'm not sure we want to group them together, but effectively: if supported, it will execute; if not supported, it will be ignored. The instrumentation instruction takes a byte as an immediate. This would probably take an i32 as an immediate?

tlively commented 4 years ago

Yes, this is another instance of a proposed nop instruction with unmodeled side effects, just like the ITT proposal.

@fbarchard How would you expect prefetch instructions to be ordered with respect to other instructions in an optimizing WebAssembly engine? You wouldn't want it to be moved around too much, but if you don't let anything move past it, it might inhibit optimizations and no longer be beneficial.

rrwinterton commented 4 years ago

I can let Frank reply too. As you said, moving the prefetch by one or two instructions may not matter, similar to the ITT proposal. We just wouldn't want it to be moved inside a nested loop block: before the loop is okay; inside the loop would be bad, if that makes sense?

fbarchard commented 4 years ago

There are two cases to test performance on:

  1. CPUs and data where prefetch is a win. In general, larger data, older CPUs, and more complex functions benefit the most. For example, a Gaussian blur with a 5x5 kernel is 40% faster on Cortex-A53 using 5 prefetches. It has somewhat complex fetches (5 rows) and some multiply-accumulate math.

The location of the prefetch doesn't matter all that much in this case. It behaves much like a load and can be scheduled anywhere you put loads, preferably during some math. At the end of the loop is fine. Optimal appears to be somewhere after the loads.

  2. Avoiding a performance hit where the data is already in cache, the CPU has enormous caches, or a modern CPU is already good at speculative prefetching. Here you expect the prefetch not to help, and in the worst case to get in the way of useful instructions.

The location of the prefetch instruction matters more here. On ARM you can issue up to 3 math instructions per cycle, plus 1 prefetch. So you schedule prefetches after math instructions, preferably high-latency ones, so they co-issue for free.

Here's the example of the Gaussian filter (from libyuv) with prefetch on ARM. I've scheduled the prefetches after math instructions.

// filter 5 rows with 1, 4, 6, 4, 1 coefficients to produce 1 row.
void GaussCol_NEON(const uint16_t* src0,
                   const uint16_t* src1,
                   const uint16_t* src2,
                   const uint16_t* src3,
                   const uint16_t* src4,
                   uint32_t* dst,
                   int width) {
  asm volatile(
      "movi       v6.8h, #4                      \n"  // constant 4
      "movi       v7.8h, #6                      \n"  // constant 6

      "1:                                        \n"
      "ld1        {v1.8h}, [%0], #16             \n"  // load 8 samples, 5 rows
      "ld1        {v2.8h}, [%4], #16             \n"
      "uaddl      v0.4s, v1.4h, v2.4h            \n"  // * 1
      "prfm       pldl1keep, [%0, 448]           \n"
      "uaddl2     v1.4s, v1.8h, v2.8h            \n"  // * 1
      "ld1        {v2.8h}, [%1], #16             \n"
      "umlal      v0.4s, v2.4h, v6.4h            \n"  // * 4
      "prfm       pldl1keep, [%1, 448]           \n"
      "umlal2     v1.4s, v2.8h, v6.8h            \n"  // * 4
      "ld1        {v2.8h}, [%2], #16             \n"
      "umlal      v0.4s, v2.4h, v7.4h            \n"  // * 6
      "prfm       pldl1keep, [%2, 448]           \n"
      "umlal2     v1.4s, v2.8h, v7.8h            \n"  // * 6
      "ld1        {v2.8h}, [%3], #16             \n"
      "umlal      v0.4s, v2.4h, v6.4h            \n"  // * 4
      "prfm       pldl1keep, [%3, 448]           \n"
      "umlal2     v1.4s, v2.8h, v6.8h            \n"  // * 4
      "subs       %w6, %w6, #8                   \n"  // 8 processed per loop
      "st1        {v0.4s,v1.4s}, [%5], #32       \n"  // store 8 samples
      "prfm       pldl1keep, [%4, 448]           \n"
      "b.gt       1b                             \n"
      : "+r"(src0),  // %0
        "+r"(src1),  // %1
        "+r"(src2),  // %2
        "+r"(src3),  // %3
        "+r"(src4),  // %4
        "+r"(dst),   // %5
        "+r"(width)  // %6
      :
      : "cc", "memory", "v0", "v1", "v2", "v6", "v7");
}
rrwinterton commented 4 years ago

I think what Thomas is concerned with is that when you put a prefetch instruction somewhere, it may skid past where you expect it to be in the machine code. As long as it doesn't fall inside a code path whose address is a jump target, I think it would be okay, if that makes sense? If the prefetch falls below an address that is a jump target, it may end up in a loop where we don't want it to prefetch, e.g. a nested loop.

fbarchard commented 4 years ago

Here's an assembly function where I add a proposed WASM prefetch instruction. I picked a place in the inner loop where the instruction should issue for free without stalls and be most effective, while having no negative impact on data that is already in cache.

# Copyright 2020 Google LLC
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# void xnn_f32_relu_ukernel__wasm32_shr_x1(
#     size_t n,             0
#     const float* x,       1
#     float* y,             2
#     const union params)   3 unused

# locals
#     float v               4
#     float mask            5

    .text
    .section    .text.xnn_f32_relu_ukernel__wasm32_shr_x1,"",@
    .hidden     xnn_f32_relu_ukernel__wasm32_shr_x1
    .globl      xnn_f32_relu_ukernel__wasm32_shr_x1
    .type       xnn_f32_relu_ukernel__wasm32_shr_x1,@function

xnn_f32_relu_ukernel__wasm32_shr_x1:
    .functype   xnn_f32_relu_ukernel__wasm32_shr_x1 (i32, i32, i32, i32) -> ()
    .local      i32, i32  # 4 - value, 5 - mask

    loop

      local.get    1        # src
      i32.load     0        # load float from src
      local.set    4

      local.get    1        # src += 4
      i32.const    4
      i32.add
      local.set    1

      local.get    4        # ((v >> 31) - 1) & v
      i32.const    31
      i32.shr_u
      local.set    5

      local.get    5
      i32.const    -1
      i32.add
      local.set    5

      local.get    1        # src
      prefetch     448      # prefetch from src + 448 (7 cache lines)

      local.get    4
      local.get    5
      i32.and
      local.set    4

      local.get    2        # dst
      local.get    4
      i32.store    0        # store float

      local.get    2        # dst += 4
      i32.const    4
      i32.add
      local.set    2

      local.get    0
      i32.const    -4
      i32.add              # count -= 4
      local.set    0

      local.get    0
      i32.const    0       # count > 0
      i32.gt_s
      br_if        0       # loop

    end_loop
    end_function
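The shift trick in the kernel above can be written out in C to check its behavior: the float's bit pattern is loaded as an i32, `(v >> 31) - 1` yields an all-ones mask for non-negative inputs and zero for negative ones, and ANDing it in implements ReLU. This is a sketch mirroring the wasm32 listing, not the XNNPACK source; the function name is illustrative, and `__builtin_prefetch` stands in for the proposed wasm prefetch at the same point in the loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ReLU via the sign-shift mask trick from the wasm32 kernel above. */
static void relu_shr_x1(const float *x, float *y, size_t n) {
  for (size_t i = 0; i < n; i++) {
    uint32_t v;
    memcpy(&v, &x[i], sizeof v);      /* load float bits as i32        */
    uint32_t mask = (v >> 31) - 1;    /* 0xFFFFFFFF if x[i] >= 0, else 0 */
    __builtin_prefetch((const char *)&x[i] + 448, 0, 3); /* src + 448  */
    v &= mask;                        /* negative inputs become +0.0f  */
    memcpy(&y[i], &v, sizeof v);      /* store float                   */
  }
}
```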
sunfishcode commented 4 years ago

How did you determine that 7 is a good number of cache lines ahead to prefetch?

There are several ways that explicit prefetches can end up making programs slower. Can you discuss what guidance we might be able to give wasm producers and developers as to how to use prefetches to get reliable speedups?

fbarchard commented 4 years ago

In libyuv I inserted 223 prfm instructions into all the aarch64 functions, tried versions with each offset, and then ran on all the 64-bit CPUs; 448 was the fastest overall. For the initial test I used 720p, a realistic but larger size. Initially I put the prefetch at the end of the loop, which is not a bad choice. Later I tuned its location to minimize overhead when running on smaller images.

The biggest wins are functions with a bit of matrix math but non-trivial memory traffic, where the math should be the bottleneck; prefetch helps achieve that.

The offset used should really be a function of how many bytes your loop consumes. Typical loops consume 2-4 vectors; a function that consumes less could use an offset of 128. It's not all that critical though: 448 would achieve the same speed on larger images, where prefetch works best, and neither makes a difference on cached images. A smaller offset helps mid-sized images.

Here's an example

void ARGBToYRow_NEON(const uint8_t* src_argb, uint8_t* dst_y, int width) {
  asm volatile(
      "movi       v4.8b, #25                     \n"  // B * 0.1016 coefficient
      "movi       v5.8b, #129                    \n"  // G * 0.5078 coefficient
      "movi       v6.8b, #66                     \n"  // R * 0.2578 coefficient
      "movi       v7.8b, #16                     \n"  // Add 16 constant
      "1:                                        \n"
      "ld4        {v0.8b,v1.8b,v2.8b,v3.8b}, [%0], #32 \n"  // load 8 ARGB
      "prfm       pldl1keep, [%0, 448]           \n"
      "subs       %w2, %w2, #8                   \n"  // 8 processed per loop.
      "umull      v3.8h, v0.8b, v4.8b            \n"  // B
      "umlal      v3.8h, v1.8b, v5.8b            \n"  // G
      "umlal      v3.8h, v2.8b, v6.8b            \n"  // R
      "uqrshrn    v0.8b, v3.8h, #8               \n"  // 16 bit to 8 bit Y
      "uqadd      v0.8b, v0.8b, v7.8b            \n"
      "st1        {v0.8b}, [%1], #8              \n"  // store 8 pixels Y.
      "b.gt       1b                             \n"
      : "+r"(src_argb),  // %0
        "+r"(dst_y),     // %1
        "+r"(width)      // %2
      :
      : "cc", "memory", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7");
}

I tested the other forms of prefetch on ARM and only got small wins from prefetching the destination, and from the streaming and L2 variants. L1 KEEP was the useful variation on ARM.

The location is not critical in the cases where prefetch is effective, just in the cases where it is not. Much like a nop, you need to be able to schedule it somewhere it has no impact. After all the loads, during math, is a good location.

I'd prefer the V8 runtime compiler not reorder it, so we can try the instruction in different locations. Complex functions will need several prefetches, and they should be scheduled based on memory order and on when cycles are available. Memory order is important, and a prefetch allows control over memory order to be decoupled from the loads, so the loads can be scheduled for math efficiency.