fbarchard opened this issue 4 years ago
Wonder if this instruction could be similar to what we talked about with the instrumentation instruction. Maybe specialty instructions. Not sure we want to group them together, but effectively: if supported, it will execute; if not supported, it will be ignored. The instrumentation instruction takes a byte as an immediate. This would probably take an i32 as an immediate?
Yes, this is another instance of a proposed nop instruction with unmodeled side effects, just like the ITT proposal.
@fbarchard How would you expect prefetch instructions to be ordered with respect to other instructions in an optimizing WebAssembly engine? You wouldn't want it to be moved around too much, but if you don't let anything move past it, it might inhibit optimizations and no longer be beneficial.
I can let Frank reply too. As you said, one or two instructions of movement in the prefetch may not matter, similar to the ITT proposal. We just wouldn't want it to be moved inside a nested loop block. Before the loop is okay; after the loop would be bad, if that makes sense?
There are two cases to test performance on.
In the first case, the location of the prefetch doesn't matter all that much. It behaves much like a load and can be scheduled anywhere you put loads, preferably during some math. At the end of the loop is fine. Optimal appears to be somewhere after the loads.
In the second case, the location of the prefetch instruction matters more. On ARM you can do up to 3 math instructions per cycle, and 1 prefetch, so you schedule prefetches after math so they co-issue for free, preferably after a high-latency math instruction.
Here's the gauss example (from libyuv) with prefetch on ARM. I've scheduled the prefetches after math instructions:
// filter 5 rows with 1, 4, 6, 4, 1 coefficients to produce 1 row.
void GaussCol_NEON(const uint16_t* src0,
const uint16_t* src1,
const uint16_t* src2,
const uint16_t* src3,
const uint16_t* src4,
uint32_t* dst,
int width) {
asm volatile(
"movi v6.8h, #4 \n" // constant 4
"movi v7.8h, #6 \n" // constant 6
"1: \n"
"ld1 {v1.8h}, [%0], #16 \n" // load 8 samples, 5 rows
"ld1 {v2.8h}, [%4], #16 \n"
"uaddl v0.4s, v1.4h, v2.4h \n" // * 1
"prfm pldl1keep, [%0, 448] \n"
"uaddl2 v1.4s, v1.8h, v2.8h \n" // * 1
"ld1 {v2.8h}, [%1], #16 \n"
"umlal v0.4s, v2.4h, v6.4h \n" // * 4
"prfm pldl1keep, [%1, 448] \n"
"umlal2 v1.4s, v2.8h, v6.8h \n" // * 4
"ld1 {v2.8h}, [%2], #16 \n"
"umlal v0.4s, v2.4h, v7.4h \n" // * 6
"prfm pldl1keep, [%2, 448] \n"
"umlal2 v1.4s, v2.8h, v7.8h \n" // * 6
"ld1 {v2.8h}, [%3], #16 \n"
"umlal v0.4s, v2.4h, v6.4h \n" // * 4
"prfm pldl1keep, [%3, 448] \n"
"umlal2 v1.4s, v2.8h, v6.8h \n" // * 4
"subs %w6, %w6, #8 \n" // 8 processed per loop
"st1 {v0.4s,v1.4s}, [%5], #32 \n" // store 8 samples
"prfm pldl1keep, [%4, 448] \n"
"b.gt 1b \n"
: "+r"(src0), // %0
"+r"(src1), // %1
"+r"(src2), // %2
"+r"(src3), // %3
"+r"(src4), // %4
"+r"(dst), // %5
"+r"(width) // %6
:
: "cc", "memory", "v0", "v1", "v2", "v6", "v7");
}
I think what Thomas is concerned with is that when you put a prefetch instruction somewhere, it may skid past where you think it will be in the machine code. As long as it doesn't fall inside a code path at an address that is jumped to, I think it would be okay. If the prefetch falls below an address that is a jump target, it may end up in a loop where we don't want it to prefetch, e.g. a nested loop.
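A hypothetical sketch of that bad case, in asm-style pseudocode (the labels and loop structure here are illustrative, not from libyuv):

```
outer:
  # ...loads and math for the outer loop body...
  # intended location: prefetch here, once per outer iteration
inner:                      # branch target for the nested loop
  prefetch 448              # skidded below the label: now issues on
                            # every inner iteration, wasting bandwidth
  # ...inner loop body...
  b.gt inner
  b.gt outer
```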
Here's an assembly function where I add a proposed WASM prefetch instruction. I picked a place in the inner loop where the instruction should issue for free without stalls and be most effective, while having no negative impact on data that is already in cache.
# Copyright 2020 Google LLC
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
# void xnn_f32_relu_ukernel__wasm32_shr_x1(
# size_t n, 0
# const float* x, 1
# float* y, 2
# const union params) 3 unused
# locals
# float v 4
# float mask 5
.text
.section .text.xnn_f32_relu_ukernel__wasm32_shr_x1,"",@
.hidden xnn_f32_relu_ukernel__wasm32_shr_x1
.globl xnn_f32_relu_ukernel__wasm32_shr_x1
.type xnn_f32_relu_ukernel__wasm32_shr_x1,@function
xnn_f32_relu_ukernel__wasm32_shr_x1:
.functype xnn_f32_relu_ukernel__wasm32_shr_x1 (i32, i32, i32, i32) -> ()
.local i32, i32 # 4 - value, 5 - mask
loop
local.get 1 # src
i32.load 0 # load float from src
local.set 4
local.get 1 # src += 4
i32.const 4
i32.add
local.set 1
local.get 4 # (v >> 31) - 1) & v
i32.const 31
i32.shr_u
local.set 5
local.get 5
i32.const -1
i32.add
local.set 5
local.get 1 # src
prefetch 448 # prefetch from src + 448 (7 cache lines)
local.get 4
local.get 5
i32.and
local.set 4
local.get 2 # dst
local.get 4
i32.store 0 # store float
local.get 2 # dst += 4
i32.const 4
i32.add
local.set 2
local.get 0
i32.const -4
i32.add # count -= 4
local.set 0
local.get 0
i32.const 0 # count > 0
i32.gt_s
br_if 0 # loop
end_loop
end_function
How did you determine that 7 is a good number of cache lines ahead to prefetch at?
There are several ways that explicit prefetches can end up making programs slower. Can you discuss what guidance we might be able to give wasm producers and developers as to how to use prefetches to get reliable speedups?
In libyuv I inserted 223 prfm instructions into all the aarch64 functions, tried versions with each offset, and then ran on all the 64-bit CPUs. 448 was the fastest overall. For the initial test I used 720p, a realistic but larger size. Initially I put the prefetch at the end of the loop, which is not a bad choice. Later I tuned the location of the prefetch to minimize overhead when running on smaller images.

The biggest wins are functions with a bit of matrix math but non-trivial memory traffic, where the math should be the bottleneck, and prefetch helps achieve that.

The offset used should really be a function of how many bytes your loop consumes. Typical loops consume 2-4 vectors. A function that consumes less could use an offset of 128. It's not all that critical though: 448 would achieve the same speed on larger images, where prefetch works best. Neither makes a difference on cached images. A smaller offset helps mid-sized images.
Here's an example
void ARGBToYRow_NEON(const uint8_t* src_argb, uint8_t* dst_y, int width) {
asm volatile(
"movi v4.8b, #25 \n" // B * 0.1016 coefficient
"movi v5.8b, #129 \n" // G * 0.5078 coefficient
"movi v6.8b, #66 \n" // R * 0.2578 coefficient
"movi v7.8b, #16 \n" // Add 16 constant
"1: \n"
"ld4 {v0.8b,v1.8b,v2.8b,v3.8b}, [%0], #32 \n" // load 8 ARGB
"prfm pldl1keep, [%0, 448] \n"
"subs %w2, %w2, #8 \n" // 8 processed per loop.
"umull v3.8h, v0.8b, v4.8b \n" // B
"umlal v3.8h, v1.8b, v5.8b \n" // G
"umlal v3.8h, v2.8b, v6.8b \n" // R
"uqrshrn v0.8b, v3.8h, #8 \n" // 16 bit to 8 bit Y
"uqadd v0.8b, v0.8b, v7.8b \n"
"st1 {v0.8b}, [%1], #8 \n" // store 8 pixels Y.
"b.gt 1b \n"
: "+r"(src_argb), // %0
"+r"(dst_y), // %1
"+r"(width) // %2
:
: "cc", "memory", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7");
}
I tested the other forms of prefetch for ARM and only got small wins from prefetching the destination, streaming prefetch, and L2 prefetch. L1 KEEP was the useful variation on ARM.
The location is not critical for when it is effective, just for when it is not. Much like a nop, you need to be able to schedule it somewhere that has no impact. After all the loads, during math, is a good location.
I'd prefer the v8 runtime compiler not reorder it, so we can try the instruction in different locations. Complex functions will need several prefetches, and they should be scheduled based on memory order and when cycles are available. Memory order is important, and doing a prefetch allows control over the memory order to be decoupled from the loads, so the loads can be scheduled according to math efficiency.
Could you add a prefetch instruction for WASM? The syntax would be similar to a load.
The load instruction uses a local.get for the base pointer and an immediate offset, and then puts the result back on the stack.
A prefetch would have a base and offset, but no result.
On arm64 this is implemented with prfm pldl1keep, [src, 448]; on x64, prefetcht0 448(src).
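In the assembly-style wasm text format used above, the parallel with a load might look like this sketch (`prefetch` is the hypothetical instruction; the 448 is an immediate offset, mirroring a load's offset immediate):

```
local.get 1      # src: base pointer from a local, same as for a load
i32.load 448     # load from src + 448: pushes the result on the stack
drop
local.get 1      # src
prefetch 448     # proposed: hints the line at src + 448, pushes nothing
```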
Performance improvement varies by CPU and function. On Cortex-A53 a typical speedup is 20%. On Atom Cedarview, an SSSE3 function that converts RGB to grayscale speeds up 9%.