Inefficient x64 codegen for conversion instructions

abrown commented 4 years ago

Certain SIMD conversions seem to have inefficient lowerings in x64. f32x4.convert_i32x4_u is lowered to 8 instruction by v8. The signed version, f32x4.convert_i32x4_s, on the other hand, is lowered to a single instruction.

I can't find the v8 implementation for i32x4.trunc_sat_f32x4_s and i32x4.trunc_sat_f32x4_u but I think the situation is the same: the signed version should have a single instruction lowering to CVTTPS2DQ and the unsigned version will require some longer sequence. [edit: this is incorrect, see #173 for a more correct discussion of this inefficiency]

The 64x2 versions of these instructions were dropped in #178. For similar reasons (@ngzhian: "because it is uncommon for such instructions to be used, and hardware support is not widespread"), should we remove the unsigned versions?

zeux commented 4 years ago

the signed version should have a single instruction lowering to CVTTPS2DQ

This is not the case, see #173.

abrown commented 4 years ago

Ah, good point; I didn't read #173 carefully enough. Let me edit out that part of the issue above.

tlively commented 4 years ago

Is it uncommon for these instructions to be used? At least as a user if I'm trying to convert a 64x2 vector, I only have two lanes to manually convert. If we remove the 32x4 conversions, users would have to write 4x more code to convert each lane. That ergonomic loss makes it more important to have 32x4 conversions than 64x2 conversions even if they are similarly unpopular. My guess is that the 32x4 conversions would be more common, although I do no know for sure.

zeux commented 4 years ago

Note that 64-bit conversions were dropped because they didn’t have an efficient implementation on any architecture, short of AVX512. What does lowering of unsigned conversions look like for ARM/Power?

abrown commented 4 years ago

For ARM in V8 it takes 1 instruction to lower f32x4.convert_i32x4_u.

jan-wassenberg commented 4 years ago

Is it uncommon for these instructions to be used?

32 bit [u]int<->float conversions are indeed common. Unfortunately, I doubt many developers are aware that signed is cheaper than unsigned on x86. If they were, signed would be sufficient in a large majority of cases; they would just have to cast to int first.

abrown commented 4 years ago

cc: @rrwinterton, @arunetm

arunetm commented 4 years ago

32 bit [u]int<->float conversions are indeed common. Unfortunately, I doubt many developers are aware that signed is cheaper than unsigned on x86. If they were, signed would be sufficient in a large majority of cases; they would just have to cast to int first.

Agree that the inefficiency of unsigned 32 bit conversions seems to be an easy to miss detail for developers. The spec's criteria for instr evaluation calls out for good support across modern architectures, avoidance of perf cliffs and evidences of wide app/library usages. The 8 instr vs, 1 instr is a high difference in cost of implementation IMO and could very well hide unexpected perf cliffs. So we may need to qualify their inclusion with more reasoning.

Any objections on signed 32 bit conversions being sufficient for the majority of cases?

rrwinterton commented 4 years ago

True this is a problem since we don't have the AVX3 ISA but this is fixed in ICL. Some cpu's can't take advantage of this moving forward. Jan you may find this interesting. https://godbolt.org/z/JD3mjf See this instruction selected and a nice unroll. I think we can generate some really cool code if we do this right with the right header files and even extend the wasm_simd256.h to include potential AVX3 instructions with ymm registers. I have a wasm_simd256.h header file I am working on. We should use the workload you are using as a proposed mix to start with and see what AVX3 instructions can provide.

jan-wassenberg commented 4 years ago

@rrwinterton Nice, lots of goodies in AVX3, thanks for sharing :) Good idea, I'd be happy to help with testing AVX2 and 3 benefits using JPEG XL.

dtig commented 4 years ago

I agree with @tlively about the ergonomic loss of not having the unsigned conversions, and it would be interesting to see what the exact application usage is.

This case is also particularly interesting because as @rrwinterton pointed out, the vcvtudq2ps instruction available with the AVX512 feature flag offers a 128-bit version of this for a single instruction implementation, so it is possible that engines in the future are able to take advantage of this. A broader question to answer here, how do we weigh current performance inefficiencies which may be addressed in the future with engines having better vector support?

My vote here would be for a portable abstraction that is future proof, i.e. if there are AVX+ instructions that map to one instruction, we should keep these in so engines can optimize when support is available, and this does fit the criteria of support being available in modern architectures, it's just that engines don't support them for codegen at this time. Choosing not to do this would be that we fall into trying to standardize subsets of extensions as they become available, and that model doesn't sound very appealing. As this is mostly an Intel-ism that IIRC we've discussed in the past (possibly before my time on this proposal), I'm curious to see what other opinions there are about this.

rrwinterton commented 4 years ago

In order to take advantage of AVX3 we would have to detect the platforms capable of doing the AVX3 VL instructions. There are several that we should be able to reduce the number of instructions for Intel platforms if they are capable of AVX3. I am creating a mapping of those instructions along with the promotion proposal for 256 length vectors. I am wondering if we don't use any native AVX2 and AVX ISA and just use AVX3 VL for the next generation by the time we move to the next gen WASM SIMD? Maybe in about 3 weeks I can present this idea? I will let @dtig decide the time frame. earliest is 3+ weeks for me.

zeux commented 4 years ago

I feel like AVX512 is not going to be practical for years to come as the dominant target ISA. The first laptop CPU that supports it shipped late last year. There’s no support from AMD side right now either. It’s interesting as a codegen option but not as a baseline.

WebAssembly / simd

Inefficient x64 codegen for conversion instructions #190