SIMD support - Githubissues

ncannasse commented 10 years ago

I'm not sure how much @hughsando has been thinking about adding SIMD support, but that would be interesting to have.

Today Mozilla released SIMD.js, an API that gets JIT-compiled to SIMD instructions.

https://hacks.mozilla.org/2014/10/introducing-simd-js/

I think we could follow the same API in Haxe/C++ so that it produces SIMD instructions when compiled.

For GCC: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

For MSVC: http://msdn.microsoft.com/en-us/library/708ya3be(v=vs.90).aspx

delahee commented 10 years ago

Hey Nico.

SIMD support tends to produce a lot of data bloat, especially if we cannot control data pointer inlining. A bad SIMD implementation can perform no better than, or even worse than, non-SIMD code, and be worse in code size. This is because switching the CPU to VFPU mode has a high fixed cost per hardware thread.

If you want to do it well, Haxe has to be able to allocate properly aligned vectors and/or automatically generate high-level SIMD constructs (called intrinsics); otherwise the gain will be very small.

For C++, the best we can do is use a third-party library or do the wrapping ourselves (which involves a heavy load of work).

For example, some Android devices have NEON support and some do not. Managing an Android release that wants to use NEON is thus a real headache...

My 2 cents: start by doing something for SIMD.js. The C++ SIMD work is too big of a nuisance if done improperly and should be led very separately.

Good luck and be wise.

ncannasse commented 10 years ago

@delahee things are simpler than that. You can just define a Float4 type that is an aligned struct containing four floats, then provide add/div/mult/rsqrt functions that perform vectorized operations. If the hardware does not support it, you fall back on using the FPU, which will be exactly the same as if you had written non-SIMD code.
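As an illustration of the Float4 idea described above, here is a minimal C++ sketch, not the hxcpp implementation. It assumes an x86 target where SSE availability is decided at compile time; on anything else it falls back to plain scalar loops. The names `Float4`, `rsqrt4` and the `FLOAT4_USE_SSE` macro are made up for this example.

```cpp
#include <cmath>

// Decide at compile time whether the SSE path is available.
// FLOAT4_USE_SSE is a hypothetical macro used only in this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define FLOAT4_USE_SSE 1
#else
  #define FLOAT4_USE_SSE 0
#endif

// Hypothetical Float4: four floats in a 16-byte aligned struct.
struct alignas(16) Float4 {
    float v[4];
};

inline Float4 operator+(const Float4& a, const Float4& b) {
    Float4 r;
#if FLOAT4_USE_SSE
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
#else
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];  // scalar fallback
#endif
    return r;
}

inline Float4 operator*(const Float4& a, const Float4& b) {
    Float4 r;
#if FLOAT4_USE_SSE
    _mm_store_ps(r.v, _mm_mul_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
#else
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
#endif
    return r;
}

inline Float4 operator/(const Float4& a, const Float4& b) {
    Float4 r;
#if FLOAT4_USE_SSE
    _mm_store_ps(r.v, _mm_div_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
#else
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] / b.v[i];
#endif
    return r;
}

// Reciprocal square root of each lane. The SSE instruction is an
// approximation, so results may differ slightly from the scalar path.
inline Float4 rsqrt4(const Float4& a) {
    Float4 r;
#if FLOAT4_USE_SSE
    _mm_store_ps(r.v, _mm_rsqrt_ps(_mm_load_ps(a.v)));
#else
    for (int i = 0; i < 4; ++i) r.v[i] = 1.0f / std::sqrt(a.v[i]);
#endif
    return r;
}
```

Note that the aligned loads rely on `alignas(16)`, which covers stack and static objects; heap-allocated instances need an aligned allocator, which is the allocation concern raised elsewhere in this thread.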

delahee commented 10 years ago

Agreed, but the backend code for C++ is a massive piece of software engineering. Also, you'll have to provide float32x4 and float64x4 (because on NEON, registers are 64 bits...) and maybe i8x4 and i32x4, because of VFPU context switching, etc...

I was just hinting at the fact that on Hugh's side it is not a small amount of work. I have been very aware of SIMD issues, pitfalls, etc. for many years, so I was just adding a little "be careful" contribution because, basically, I know that I will end up working with (debugging) them.

Aside from that, I am very happy (and eager) that we can achieve this. I just vote that we start small and expand later (like what we did for sys, html5, etc.)... So you can start soon enough and we catch up as the need arises.

(From a personal point of view, I am very happy the HF leads this effort :k )


David Elahee

hughsando commented 10 years ago

This can mostly be done with an extern/header file style extension. For stack temps, it could be done this way. The main differences I see are the need to align structures at allocation time, and whether we want to support 128-bit primitives via Dynamic. Maybe I can look at this at the same time I look at supporting 64 bits (e.g., int64). The developer will need to check for support before calling the functions, because without JIT, there is no good way of falling back to a scalar implementation without killing any performance gains. I would see hxcpp just providing 2 types, Simd64 and Simd128, and then using abstracts to interpret, cast and process these "bit buckets", moving most of the complexity into the Haxe code.
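To make the "bit bucket" idea above more concrete, here is a rough C++ sketch of a 128-bit aligned container whose lanes are interpreted by a separate typed view, loosely mirroring what Haxe abstracts would do on top of a native `Simd128`. Only the name `Simd128` comes from the comment; the layout and the `Lanes128` helper are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Opaque 128-bit "bit bucket": it carries no element type of its own.
struct alignas(16) Simd128 {
    unsigned char bytes[16];
};

// Hypothetical typed view over the bucket. T decides how the 128 bits are
// split into lanes (e.g. 4 x float, 4 x uint32_t, 2 x double).
template <typename T>
struct Lanes128 {
    static_assert(sizeof(T) <= 16 && 16 % sizeof(T) == 0,
                  "lane type must evenly divide 128 bits");

    static constexpr int count = static_cast<int>(16 / sizeof(T));

    // memcpy is used so that reinterpreting the bits stays well-defined.
    static T get(const Simd128& s, int lane) {
        T out;
        std::memcpy(&out, s.bytes + static_cast<std::size_t>(lane) * sizeof(T), sizeof(T));
        return out;
    }

    static void set(Simd128& s, int lane, T value) {
        std::memcpy(s.bytes + static_cast<std::size_t>(lane) * sizeof(T), &value, sizeof(T));
    }
};
```

The same bucket can then be written as floats and read back as raw 32-bit integers, for example `Lanes128<float>::set(s, 0, 1.0f)` followed by `Lanes128<std::uint32_t>::get(s, 0)`, which is the kind of cast the comment proposes to leave to the Haxe side.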

delahee commented 10 years ago

As for support, maybe libs like Vc already know how to do hot switching efficiently (after all, swapping the function pointer could suffice)...

As for deployment, the other classic solution is to pack one overlay with SIMD and another one without (à la universal binaries). I think opening this possibility would be cool (although I would definitely not use it)...
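The function-pointer "hot switching" mentioned above could look roughly like the following C++ sketch. It is an assumption-laden illustration, not how Vc or hxcpp do it: the function names are hypothetical, the CPU check uses the GCC/Clang builtin `__builtin_cpu_supports` on x86 only, and every other compiler or architecture conservatively takes the scalar path.

```cpp
#include <cstddef>

// Signature shared by all implementations (hypothetical names throughout).
typedef void (*AddArraysFn)(const float*, const float*, float*, std::size_t);

// Scalar reference implementation, always available.
void addArraysScalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if defined(__SSE__) && (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  #include <xmmintrin.h>
  #define SIMD_DISPATCH_HAVE_SSE 1

// SSE implementation: four floats per iteration, scalar loop for the tail.
void addArraysSse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads/stores, since callers may pass any pointer.
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
#else
  #define SIMD_DISPATCH_HAVE_SSE 0
#endif

// Pick an implementation once, based on what the running CPU reports.
AddArraysFn selectAddArrays() {
#if SIMD_DISPATCH_HAVE_SSE
    if (__builtin_cpu_supports("sse")) return &addArraysSse;
#endif
    return &addArraysScalar;
}

// Callers go through this pointer and never test for support themselves.
static const AddArraysFn addArrays = selectAddArrays();
```

In a real build the SIMD variant would normally live in its own translation unit compiled with the matching instruction-set flags, so that the rest of the program still runs on CPUs without that extension.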

Good luck.

hughsando commented 8 years ago

I'm closing this one now. I think the native tools are at a point where this can be developed as part of a haxelib and then added to hxcpp if appropriate. The only thing that might be needed is a 128-bit aligned GC alloc. I can look at adding this if there is sufficient interest.
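For readers wondering what a 128-bit (16-byte) aligned allocation involves, here is a small C++ sketch using plain `malloc`, outside of any garbage collector. It is not the hxcpp GC API; `alignedAlloc16` and `alignedFree16` are hypothetical helper names. The technique is the usual one: over-allocate, round the pointer up to the next 16-byte boundary, and stash the original pointer just before the aligned block so it can be freed later.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Returns a 16-byte aligned block of at least `size` bytes, or nullptr.
void* alignedAlloc16(std::size_t size) {
    const std::size_t alignment = 16;
    // Extra room: alignment slack plus one pointer to remember the raw block.
    const std::size_t total = size + alignment + sizeof(void*);
    void* raw = std::malloc(total);
    if (!raw) return nullptr;

    // Skip the slot reserved for the raw pointer, then align what follows.
    void* p = static_cast<char*>(raw) + sizeof(void*);
    std::size_t space = total - sizeof(void*);
    void* aligned = std::align(alignment, size, p, space);
    if (!aligned) {
        std::free(raw);
        return nullptr;
    }

    // Store the raw pointer immediately before the aligned block.
    static_cast<void**>(aligned)[-1] = raw;
    return aligned;
}

// Frees a block obtained from alignedAlloc16.
void alignedFree16(void* p) {
    if (p) std::free(static_cast<void**>(p)[-1]);
}
```

A GC-aware version would have to keep this alignment when objects are allocated or moved, which is presumably the extra work the comment above refers to.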

HaxeFoundation / hxcpp

SIMD support #126