Ivorforce / NumDot

Tensor math and scientific computation for the Godot game engine.
https://numdot.readthedocs.io
MIT License

Add support for SIMD multi-arch exports #102

Open Ivorforce opened 1 month ago

Ivorforce commented 1 month ago

I discussed this on Discord with Claire: Some people might be willing to trade a substantially larger binary size for better top speeds on capable architectures.

I see 3 ways to approach it:

1. Dynamically rebind vatensor functions to different binaries depending on the runtime arch. This would probably be a lot of work, but it would unify everything under one self-enclosed binary with a common interface. Plus, a lot of non-critical code could be de-duplicated, e.g. reductions don't really benefit from AVX512 (from preliminary tests).
2. Offer multiple complete binaries based on arch feature tags. These don't exist yet, so Godot itself would have to be involved. It's less effort overall, but also a worse trade-off.
3. Fork / extend xtensor to support runtime checks itself. This may actually already be implemented; I still have to check "for real". I don't think it is, but it may be.

There might be another way, but I certainly don't know it.

Ivorforce commented 1 week ago

On macOS, it's possible to add an x86_64h slice, which targets Haswell and newer, instead of the default x86_64 slice. This should benefit us especially, since it enables AVX2, SSE4.2 and more by default. The slice is added to the binary and can speed up execution.

Ivorforce commented 4 days ago

The slice adds significant size to the binary, but only speeds up some functions substantially enough to justify the difference.

I think I have a better solution:

1. Determine which functions benefit from which SIMD additions (I can only test up to AVX2, unfortunately).
2. Make a Python script that uses features.py and scu.py functionality to compile all of these files separately, using the appropriate flag (I think `-march=x86-64-v2` may be most appropriate as a first test).

Then, either:

The former should be faster, but the latter makes for a cleaner, SIMD-agnostic interface, and I think I prefer it for that reason; the difference shouldn't be huge anyway. Duplication can be avoided by separately exposing 'smallest common denominator' functions for those that do some logic before dispatching into SIMD-specific code, though most are pretty minimal already.

The upside of this solution is that it's agnostic to the dispatch target: theoretically, it could even cover BLAS dispatch, e.g. if BLAS is installed locally (or loaded otherwise). The downside is that the dispatch call is a bit verbose, but that could probably be kept minimal.