JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

Deprecate LV for Julia >= 1.11-DEV #519

Closed chriselrod closed 9 months ago

codecov[bot] commented 9 months ago

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparison is base (d2f749d) 88.64% compared to head (ad2139e) 80.29%.

| Files | Patch % | Lines |
|---|---|---|
| src/LoopVectorization.jl | 16.66% | 10 Missing :warning: |
| src/condense_loopset.jl | 50.00% | 4 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #519      +/-   ##
==========================================
- Coverage   88.64%   80.29%   -8.35%
==========================================
  Files          39       39
  Lines        9600     9608       +8
==========================================
- Hits         8510     7715     -795
- Misses       1090     1893     +803
```


chriselrod commented 9 months ago

Fixes #518

stillyslalom commented 9 months ago

It's sad to see this project get mothballed, but I understand not wanting to deal with the maintenance burden when you're trying to get LoopModels up to parity. Do you think LoopModels will be usable in the v1.11-release timeframe, or is it still too early in its lifecycle to guess on a release date?

willow-ahrens commented 8 months ago

I agree, this package will be sorely missed. Thank you Chris for your work on this package, it is quite an achievement.

For future implementers, would you share any ideas towards a more minimal version of the package with less maintenance burden and less compile time? Something that e.g. faithfully unrolls the inner loop, manually applies simd instructions, and tries different loop permutations without getting too serious about it?

chriselrod commented 8 months ago

> Something that e.g. faithfully unrolls the inner loop

LV will often unroll and SIMD one of the outer loops, not the innermost! This is an important point to emphasize when trying to replicate its performance, as vectorizing outer loops is often much more profitable (e.g. "unroll and jam").
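To make "unroll and jam" concrete, here is a minimal hand-written sketch in plain Julia. It is not LV's generated code; the function names and the unroll factor of 4 are assumptions chosen for illustration. The outer `i` loop of a matrix-vector product is unrolled by 4, and the four copies of the inner `j` loop are fused ("jammed") into one, so each `x[j]` is loaded once and reused for four rows.

```julia
# Baseline: i is the outer loop, j is the inner (reduction) loop.
function matvec_naive!(y, A, x)
    m, n = size(A)
    @inbounds for i in 1:m
        acc = zero(eltype(y))
        for j in 1:n
            acc += A[i, j] * x[j]
        end
        y[i] = acc
    end
    return y
end

# Unroll-and-jam by hand (hypothetical helper, unroll factor 4 is an assumption):
# the outer i loop is unrolled by 4 and the four inner j loops are fused into one.
function matvec_unroll_jam!(y, A, x)
    m, n = size(A)
    T = eltype(y)
    i = 1
    @inbounds while i + 3 <= m
        a0 = a1 = a2 = a3 = zero(T)
        for j in 1:n
            xj = x[j]               # loaded once, reused for four rows
            a0 += A[i,     j] * xj  # rows i:i+3 are contiguous in a
            a1 += A[i + 1, j] * xj  # column-major Matrix
            a2 += A[i + 2, j] * xj
            a3 += A[i + 3, j] * xj
        end
        y[i] = a0; y[i + 1] = a1; y[i + 2] = a2; y[i + 3] = a3
        i += 4
    end
    # Remaining rows; kept as a plain scalar loop to keep the sketch short.
    @inbounds while i <= m
        acc = zero(T)
        for j in 1:n
            acc += A[i, j] * x[j]
        end
        y[i] = acc
        i += 1
    end
    return y
end
```

For example, with `A = reshape(Float64.(1:42), 7, 6)` and `x = Float64.(1:6)`, both functions fill `y` with the same values, since each row is still accumulated in the same `j` order.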

Another major component for getting good performance is code generation.

Relatively little code in LV was dedicated towards what it should do, and a fairly substantial amount towards actual code generation. I saw an example recently where someone reproduced what LV did, but performance was over 2x worse, simply because LV takes a lot of care in its implementation to generate good code following the execution plan it lays out. LLVM does a surprisingly terrible job optimizing indexing behavior, and can introduce huge amounts of overhead if one isn't careful.

Another point: if you care about architectures with wide vectors (especially AVX512), use predicates (masked operations) for the loop remainder instead of scalar cleanup loops.
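The control flow of a predicated remainder can be sketched in plain Julia. This is only an emulation under stated assumptions: the names `masked_step` and `masked_sum` and the default width of 8 lanes are hypothetical, and a real implementation would emit hardware masked loads rather than a per-lane branch. The point it illustrates is that the final partial block goes through the same vector-shaped code path as every full block, with inactive lanes switched off, so no separate scalar loop is needed.

```julia
# One block of N lanes. Lane l is active only if its index is in bounds;
# inactive lanes contribute zero and never read memory.
@inline function masked_step(acc::NTuple{N,T}, x::Vector{T}, base::Int, n::Int) where {N,T}
    ntuple(Val(N)) do l
        i = base + l
        v = i <= n ? x[i] : zero(T)   # emulated masked load
        acc[l] + v
    end
end

# Sum x in blocks of W lanes (W = 8 is an assumed width, not a recommendation).
function masked_sum(x::Vector{T}, ::Val{W} = Val(8)) where {T,W}
    n = length(x)
    acc = ntuple(_ -> zero(T), Val(W))
    base = 0
    while base < n
        acc = masked_step(acc, x, base, n)  # also handles the partial last block
        base += W
    end
    return sum(acc)   # horizontal reduction across lanes
end
```

With 11 elements and 8 lanes, the second block runs with only three active lanes; with wide vectors, a scalar cleanup loop could instead need up to `W - 1` scalar iterations.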

Unfortunately, llvmcalls are slow to compile. If possible, do not generate Julia Expr, but try to work at the LLVM level. This can help you avoid a host of other problems, such as LV not working with Julia's function multiversioning. If you really do want to stick with Julia, I'd suggest PRing Base to add tfunc support for SIMD vectors. Oscar and I (mostly Oscar) got a prototype working for the interpreter in a few minutes. That is, instead of needing llvmcall, we got Base.add_float, Base.mul_float, etc. working on SIMD types like NTuple{N,Core.VecElement{Float64}}, running through the interpreter. Mostly, all we did was delete a bunch of asserts, and it "just worked". Of course, we'd need it working for code gen, too, but that shouldn't be hard.
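For readers unfamiliar with the status quo being described, this is roughly what an `llvmcall`-based SIMD operation looks like today. It is a minimal sketch, not code from LV: the names `Vec4`, `vadd`, `tovec`, and `unvec` are hypothetical, and the width of 4 is an arbitrary choice. A homogeneous tuple of `Core.VecElement{Float64}` is passed to LLVM as a `<4 x double>` vector, and the addition is spelled out as LLVM IR.

```julia
# A 4-lane Float64 SIMD vector; Julia maps this tuple type to LLVM's <4 x double>.
const Vec4 = NTuple{4,Core.VecElement{Float64}}

# Vector add written as inline LLVM IR; %0 and %1 are the two arguments.
@inline function vadd(a::Vec4, b::Vec4)
    Base.llvmcall(
        """
        %res = fadd <4 x double> %0, %1
        ret <4 x double> %res
        """,
        Vec4, Tuple{Vec4,Vec4}, a, b)
end

# Small helpers (hypothetical) to convert to and from plain tuples.
tovec(t::NTuple{4,Float64}) = map(Core.VecElement, t)
unvec(v::Vec4) = map(e -> e.value, v)

r = unvec(vadd(tovec((1.0, 2.0, 3.0, 4.0)), tovec((10.0, 20.0, 30.0, 40.0))))
```

Every distinct operation, element type, and width needs its own IR string like this one, which is the compile-time cost that intrinsics accepting SIMD tuples directly would avoid.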

If SIMD code can be written to use add_float, add_int, etc., instead of llvmcall, I think that could improve its compile times fairly substantially. I'd like add_float_fast, etc., working too. But LV actually doesn't apply all the fast-math flags, so more granularity would be great, though that's an orthogonal issue (the fact that the nonans flag makes it difficult to check for NaNs makes it a nonstarter; LLVM propagates nonans more aggressively than I'd like).

In terms of maintenance burden, my suggestion would be to avoid anything that isn't standard, boring Julia code. Of course, that tends to be at odds with getting good performance. So your best bet would probably be to have a close, open dialogue with the core compiler team on getting standard and stable ways of doing everything you need that they approve of.

chriselrod commented 8 months ago

> It's sad to see this project get mothballed, but I understand not wanting to deal with the maintenance burden when you're trying to get LoopModels up to parity. Do you think LoopModels will be usable in the v1.11-release timeframe, or is it still too early in its lifecycle to guess on a release date?

It's too early to guess at a release date, but I would not expect it by Julia 1.11.