dzhang314 / MultiFloats.jl

Fast, SIMD-accelerated extended-precision arithmetic for Julia
MIT License

Vectorization problems (and how to fix them) #10

Closed. haampie closed this issue 1 week ago

haampie commented 3 years ago

The current codegen is unfortunately not great. For instance, a trivial loop like this:

julia> function trivial_sum(xs::Vector{Float64x4})
         t = zero(Float64x4)
         @simd for i in 1:length(xs)
           t += @inbounds xs[i]
         end
         t
       end

generates the following inner loop:

L48:
    vmovsd  -24(%rdx), %xmm5                # xmm5 = mem[0],zero
    vmovsd  -16(%rdx), %xmm6                # xmm6 = mem[0],zero
    vmovsd  -8(%rdx), %xmm7                 # xmm7 = mem[0],zero
    vaddsd  %xmm5, %xmm4, %xmm9
    vsubsd  %xmm4, %xmm9, %xmm0
    vsubsd  %xmm0, %xmm9, %xmm3
    vsubsd  %xmm3, %xmm4, %xmm3
    vsubsd  %xmm0, %xmm5, %xmm0
    vaddsd  %xmm3, %xmm0, %xmm0
    vaddsd  %xmm6, %xmm2, %xmm3
    vsubsd  %xmm2, %xmm3, %xmm4
    vsubsd  %xmm4, %xmm3, %xmm5
    vsubsd  %xmm5, %xmm2, %xmm2
    vsubsd  %xmm4, %xmm6, %xmm4
    vaddsd  %xmm2, %xmm4, %xmm2
    vaddsd  %xmm7, %xmm1, %xmm4
    vsubsd  %xmm1, %xmm4, %xmm5
    vsubsd  %xmm5, %xmm4, %xmm6
    vsubsd  %xmm6, %xmm1, %xmm1
    vsubsd  %xmm5, %xmm7, %xmm5
    vaddsd  %xmm1, %xmm5, %xmm1
    vaddsd  (%rdx), %xmm8, %xmm5
    vaddsd  %xmm1, %xmm5, %xmm1
    vaddsd  %xmm0, %xmm3, %xmm5
    vsubsd  %xmm3, %xmm5, %xmm6
    vsubsd  %xmm6, %xmm5, %xmm7
    vsubsd  %xmm7, %xmm3, %xmm3
    vsubsd  %xmm6, %xmm0, %xmm0
    vaddsd  %xmm3, %xmm0, %xmm0
    vaddsd  %xmm2, %xmm4, %xmm3
    vsubsd  %xmm4, %xmm3, %xmm6
    vsubsd  %xmm6, %xmm3, %xmm7
    vsubsd  %xmm7, %xmm4, %xmm4
    vsubsd  %xmm6, %xmm2, %xmm2
    vaddsd  %xmm4, %xmm2, %xmm2
    vaddsd  %xmm0, %xmm3, %xmm4
    vsubsd  %xmm0, %xmm4, %xmm6
    vsubsd  %xmm6, %xmm4, %xmm7
    vsubsd  %xmm7, %xmm0, %xmm0
    vsubsd  %xmm6, %xmm3, %xmm3
    vaddsd  %xmm0, %xmm3, %xmm0
    vaddsd  %xmm0, %xmm2, %xmm0
    vaddsd  %xmm0, %xmm1, %xmm0
    vaddsd  %xmm0, %xmm4, %xmm1
    vsubsd  %xmm4, %xmm1, %xmm2
    vsubsd  %xmm2, %xmm0, %xmm0
    vaddsd  %xmm1, %xmm5, %xmm2
    vsubsd  %xmm5, %xmm2, %xmm3
    vsubsd  %xmm3, %xmm1, %xmm1
    vaddsd  %xmm2, %xmm9, %xmm4
    vsubsd  %xmm9, %xmm4, %xmm3
    vsubsd  %xmm3, %xmm2, %xmm3
    vaddsd  %xmm3, %xmm1, %xmm2
    vsubsd  %xmm3, %xmm2, %xmm3
    vsubsd  %xmm3, %xmm1, %xmm3
    vaddsd  %xmm3, %xmm0, %xmm1
    vsubsd  %xmm3, %xmm1, %xmm3
    vsubsd  %xmm3, %xmm0, %xmm8
    addq    $32, %rdx
    decq    %rcx
    jne L48

By relaxing the type restrictions in MultiFloats.jl a bit, and patching VectorizationBase.jl to add another Vec type that doesn't allow for certain fast-math ops, it should be possible to get a 4x speedup on AVX2 and potentially 8x on AVX512 (well, that's the upper bound :p).

A minimal set of changes that allows using VectorizationBase is here: https://github.com/haampie/MultiFloats.jl/commit/0e83c05213e4fe3647dc405eaf62763ca46d4989

The idea is to work with a type MultiFloat{Vec{M,T},N} and run the basic *, /, +, and - operations on 2, 4, or 8 numbers simultaneously.
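
To illustrate why that works: MultiFloat addition is built from chains of the error-free two_sum transformation (the vaddsd/vsubsd sequences in the assembly above). Below is a minimal sketch of that kernel, written generically over the limb type (the textbook version, not the package's actual source):

# Sketch: error-free two_sum, generic over the limb type T.
@inline function two_sum(a::T, b::T) where {T}
    s = a + b                      # rounded sum
    v = s - a
    e = (a - (s - v)) + (b - v)    # rounding error of s, recovered exactly
    return s, e
end

With T = Float64 this runs one element at a time; with T = Vec{4,Float64} from VectorizationBase every operation above becomes a single packed instruction, which is where the 4x/8x factor would come from.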

dzhang314 commented 3 years ago

Hey @haampie , I've unfortunately noticed the same issue. I've been able to get very simple loops like @inbounds c[i] = a[i] + b[i] to vectorize for Float64x[2, 3, 4] on -O3, but it breaks down for larger MultiFloat types, and reduction loops have never worked. The failure of reductions to vectorize is totally expected, since Julia doesn't know that MultiFloat addition can be reordered -- only that the scalar additions within the MultiFloat addition function can be.

There are two things I would like to fix here. First, there are loops like @inbounds c[i] = a[i] + b[i] * c[i] for Float64x8 which Julia should in principle be able to vectorize, but doesn't -- I assume there must be some threshold inside Julia or LLVM that says "don't bother trying to vectorize loops past a certain size." Second, in order to enable vectorization of reduction loops, I would like to be able to inform Julia that inside a @simd block, MultiFloat arithmetic operations can be freely re-associated.

It looks like the intent of the changes you're proposing here is to allow programmers to work around these issues by writing explicitly vectorized code using MultiFloat{Vec{M,T}, N}. If that's the case, then I would like to avoid generalizing the type restrictions all the way down to Real -- I don't want people thinking they can get a 256-bit integer by writing MultiFloat{Int256, 2}. Instead, I think we should proceed by changing MultiFloats.AF to a Union of Float16, Float32, Float64, BigFloat, and Vec types.

(Indeed, AbstractFloat was already too permissive to begin with -- MultiFloat{Float64x2, 2} should not be allowed either, since the MultiFloat arithmetic algorithms depend crucially on IEEE round-to-nearest semantics for error propagation. The only reason I used AbstractFloat instead of IEEEFloat in the first place was that I wanted to do some testing with MultiFloat{BigFloat, N}.)
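
A concrete example of that dependence (illustrative numbers, not from the package's test suite):

a, b = 1.0e16, 1.0
s = a + b                      # 1.0e16 -- b is entirely lost to rounding
v = s - a
e = (a - (s - v)) + (b - v)    # 1.0    -- the discarded part, recovered exactly

The pair (s, e) represents a + b without error, but only under round-to-nearest; a compiler allowed to reassociate could simplify the expression for e algebraically (to zero) and destroy the error term.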

haampie commented 3 years ago

Just as an example:

using MultiFloats, VectorizationBase

using VectorizationBase: extractelement, pick_vector_width
using MultiFloats: renormalize

# Pack M scalar MultiFloats into one MultiFloat whose limbs are Vec{M,T} (an in-register transpose).
@generated function MultiFloatOfVec(fs::NTuple{M,MultiFloat{T,N}}) where {T,M,N}
    exprs = [:(Vec($([:(fs[$j]._limbs[$i]) for j=1:M]...))) for i=1:N]

    return quote
        MultiFloat(tuple($(exprs...)))
    end
end

# Unpack a vectorized MultiFloat back into a tuple of M scalar MultiFloats.
@generated function TupleOfMultiFloat(fs::MultiFloat{Vec{M,T},N}) where {T,M,N}
    exprs = [:(MultiFloat(tuple($([:(extractelement(fs._limbs[$j], $i)) for j=1:N]...)))) for i=0:M-1]
    return quote
        tuple($(exprs...))
    end
end

function trivial_sum(xs)
    t = zero(eltype(xs))
    @inbounds for x in xs
        t += x
    end
    return t
end

function vectorized_sum(xs::Vector{MultiFloat{T,N}}) where {T,N}
    M = pick_vector_width(T)

    t = zero(MultiFloat{Vec{M,T},N})

    @inbounds for i = 1:M:length(xs)-M+1
        t += MultiFloatOfVec(ntuple(k -> xs[i + k - 1], M))
    end

    return +(TupleOfMultiFloat(t)...)
end

function trivial_dot(xs, ys)
    t = zero(eltype(xs))
    @inbounds for i = 1:length(xs)
        t += xs[i] * ys[i]
    end
    return t
end

function vectorized_dot(xs::Vector{MultiFloat{T,N}}, ys::Vector{MultiFloat{T,N}}) where {T,N}
    M = pick_vector_width(T)

    t = zero(MultiFloat{Vec{M,T},N})

    @inbounds for i = 1:M:length(xs)-M+1
        x = MultiFloatOfVec(ntuple(k -> xs[i + k - 1], M))
        y = MultiFloatOfVec(ntuple(k -> ys[i + k - 1], M))
        t += x * y
    end

    return +(TupleOfMultiFloat(t)...)
end

using BenchmarkTools

random_vec(::Type{MultiFloat{T,N}}, k) where {T,N} =
    [renormalize(MultiFloat(ntuple(i -> rand(T) * eps(T)^(i-1), N))) for _ = 1:k]

function benchmark_dot(::Type{T}) where {T<:MultiFloat}
    xs = random_vec(T, 4096)
    ys = random_vec(T, 4096)

    @show vectorized_dot(xs, ys) - trivial_dot(xs, ys)

    vectorized = @benchmark vectorized_dot($xs, $ys)
    trivial = @benchmark trivial_dot($xs, $ys)

    return vectorized, trivial
end

function benchmark_sum(::Type{T}) where {T<:MultiFloat}
    xs = random_vec(T, 4096)

    @show vectorized_sum(xs) - trivial_sum(xs)

    vectorized = @benchmark vectorized_sum($xs)
    trivial = @benchmark trivial_sum($xs)

    return vectorized, trivial
end

Benchmark timings:

julia> benchmark_sum(Float64x8)
vectorized_sum(xs) - trivial_sum(xs) = 0.0
(Trial(59.062 μs), Trial(240.739 μs))

julia> benchmark_sum(Float64x4)
vectorized_sum(xs) - trivial_sum(xs) = 3.1115076389305709e-61
(Trial(21.876 μs), Trial(88.099 μs))

julia> benchmark_dot(Float64x8)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -1.3849467926678604e-127
(Trial(328.814 μs), Trial(854.370 μs))

julia> benchmark_dot(Float64x4)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -7.7787690973264271e-62
(Trial(37.992 μs), Trial(128.553 μs))

To replicate, install a patched VectorizationBase:

] add https://github.com/haampie/VectorizationBase.jl#without-fastmath-everywhere

I think this is the best way to do it in Julia.

haampie commented 3 years ago

Seems like dot doesn't entirely inline :(.

haampie commented 3 years ago

Some more data points for AVX-512 on an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz with vectors of size 2^13:

julia> benchmark_sum(Float64x8)
vectorized_sum(xs) - trivial_sum(xs) = -5.9091063153828709e-126
(Trial(124.202 μs), Trial(499.626 μs))

julia> benchmark_sum(Float64x7)
vectorized_sum(xs) - trivial_sum(xs) = -4.5240823300086601e-109
(Trial(100.521 μs), Trial(422.145 μs))

julia> benchmark_sum(Float64x6)
vectorized_sum(xs) - trivial_sum(xs) = 6.7116512220867355e-93
(Trial(78.467 μs), Trial(339.790 μs))

julia> benchmark_sum(Float64x5)
vectorized_sum(xs) - trivial_sum(xs) = -4.7498927053019445e-77
(Trial(47.538 μs), Trial(257.020 μs))

julia> benchmark_sum(Float64x4)
vectorized_sum(xs) - trivial_sum(xs) = -2.6058876476043531e-60
(Trial(33.474 μs), Trial(187.179 μs))

julia> benchmark_sum(Float64x3)
vectorized_sum(xs) - trivial_sum(xs) = -3.7835058536770061e-44
(Trial(17.097 μs), Trial(121.135 μs))

julia> benchmark_sum(Float64x2)
vectorized_sum(xs) - trivial_sum(xs) = -4.543838814073028e-28
(Trial(9.002 μs), Trial(63.851 μs))

and

julia> benchmark_dot(Float64x8)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 3.7855212332921517e-126
(Trial(555.383 μs), Trial(1.735 ms))

julia> benchmark_dot(Float64x7)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -7.4846950312643274e-110
(Trial(275.906 μs), Trial(1.280 ms))

julia> benchmark_dot(Float64x6)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 2.7865337663127964e-93
(Trial(200.465 μs), Trial(855.838 μs))

julia> benchmark_dot(Float64x5)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -6.0453179885661112e-77
(Trial(131.219 μs), Trial(558.778 μs))

julia> benchmark_dot(Float64x4)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 1.2834969010588605e-60
(Trial(89.520 μs), Trial(300.694 μs))

julia> benchmark_dot(Float64x3)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -3.4331812375958018e-44
(Trial(36.687 μs), Trial(122.643 μs))

julia> benchmark_dot(Float64x2)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 3.0292258760486853e-28
(Trial(15.735 μs), Trial(63.286 μs))

Summary:

Type        speedup (sum)   speedup (dot)
Float64x8   4.02            3.12
Float64x7   4.20            4.64
Float64x6   4.33            4.27
Float64x5   5.41            4.26
Float64x4   5.59            3.36
Float64x3   7.09            3.34
Float64x2   7.09            4.02

So, not entirely the 8x speedup I hoped for; I think the issue here is that inlining doesn't work properly.

Maybe we can inline code by hand?


There was still some issue with bounds checks; after fixing it I'm getting:

Type        speedup (sum)   speedup (dot)
Float64x8   4.33            3.19
Float64x7   4.46            4.76
Float64x6   5.12            4.39
Float64x5   5.83            4.48
Float64x4   5.71            3.50
Float64x3   7.11            3.58
Float64x2   7.00            4.32

haampie commented 3 years ago

OK, it seems like there are no real inlining problems after all. The issue is the codegen for the transform from

NTuple{N,MultiFloat{Float64,M}} -> MultiFloat{Vec{N,Float64},M}

It's pretty much a matrix transpose in the registers; I've written that kernel before: https://github.com/haampie/FastTranspose.jl/blob/master/src/FastTranspose.jl#L34-L41
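
For reference, a scalar sketch of that rearrangement (the linked kernel does the same thing with in-register shuffles instead of scalar extraction):

# Take M elements of N limbs each and produce N groups of M limbs,
# i.e. exactly the inputs that MultiFloatOfVec feeds to Vec(...).
function transpose_limbs(xs::NTuple{M,NTuple{N,Float64}}) where {M,N}
    ntuple(i -> ntuple(j -> xs[j][i], M), N)
end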

haampie commented 3 years ago

I've figured out the problem with dot: it's the Julia codegen, which creates a +(a, ...) expression with too many args.

With an improved transpose kernel I'm getting a 5.40x speedup for dot and a 5.66x speedup for sum for Float64x8 on AVX512 :)

The resulting assembly is quite pretty for dot: https://gist.github.com/haampie/e2e52718208ee7997fc4a34a4649af97
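
For reference, a minimal sketch of keeping the horizontal reduction binary instead of splatting all lanes into one wide +(a, ...) call (just one possible workaround, not necessarily the change that produced the speedup above):

horizontal_sum(t::MultiFloat{Vec{M,T},N}) where {M,T,N} =
    reduce(+, TupleOfMultiFloat(t))   # folds the M lane sums pairwise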

haampie commented 3 years ago

I've improved the register-transpose kernel for general input sizes (for instance Float64x7 requires a 7x8 transpose on AVX-512), and I'm getting the following results:

Type        speedup (sum)   speedup (dot)
Float64x8   5.65            5.38
Float64x7   5.81            5.32
Float64x6   5.79            5.18
Float64x5   5.79            5.13
Float64x4   5.77            5.45
Float64x3   7.17            3.58
Float64x2   7.04            5.07

I'm not sure this can really be improved much further.

For reference, the struct-of-arrays version, which doesn't require any shuffling, has the following speedups vs. scalar:

Type        speedup (sum)   speedup (dot)
Float64x8   5.77            5.45
Float64x7   5.80            5.50
Float64x6   5.75            5.78
Float64x5   5.77            5.93
Float64x4   5.78            5.81
Float64x3   7.04            5.61
Float64x2   7.12            6.76

haampie commented 3 years ago

Another result on AVX-512:

julia> @benchmark LinearAlgebra.qrfactUnblocked!(mat) setup=(mat=StructArray(rand(Float64x2, 400, 400)))
BenchmarkTools.Trial: 
  memory estimate:  168.47 KiB
  allocs estimate:  2395
  --------------
  minimum time:     81.934 ms (0.00% GC)
  median time:      84.113 ms (0.00% GC)
  mean time:        83.887 ms (0.00% GC)
  maximum time:     86.996 ms (0.00% GC)
  --------------
  samples:          54
  evals/sample:     1

julia> @benchmark LinearAlgebra.qrfactUnblocked!(mat) setup=(mat=rand(Double64, 400, 400))
BenchmarkTools.Trial: 
  memory estimate:  6.38 KiB
  allocs estimate:  1
  --------------
  minimum time:     627.563 ms (0.00% GC)
  median time:      628.127 ms (0.00% GC)
  mean time:        628.101 ms (0.00% GC)
  maximum time:     628.408 ms (0.00% GC)
  --------------
  samples:          8
  evals/sample:     1

This still requires some hand-vectorization of axpy! and dot/norm, but basically I think the best strategy is to make people use StructArrays for arrays of MultiFloats.
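
For instance, a hand-vectorized axpy! along the lines of the dot/sum kernels above could look like this (a sketch built on the MultiFloatOfVec / TupleOfMultiFloat helpers from earlier; tail elements when the length isn't a multiple of M are not handled):

function vectorized_axpy!(a::MultiFloat{T,N}, xs, ys) where {T,N}
    M = pick_vector_width(T)
    av = MultiFloatOfVec(ntuple(_ -> a, M))    # broadcast the coefficient to all lanes
    @inbounds for i in 1:M:length(xs)-M+1
        x = MultiFloatOfVec(ntuple(k -> xs[i + k - 1], M))
        y = MultiFloatOfVec(ntuple(k -> ys[i + k - 1], M))
        r = TupleOfMultiFloat(y + av * x)
        for k in 1:M
            ys[i + k - 1] = r[k]
        end
    end
    return ys
end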

dzhang314 commented 3 years ago

Hey @haampie , this is fantastic work! I am especially impressed with your fast zmm transpose kernel; I've been trying to think of a good way to do the MultiFloat{Vec{K,Float64},N} <-> NTuple{K,MultiFloat{Float64,N}} transformation for some time. It's amazing that the AoS versions of sum and dot you put together are nearly as fast as (and in some cases, even faster than) the SoA versions. Your performance insights are also much appreciated, as I do not have access to an AVX-512 capable CPU.

Here is what I am going to do. Starting tonight, with the imminent release of MultiFloats.jl v1.0, I am simply going to drop the T <: AbstractFloat type constraint everywhere it occurs. Even though this might wrongly suggest the possibility of, say, MultiFloat{Int, N} or MultiFloat{Complex, N} to new users of the library, you have put forward an overwhelming case that MultiFloat{Vec{K, T}, N} provides serious benefits, even for non-Vec-aware code. Considering that there may be other such types (I believe SIMD.jl has its own Vec type which is not even a subclass of Real), it doesn't make sense for me to insist on an impractical type hierarchy.

Moving forward, I would also like MultiFloats.jl to provide explicitly vectorized implementations for sum(::Vector{MultiFloat}) and dot(::Vector{MultiFloat}). I hope you will not mind if I follow your example for these and other simple kernels.

By the way, I had a look at your patch to VectorizationBase, and I believe it should be safe to use MultiFloats.jl with afn, arcp, and nsz; it's only reassoc that I crucially depend on being disabled. It's a shame that VectorizationBase doesn't provide a mechanism for selectively disabling only reassoc.

haampie commented 3 years ago

Just a few comments: it would be interesting to look into what's necessary to make LLVM autovectorize reductions, because that would be much more generic than implementing a few vectorized routines. This is a good starting point: https://docs.julialang.org/en/v1/devdocs/llvm/#Passing-options-to-LLVM

Another idea I had is to add a simple "struct of arrays" array type to this package: basically a permuted array where the limbs are stored in the last dimension. So, for instance, an m x n matrix of Float64xN is represented as a 3-dimensional array of dimensions (m, n, N).

A minimal working example:

struct MFArray{T,M,N,TA} <: AbstractArray{MultiFloat{T,M},N}
    A::TA
end

import Base: size, getindex, setindex!, view, IndexStyle
using Base: OneTo
using Base.Cartesian: @ntuple, @nexprs
export MFArray

# The logical size drops the trailing limb dimension of the backing array.
Base.size(A::MFArray) = reverse(Base.tail(reverse(size(A.A))))

Base.IndexStyle(x::MFArray) = IndexStyle(x.A)

@generated function Base.getindex(A::MFArray{T,M,N}, i::Int) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        dist = length(A)
        MultiFloat{T,M}(@ntuple $M j -> A.A[i + (j - 1) * dist])
    end
end

@generated function Base.setindex!(A::MFArray{T,M,N}, v::MultiFloat{T,M}, i::Int) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        dist = length(A)
        @nexprs $M j -> A.A[i + (j-1) * dist] = v._limbs[j]
        return v
    end
end

@generated function Base.getindex(A::MFArray{T,M,N}, I::Vararg{Int,N}) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        MultiFloat{T,M}(@ntuple $M j -> A.A[I..., j])
    end
end

@generated function Base.setindex!(A::MFArray{T,M,N}, v::MultiFloat{T,M}, I::Vararg{Int,N}) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        @nexprs $M j -> A.A[I..., j] = v._limbs[j]
        return v
    end
end

function MFArray(A::AbstractArray{MultiFloat{T,M},N}) where {T,M,N}
    # Reinterpret to expose the limbs, then permute so the limb index becomes the trailing dimension.
    A′ = permutedims(reshape(reinterpret(T, A), M, size(A)...), circshift(OneTo(N+1), -1))
    return MFArray{T,M,N,typeof(A′)}(A′)
end

@inline function view(A::MFArray{T,M,N}, I::Vararg{Any,N}) where {T,M,N}
    B = view(A.A, I..., :)
    return MFArray{T,M,ndims(B)-1,typeof(B)}(B)
end

results in the following:

julia> function simple_muladd(C, A, B)
         @inbounds @simd ivdep for i ∈ eachindex(A)
           C[i] += A[i] * B[i]
         end
         return nothing
       end
simple_muladd (generic function with 1 method)

julia> @benchmark simple_muladd(C, A, B) setup=(A=MFArray(fill(Float64x4(1.0), 100)); B=MFArray(fill(Float64x4(1.0), 100)); C=MFArray(fill(Float64x4(1.0), 100)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     654.309 ns (0.00% GC)
  median time:      673.235 ns (0.00% GC)
  mean time:        673.953 ns (0.00% GC)
  maximum time:     811.642 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     162

julia> @benchmark simple_muladd(C, A, B) setup=(A=fill(Float64x4(1.0), 100); B=fill(Float64x4(1.0), 100); C=fill(Float64x4(1.0), 100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     859.617 ns (0.00% GC)
  median time:      878.817 ns (0.00% GC)
  mean time:        880.429 ns (0.00% GC)
  maximum time:     1.439 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     60

The problem, however, is that it requires @simd ivdep, probably because the compiler cannot prove the pointers don't alias.

haampie commented 3 years ago

Using an array like the one I suggested above does have the downside that it sometimes runs into cache thrashing.

For instance, I have some code that does reduction to banded form (the first step in a parallel Schur decomposition), and it's slow at 512x512 compared to 513x513:

julia> A = MFArray(rand(Float64x4, 512, 512)); @time to_band!(A, p);
  3.015606 seconds (2 allocations: 352 bytes)

julia> A = MFArray(rand(Float64x4, 513, 513)); @time to_band!(A, p);
  1.789323 seconds (2 allocations: 352 bytes)

because the loads of A[i, j, 1], A[i, j, 2], A[i, j, 3], and A[i, j, 4] all land in the same cache set. Without MFArray it looks like this:

julia> A = rand(Float64x4, 512, 512); @time to_band!(A, p);
  2.249070 seconds (1 allocation: 336 bytes)

julia> A = rand(Float64x4, 513, 513); @time to_band!(A, p);
  2.052841 seconds (1 allocation: 336 bytes)

There's some slowdown here too (I guess when iterating over columns) but it's not as dramatic as above.
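
A quick arithmetic check of the cache-set explanation (assuming a typical 32 KiB, 8-way L1 with 64-byte lines, so addresses 4096 bytes apart map to the same set):

limb_stride_bytes(m, n) = m * n * sizeof(Float64)   # distance between limbs of A[i, j]
limb_stride_bytes(512, 512) % 4096                  # 0 -> all limbs collide in one cache set
limb_stride_bytes(513, 513) % 4096                  # 8 -> limbs spread across different sets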

dzhang314 commented 3 years ago

As a matter of fact, I actually have a similar struct-of-arrays construct that I started testing privately a long time ago. However, I hesitate to release an array-like construct as part of the public MultiFloats.jl API, since I have no idea how the Julia AbstractArray type hierarchy works. Do you have a good understanding of what needs to be done in order to make an array-like type work with eachindex, iterate, CartesianIndices, etc.?

haampie commented 3 years ago

In principle size, getindex, and setindex! are sufficient. It's slightly more efficient to also implement IndexStyle to get linear indices whenever possible, like I did above (so for plain arrays you get linear indexing, and for views you get Cartesian indexing).

The only remaining function to implement is similar, which should ensure that e.g. A[1:3, 1:3] also returns an MFArray instead of an Array{MultiFloat{..}}, as it does right now. When that's in, it should be good to go.
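
Something along these lines should do it (an untested sketch; the backing array just gets one extra trailing dimension for the limbs):

function Base.similar(A::MFArray{T,M}, ::Type{MultiFloat{T,M}}, dims::Dims{N}) where {T,M,N}
    B = similar(A.A, T, (dims..., M))
    return MFArray{T,M,N,typeof(B)}(B)
end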

Btw, I'm trying to write a microkernel for tall-and-skinny QR. What I found is that for Q * A it's useful to store Q as an Array{Float64xN} and A with its dimensions permuted (matrix transpose plus limbs in the 2nd dimension), so I guess an MFArray like the one above isn't the answer to everything.

dzhang314 commented 1 week ago

Closing this old thread since explicit vectorization is now supported in MultiFloats.jl v2.0. I have, for the time being, decided against introducing my own MultiFloatArray type and instead opted for mfvgather and mfvscatter operations on standard Array{MultiFloat} types.