ghost opened 8 years ago
Hello. Is there support for atomics and SIMD in the Crystal language?
Hi. Not yet, but it's very likely we'll have these in the future
Changed title because Atomics are already implemented
Is anyone currently working on this? I would be happy to help if I can. Could you tell me what probably needs to be done and how complex it would be?
SIMD is supported at the optimization level (usually in loops) by LLVM natively; for explicit "I want to guarantee this will execute the exact SIMD calls I specify" support, it will have to be supported in the codegen and in the language. LLVM represents this using vector types, and support for these will have to be added to Crystal for basic mathematical operations on multiple values at once.
For more specialized architecture-specific SIMD instructions, you can create a shard wrapping inline assembler calls.
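For illustration, a rough and untested sketch of what such a wrapper could look like with Crystal's inline asm, here for SSE2's PADDD (the method name and the StaticArray-based signature are my own choices, not an existing shard's API):

# Adds four Int32s at once using one SSE2 instruction (x86-64 only).
def paddd(a : StaticArray(Int32, 4), b : StaticArray(Int32, 4)) : StaticArray(Int32, 4)
  result = uninitialized StaticArray(Int32, 4)
  # Unaligned loads into XMM registers, packed 32-bit add, store back.
  asm("movdqu ($1), %xmm0
       movdqu ($2), %xmm1
       paddd %xmm1, %xmm0
       movdqu %xmm0, ($0)"
      :: "r"(pointerof(result)), "r"(pointerof(a)), "r"(pointerof(b))
      : "xmm0", "xmm1", "memory" : "volatile")
  result
end

paddd(StaticArray[1, 2, 3, 4], StaticArray[10, 20, 30, 40]) # => [11, 22, 33, 44]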
Thanks for the quick explanation! I'm pretty sure adding SIMD types to the language shouldn't be that hard, because their implementation is probably very similar to that of the existing primitives (like Int32 or Float64). In fact, basic operations such as add work exactly the same way for vector and scalar types in LLVM.
I have no clue about assembly, but I don't think it's a good idea to write native code for a language based on LLVM (which apparently has proper SIMD support).
After experimenting a bit I realized that LLVM is actually very smart when it comes to SIMD. For example, I can create two vectors of 32 32-bit integers and add them together like this:
%vp1 = alloca <32 x i32>
%v1 = load <32 x i32>, <32 x i32>* %vp1
%vp2 = alloca <32 x i32>
%v2 = load <32 x i32>, <32 x i32>* %vp2
%sum = add <32 x i32> %v1, %v2
Of course, such a big vector (1024 bits) can't fit in any single register. LLVM apparently splits it across multiple registers:
vmovdqa 256(%rsp), %ymm0
vmovdqa 288(%rsp), %ymm1
vmovdqa 320(%rsp), %ymm2
vmovdqa 352(%rsp), %ymm3
vpaddd 160(%rsp), %ymm1, %ymm1
vpaddd 192(%rsp), %ymm2, %ymm2
vpaddd 224(%rsp), %ymm3, %ymm3
vpaddd 128(%rsp), %ymm0, %ymm0
Please don't ask me how any of this works in detail; I just see several different AVX registers being used one after the other. The size of the vector doesn't even have to be a power of two. We probably don't need to do any complicated checks, because LLVM handles pretty much anything for us.
If you want, you can try experimenting with this. The implementation seems similar to the one for StaticArray, but instead of an LLVM array you must use an LLVM vector. Searching for StaticArray in the compiler's source code could give you a path for doing it. It also seems like something fun to experiment with.
I actually think vectors could be implemented mostly the same way as integers/floats/booleans. I mean, most IR instructions which work for scalars also work for vectors. For example, you can icmp two integer vectors and get a boolean vector of the same size containing the results of the individual comparisons. Arithmetic operations and casts work the same way, too.
Quick update: I wrote a very early implementation of a Vector(T, N) type (https://github.com/malte-v/crystal/tree/simd-support). Currently you can only get and set the individual elements, but it does use the SIMD registers. I'll add more functionality soon.
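If it follows StaticArray's conventions, usage presumably looks something like this (hypothetical; I haven't confirmed the exact API of the branch):

v = uninitialized Vector(Int32, 4)
4.times { |i| v[i] = i * 10 }   # insertelement under the hood
v[2]                            # extractelement => 20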
I'm really not sure how the "front end" of the Vector type should look. Currently it's like StaticArray, where you specify the element type and the number of elements in the type arguments. However, this doesn't work well with the semantics a Vector should have.
This is because the arithmetic you can do with a vector depends on what its element type is. For example, you may want to AND two integer vectors, but that wouldn't make any sense for a vector of floats. It would be great to be able to do "generic specializations", as discussed in #3298.
A workaround I used at first was to define macros like
macro check_int
  {% raise "Method #{@def.name}#{@def.args} is only supported for vectors of integers" unless [Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128].includes? T %}
end
and then call them in methods that should only be available to some vectors. But this doesn't work with @[Primitive(...)] methods, because these can't have a body.
Any ideas how to overcome this? I personally would like to keep the generic Vector(T, N) type because it seems logical and natural to me.
Maybe make the primitive methods private and add the check for public methods?
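Sketching that suggestion out (the :vector_and primitive name is made up for illustration):

struct Vector(T, N)
  # Primitive methods have an empty body; the compiler supplies the implementation.
  @[Primitive(:vector_and)]
  private def unsafe_and(other : Vector(T, N)) : Vector(T, N)
  end

  # The public method performs the element-type check, then delegates.
  def &(other : Vector(T, N)) : Vector(T, N)
    {% raise "#& is only supported for vectors of integers" unless [Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128].includes? T %}
    unsafe_and(other)
  end
end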
@asterite Thanks, weird that I didn't come up with this myself.
What about specialized structs like Int32::Vector(N) and Float64::Vector(N)? With simple macro constructors like:
Int32.vector(1, 2, 3) # => Int32::Vector(3)
It would avoid the issue of manually raising (let the compiler do its job) and allow a nicer API for int- or float-specific calls, for example Int32::Vector(N)#check instead of Vector(T, N)#check_int.
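For what it's worth, such a constructor macro could be as small as this (a hypothetical sketch; it assumes an Int32::Vector(N) struct whose .new takes the elements):

struct Int32
  # Class-level macro, invoked as Int32.vector(1, 2, 3).
  macro vector(*elements)
    Int32::Vector({{ elements.size }}).new({{ elements.splat }})
  end
end

# Int32.vector(1, 2, 3) then expands to Int32::Vector(3).new(1, 2, 3)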
@ysbaddaden I really like that approach. It lets us omit all the "hacky" parts of the aforementioned implementation while also emphasizing the element type. I'll change the implementation to this.
I need some more help with the implementation. I'm trying to add the vector structs to the built-in types in program.cr. The key part is:
types["Bool::Vector"] = vec_bool = @vec_bool = BoolVectorType.new self, bool, "Vector", value, ["N"], bool
vec_bool.struct = true
vec_bool.can_be_stored = false
The same goes for the integer and float types. The arguments to the BoolVectorType constructor are the program, the namespace, the type name, the base class, the generic type arguments, and the element type (which is the same as the namespace here). From my understanding this should match:
# <top level>
struct SomeVectorElementType
struct Vector(N)
...
end
end
But this is not the case. Unless I'm explicitly requiring vectors.cr, I get undefined constant Int32::Vector. How should I add these sub-types to the Program?
You need to do:
types["Bool"].types["Vector"] = vec_bool = ...
Types contain other types. Like in code.
Or just bool.types["Vector"] = ... if bool was defined in a previous line.
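Putting that hint together with the earlier snippet, the registration would look roughly like this (a sketch reusing the constructor call from above):

# bool is already in scope; it is passed to the constructor anyway.
vec_bool = @vec_bool = BoolVectorType.new self, bool, "Vector", value, ["N"], bool
vec_bool.struct = true
vec_bool.can_be_stored = false
bool.types["Vector"] = vec_bool # register as a sub-type of Bool, not at the top level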
Though namespacing things like this makes little sense. Probably IntVector, BoolVector, FloatVector is better.
In any case, without a concrete use case I don't know why we'd like to eventually merge this.
Thanks, works fine now.
I'd use vectors for CG (computer graphics), but SIMD is mostly useful when performing the same arithmetic on lots of data over and over again. Of course you can cross your fingers and hope the optimizer does its job, but I like to be sure my code is properly optimized. Either way, I think vectors are nice to have just for the sake of convenience (no need to loop over arrays).
SIMD is useful whenever there are operations to repeat over a set of values. Instead of manually operating on each value, you can use a single instruction to operate on many at once, supposedly speeding things up (at least since SSE).
It can be useful in geographical operations (e.g. k-means over 2D or 3D points), projecting polygons on a map, matrix transformations, applying transformations to RGBA pixels... It can also be used in PRNGs and digest calculations; see for example https://github.com/lemire/simdpcg
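To connect this with the auto-vectorization mentioned at the top of the thread: a loop like the following (an illustrative sketch; brighten! is a made-up name) is exactly the shape LLVM can already vectorize on its own in a --release build, without any explicit vector types:

# Brighten every channel byte of an RGBA buffer.
# &+ is Crystal's wrapping add, which avoids overflow checks and
# keeps the loop body simple enough for the auto-vectorizer.
def brighten!(pixels : Slice(UInt8), amount : UInt8) : Nil
  pixels.map! { |channel| channel &+ amount }
end

Explicit vector types would be for the cases where you want a guarantee instead of a hope.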
FYI, vector arithmetic and conversions are now working fine. I was able to reuse all the codegen logic from the scalar operations. However, boolean vectors don't work at all; every element gets printed as false. Still figuring out what's wrong there.
I'm not sure how the ElementType::Vector(N) architecture would work with generic types using Vector. For example:
class Matrix(T, N, M) # T = element type; N = no. of rows; M = no. of columns
@rows : T::Vector(M)[N] # undefined constant T::Vector
@rows : Vector(T, M)[N] # works
end
I'd like to hear some thoughts on https://github.com/crystal-lang/crystal/issues/3298#issuecomment-500363243
@malte-v Given that Int::Vector and Float::Vector have different operations, I can imagine there being Int::Matrix and Float::Matrix for the same reason...
@asterite I don't think having 12 different matrix classes makes sense because no one would ever AND two matrices or use some other int-specific operation. Using macro loops here will produce duplicate code.
edit: Wording
I am curious, what is the status of this?
I wrote a Vector struct to use in raylib once, and I was inspired by how Crystal's stdlib for Complex numbers reopens the Number class to allow writing complex numbers like 2 + 1.i.
So I wrote mine so that you could write something like 1.x + 1.y for Vector[1, 1], similar to how you would use basis vectors in math. Here is a sample with the class and some operations for up to 4-dimensional vectors:
https://play.crystal-lang.org/#/r/ehgt
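In case the playground link rots, the trick condenses to roughly this (a sketch of the idea, not the exact playground code):

# Mirroring how the stdlib's complex.cr adds Number#i, a shard can
# add basis-vector helpers onto Number:
record Vec2, x : Float64, y : Float64 do
  def +(other : Vec2)
    Vec2.new(x + other.x, y + other.y)
  end
end

struct Number
  def x
    Vec2.new(to_f64, 0.0)
  end

  def y
    Vec2.new(0.0, to_f64)
  end
end

puts 1.x + 1.y # => Vec2(@x=1.0, @y=1.0)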
I have been working on this myself, to the point where I could port lemire/fastvalidate-utf-8 to Crystal natively (note: simdutf is the production-ready choice that supports way more CPUs). You could take a look at the proof-of-concept branch here: https://github.com/HertzDevil/crystal/tree/feature/vector
A brief overview (a usage sketch follows the benchmark numbers below):
- SIMD::IntVector(T, N), SIMD::FloatVector(T, N), SIMD::BoolVector(T, N), and SIMD::PointerVector(T, N) are the actual vector types and correspond directly to the underlying LLVM vector types. T is the element type, which must be Bool for BoolVector, for consistency. N is a positive integer, but otherwise there are no restrictions. LLVM may or may not break on any given platform where a vector type doesn't correspond to an actual register, e.g. SIMD::FloatVector(Float32, 16) on x86-64 without AVX512, but if a correspondence exists, that vector type is guaranteed to abide by the ABI.
- SIMD::Vector(T, N) is the common module shared by those vector types. Following the recent work on ReferenceStorage, none of these type names are hardcoded in the compiler.
- .literal is the constructor for vectors. Arguments must have identical types, no autocasting happens, and generic type arguments cannot be supplied yet. If all arguments are literals, the resulting value is also a single LLVM vector literal. (Fun fact: the implementation for Slice.literal originated from this.)
- #unsafe_fetch, #unsafe_copy_with, and #unsafe_shuffle correspond to LLVM's extractelement, insertelement, and shufflevector instructions. For shufflevector the mask must be a constant literal; non-constant masks are only possible with platform-specific intrinsics, such as this proof-of-concept's Intrinsics::X86.mm_shuffle_epi8.
- .cast performs bitcasts between vector types with the same total bit size, e.g. from SIMD::IntVector(Int8, 32) to SIMD::FloatVector(Float32, 8) or SIMD::BoolVector(Bool, 256). This is similar to #unsafe_as, but no pointers are involved because the cast may take place entirely inside a register. Bitcasts between pointer and non-pointer vectors are deliberately left undefined because pointers do not have a platform-independent fixed size.
- .unaligned_load and #unaligned_store are analogs of Pointer(T)#value and #value=, except the former pair assumes byte alignment, whereas the latter pair follows that of T. They are necessary since e.g. x86-64's MOVAPS segfaults on misaligned addresses, but they are not exclusive to vectors per se; any overaligned type will benefit from this.
- Operations between two vectors currently use the vector_zip primitive, which requires the operand types to be identical. Many can be made to use binary instead with some effort; others map to LLVM intrinsics.
- The vector_reduce primitive is for reduction intrinsics defined by LLVM, like BoolVector#all?.
- samples/simd/intrinsics/** define the Intel intrinsics in terms of the vector types. They are a mixture of vector operations, target-specific LLVM intrinsics, and Clang compiler built-ins, all ported from Clang's x86intrin.h family. Most notably, none of them require inline assembly. Although not strictly necessary, they facilitate porting SIMD code in C that invokes the Intel intrinsics directly (which is the majority of them).
- SIMD extensions are currently detected via compile-time flags like x86_has_sse41 and x86_has_avx2. #14524 is supposed to take care of this in the compiler itself.
- SIMD::Vector deliberately does not include Indexable, even though vectors are seemingly compatible with immutable StaticArrays. If your vectors are used like that, chances are the right tools are not being used for the right tasks, and your code won't benefit from data parallelism.
- Still missing: #== between vectors (this requires changing vector_zip to binary as mentioned above), overflow-checked arithmetic, correctly aligned allocations, C ABI support.
- samples/simd/utf8_valid_encoding.cr is the actual benchmark code for different implementations of String#valid_encoding?, parsing 1000 random 65536-byte valid UTF-8 strings. Run it with samples/simd/run_benchmarks.sh on an x86-64 machine with AVX2 support to see the speedup in action:
Baseline:
old 4.46 (224.14ms) (± 0.17%) 0.0B/op 8.25× slower
scalar 36.79 ( 27.18ms) (± 0.12%) 0.0B/op fastest
x86-64-v2:
old 4.51 (221.85ms) (± 0.18%) 0.0B/op 18.23× slower
scalar 36.79 ( 27.18ms) (± 0.14%) 0.0B/op 2.23× slower
sse4.1 82.16 ( 12.17ms) (± 1.44%) 0.0B/op fastest
x86-64-v3:
old 4.35 (229.86ms) (± 0.55%) 0.0B/op 26.51× slower
scalar 56.95 ( 17.56ms) (± 0.27%) 0.0B/op 2.03× slower
sse4.1 83.78 ( 11.94ms) (± 0.58%) 0.0B/op 1.38× slower
avx2 115.33 ( 8.67ms) (± 1.40%) 0.0B/op fastest
scalar is the stdlib implementation since #12145, old is the one before. The current implementation actually relies on the BMI2 extension from x86-64-v3 for the best performance. The AVX2 version in turn is 2-3 times as fast as that. (This is insufficient for #11873, as we most likely also want a faster #each_char.)
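For a taste of how this API reads, here is a hand-written sketch based on the names above (untested against the branch; the &+ wrapping-add operator and the exact receiver of .cast are my assumptions):

v1 = SIMD::IntVector(Int32, 4).literal(1, 2, 3, 4)
v2 = SIMD::IntVector(Int32, 4).literal(10, 20, 30, 40)
sum = v1 &+ v2                          # a single LLVM add on <4 x i32>
sum.unsafe_fetch(0)                     # extractelement => 11
SIMD::FloatVector(Float32, 4).cast(sum) # bitcast; same total bit size (128 bits)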
I would like to hear some initial feedback first; if we agree this is the general direction we want to take, I'll submit an RFC later and then we could sort out all the details.
I love it, awesome work @HertzDevil :heart_hands:
Starting an RFC sounds like a good idea.