cogciprocate / ocl

OpenCL for Rust

Where are the vector types? #13

Closed bfops closed 8 years ago

bfops commented 8 years ago

E.g. cl_float3. Where are they? It kind of looks like they're not defined, which is problematic.

c0gent commented 8 years ago

There are currently no 'special' vector types on the host side as there are within kernels.

EDIT: For future readers: This hasty response is flawed for this particular example; in OpenCL a float3 is actually a float4. I should have used float4 or int8 as an example type but my brain only really works occasionally.

let a_float3: [f32; 3] = [0.0, 1.0, 2.0]; is how you would declare an individual float3 on the host. This would be used when passing an individual vector as a kernel parameter.

For vectors of vectors (multi-dimensional arrays) on the host, simply create one large vector of the appropriate type and size (ex: let some_float3s: Vec<f32> = Vec::with_capacity(data_size * 3);).

Does this answer your question or have I missed your meaning?

bfops commented 8 years ago

There are no 'special' vector types on the host side as there are within kernels.

This builds and runs fine on my machine:

#include <stdio.h>
#include <OpenCL/opencl.h>

int main() {
  cl_float4 x = {1, 2, 3, 4};
  cl_float3 y = {5, 6, 7};
  cl_float2 z = {8, 9};
  /* sizeof yields a size_t, so it is printed with %zu rather than %u */
  printf("sizeof(cl_float4): %zu\n", sizeof(cl_float4));
  printf("sizeof(cl_float3): %zu\n", sizeof(cl_float3));
  printf("sizeof(cl_float2): %zu\n", sizeof(cl_float2));
  x.w = 11;
  y.z = 314;
  return 0;
}

cl_platform.h definitely makes an effort to define these types, at least in some circumstances. They're also mentioned in the vector data types docs:

The built-in vector data types are also declared as appropriate types in the OpenCL API (and header files) that can be used by an application.


For vectors of vectors (multi-dimensional arrays) on the host, simply create one large vector of the appropriate type and size (ex: let some_float3s: Vec<f32> = Vec::with_capacity(data_size * 3);).

This is one of the reasons why the cl_float3 type is important on the host side: because it's not necessarily defined as [f32; 3]. On my machine at least, cl_float3 is typedef'd to cl_float4 on the host side, and float3s do actually take up four floats, not three. Finding that out was.. painful, and it's the same time I found out that I should really be using the vector data types on the host side. The output from the program above is:

sizeof(cl_float4): 16
sizeof(cl_float3): 16
sizeof(cl_float2): 8

c0gent commented 8 years ago

Yes good point. I've always managed vector types manually but you're right, for vector types that don't have power of two sizes (such as float3) it is confusing to work with for the newcomer, especially when someone gives a hastily oversimplified and inaccurate example as I did above.

I agree, having vector data types on the host will simplify their use. I will begin porting over and integrating cl_platform.h as soon as time permits.

For now, when manually managing vector types on the host, just pad vector types to the next power of two size (vec3 -> vec4, vec5 -> vec8, etc.) just like in OpenGL.
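As a rough illustration of that manual padding, here is a minimal sketch that flattens 3-component vectors into one Vec<f32> with a stride of four floats. The function name pad_float3s is made up for this example and is not part of ocl.

```rust
/// Flattens 3-component vectors into one `Vec<f32>`, padding each to 4 floats.
/// `pad_float3s` is a hypothetical helper, not an ocl function.
fn pad_float3s(points: &[[f32; 3]]) -> Vec<f32> {
    let mut flat = Vec::with_capacity(points.len() * 4);
    for p in points {
        flat.extend_from_slice(p);
        flat.push(0.0); // padding lane so each vector occupies 16 bytes
    }
    flat
}

fn main() {
    let points = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
    let flat = pad_float3s(&points);
    // Two vectors, four floats each.
    assert_eq!(flat.len(), 8);
    println!("{:?}", flat);
}
```

The capacity is `len * 4` rather than `len * 3`, which is the difference the sizeof output earlier in the thread points at.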

Just out of curiosity, are there situations outside of dealing with image pixel color elements where these irregular vector types are useful on the host? What other kinds of things do you tend to use them for in OpenCL?

bfops commented 8 years ago

I agree, having vector data types on the host will simplify their use. I will begin porting over and integrating cl_platform.h as soon as time permits.

Awesome, thanks! I'm using luqmana/rust-opencl right now, but once this is done, I'll definitely make the switch over (plus this repo is actually maintained).

For now, when manually managing vector types on the host, just pad vector types to the next power of two size (vec3 -> vec4, vec5 -> vec8, etc.) just like in OpenGL.

Yeah, I just wrote my own cl_float3 type. Although FYI you also need to enforce OpenCL's alignment constraints! (again, suuuuuper fun to diagnose that). This is how I wrote it.

Just out of curiosity, are there situations outside of dealing with image pixel color elements where these irregular vector types are useful on the host? What other kinds of things do you tend to use them for in OpenCL?

I'm building a toy raytracer, so it's come up for colors/intensities, but also 3D points & vectors.

c0gent commented 8 years ago

@bfops I've initially implemented vector types. They are named in the idiomatic Rust style (i.e. cl_float4 -> ClFloat4, cl_uint8 -> ClUint8, etc.) and are available in the ocl::aliases module for now (pending rename... possibly to just ocl::types).

Would you be willing to help me test this a little bit more? Perhaps just to let me know if it works for you and what you think before I publish it to crates.io? If so, just set your Cargo.toml to use this github repo as a dependency source and let me know.

EDIT: I've dealt with alignment the same way the standard OpenCL headers do, by simply aliasing float4 as a float3. Was there any particular drawback to this solution that prompted you to manually align it using repr the way you did?

Obviously using a vector type within a 'parent' struct will require alignment if passed to the kernel whole but doing this is not an idiomatic or robust way to pass data to OpenCL anyway. Because of this I don't think it's something I want the library to have to manage, leaving it to the consumer. I'm willing to be convinced otherwise.

EDIT2: I went ahead and burned the rest of the afternoon messing around with this. I forgot that type aliases don't work with tuple struct constructors and it would be weird to have a useless 4th parameter anyway so I went ahead and implemented a ::new() for all of the vec3 types. I then made the fourth member private so that the tuple struct constructor can not be used. I don't like this but it works.

I will, of course, eventually add a ::new for every type but I'm still not really satisfied with the fact the constructors will be slightly inconsistent. The alternative is removing the very convenient tuple struct constructors for all of the types.
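The constructor arrangement described above might look roughly like the sketch below. The module and type names (vec_types, Float3) are placeholders chosen for this example, not the names used in ocl, and zeroing the hidden fourth lane is an assumption.

```rust
mod vec_types {
    /// Hypothetical 3-component vector stored in four floats.
    /// The fourth field is private, so code outside this module
    /// cannot use the tuple struct constructor directly.
    #[repr(C)]
    #[derive(Debug, Clone, Copy, PartialEq, Default)]
    pub struct Float3(pub f32, pub f32, pub f32, f32);

    impl Float3 {
        /// Builds the vector and fills the hidden fourth lane with zero
        /// (an assumption made for this sketch).
        pub fn new(x: f32, y: f32, z: f32) -> Float3 {
            Float3(x, y, z, 0.0)
        }
    }
}

use vec_types::Float3;

fn main() {
    let v = Float3::new(1.0, 2.0, 3.0);
    // `Float3(1.0, 2.0, 3.0, 0.0)` would not compile here: field 3 is private.
    assert_eq!(std::mem::size_of::<Float3>(), 16);
    println!("{} {} {}", v.0, v.1, v.2);
}
```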

Since you prompted me to finally add vector types to the library after convincing myself it wasn't necessary, I'm going to try to make you help design it properly, heh.

bfops commented 8 years ago

Hey, sorry for the turnaround time. I've been trying to stay on top of school. Anyway, first I need to switch my "test case" over to your version of the OpenCL bindings, so that'll add in a bit of delay!

I've dealt with alignment the same way the standard OpenCL headers do, by simply aliasing float4 as a float3. Was there any particular drawback to this solution that prompted you to manually align it using repr the way you did?

Does float4 only enforce 4-byte alignment? OpenCL requires 16-byte alignment IIRC. Edit: The OpenCL data types page says types should be aligned to multiples of their whole size, except for *3 types, which should be aligned as though they were *4. Rust treats them differently, which is why I had to use the SIMD hackery. (Honestly, as far as Rust hackery goes, that's some of the least offensive).

Obviously using a vector type within a 'parent' struct will require alignment if passed to the kernel whole but doing this is not an idiomatic or robust way to pass data to OpenCL anyway. Because of this I don't think it's something I want the library to have to manage, leaving it to the consumer. I'm willing to be convinced otherwise.

Oh, really? That's how I've been doing it, and it's been working okay. I think it's supposed to be particularly nice when using C with OpenCL, because then you can roughly copy-paste the struct definition in both files.

I went ahead and implemented a ::new() for all of the vec3 types. I then made the fourth member private so that the tuple struct constructor can not be used. I don't like this but it works.

That's roughly what I did, I don't really think there's a better way. I think there are probably constraints that the fourth element has to be zero or something, since e.g. dot of two float3s should obviously act as though there's no fourth element.

c0gent commented 8 years ago

Appreciate your feedback, take all the time you need. Get that school out of the way so you can start the real learning :)

So to the alignment / struct issue. I'll try not to go in to too much detail and I apologize if this is all stuff you already know but I just want to give you my understanding of how it works and let me know if you agree or have any other questions.

Does float4 only enforce 4-byte alignment? OpenCL requires 16-byte alignment IIRC.

You have to have each value aligned on a boundary equal to its size. For example: you don't want to have some 4-byte value (such as char4) aligned on a 2-byte boundary such as at address 0x02 or 0x06. You want that value to have a 4-byte alignment and to sit at 0x00, 0x04, 0x08, etc. Same thing for 16-byte values such as float4. You want it at 0x00 or 0x10, not at 0x08 or 0x18.

The reason for this is very simply efficiency/speed. If a memory controller can always assume this is the case, it saves a few extra steps and allows a more streamlined and simplified design. When you're loading 16 or 64, or some large number of values from memory all at the same time like GPUs do, every little step is huge.
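The address examples above can be checked with a one-line helper. This is only a sketch of the rule "address is a multiple of the value's size"; is_aligned is a name invented for this example.

```rust
/// Returns true if `addr` is a multiple of `align`.
/// Hypothetical helper used only to illustrate the rule above.
fn is_aligned(addr: usize, align: usize) -> bool {
    addr % align == 0
}

fn main() {
    // A 4-byte value such as char4 wants a 4-byte boundary.
    assert!(is_aligned(0x04, 4));
    assert!(!is_aligned(0x02, 4));
    assert!(!is_aligned(0x06, 4));
    // A 16-byte value such as float4 wants a 16-byte boundary.
    assert!(is_aligned(0x10, 16));
    assert!(!is_aligned(0x08, 16));
    assert!(!is_aligned(0x18, 16));
    println!("all address checks passed");
}
```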

Obviously using a vector type within a 'parent' struct will require alignment if passed to the kernel whole but doing this is not an idiomatic or robust way to pass data to OpenCL anyway. Because of this I don't think it's something I want the library to have to manage, leaving it to the consumer. I'm willing to be convinced otherwise.

Oh, really? That's how I've been doing it, and it's been working okay. I think it's supposed to be particularly nice when using C with OpenCL, because then you can roughly copy-paste the struct definition in both files.

Ok so there are a few reasons not to pass structs to OpenCL. The big one is performance. When data is in a struct, regardless of how it's aligned, there will be arbitrarily sized gaps between values of the same kind where the other data in the struct lives. Because of the way memory controllers on GPUs work, and the way that work items in work groups are batch processed, these gaps generally slow down memory reads, sometimes significantly. Parallel memory reads always work best if data of the same type is tightly packed, because the GPU can then choose from a variety of optimized methods of reading it. This varies from device to device obviously, and almost none of this applies to CPUs, which from what I understand are still reading from memory in a more or less serial fashion. Another important part of this picture is cache performance, which is something that affects CPUs and GPUs alike and is generally hurt by the all-in-a-struct layout due to it often loading a bunch of data it won't use right away because it's right next to data it needs.

The other reason is portability and the potential unreliability of precise struct memory layout. This may not be an issue in Rust and it's possible that modern OpenCL implementations now all handle these cases perfectly but it's difficult or impossible to guarantee that every platform on every machine will always lay the struct out in the right way and that any future changes you make to it won't disturb this very fragile condition.

Long story short... break up those structs into individual arrays. Yes, it's slightly less convenient. Yes, you'll no longer be able to copy & paste definitions. Yes, you'll have to create separate buffers for each value and separate kernel parameters for each one and, yes, there is a slightly higher cost in time and effort involved with that but... it's worth it. It's the right way to do it.
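To make "break up those structs into individual arrays" concrete, here is a minimal host-side sketch. The Ray and Rays types and their fields are invented for illustration (loosely based on the raytracer mentioned in the thread). In practice each Vec would back its own buffer and kernel parameter.

```rust
/// Hypothetical per-ray record, as it might look in an array-of-structs design.
#[derive(Debug, Clone, Copy)]
struct Ray {
    origin: [f32; 4],
    direction: [f32; 4],
    intensity: f32,
}

/// The same data split into one tightly packed array per field.
#[derive(Debug, Default)]
struct Rays {
    origins: Vec<[f32; 4]>,
    directions: Vec<[f32; 4]>,
    intensities: Vec<f32>,
}

/// Copies each field of every `Ray` into its own contiguous array.
fn to_struct_of_arrays(rays: &[Ray]) -> Rays {
    let mut out = Rays::default();
    for r in rays {
        out.origins.push(r.origin);
        out.directions.push(r.direction);
        out.intensities.push(r.intensity);
    }
    out
}

fn main() {
    let ray = Ray { origin: [0.0; 4], direction: [0.0, 0.0, 1.0, 0.0], intensity: 0.5 };
    let soa = to_struct_of_arrays(&[ray; 3]);
    assert_eq!(soa.origins.len(), 3);
    assert_eq!(soa.intensities, vec![0.5, 0.5, 0.5]);
    println!("{:?}", soa.directions[0]);
}
```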

bfops commented 8 years ago

Re: alignment: The OpenCL data types page says types should be aligned to multiples of their whole size, except for *3 types, which should be aligned as though they were *4. Rust treats them differently, which is why I had to use the SIMD hackery. (Honestly, as far as Rust hackery goes, that's some of the least offensive).

Because of the way memory controllers on GPUs work, and the way that work items in work groups are batch processed, these gaps generally slow down memory reads, sometimes significantly.

Goooood point, I'm still unlearning all the things that aren't true in GPU land. I'm porting my raytracer to CPU anyway, because it's not really well suited to OpenCL, but next time I'll go the struct-of-arrays route.

Switching over my test case right now!

c0gent commented 8 years ago

Re: alignment: The OpenCL data types page says types should be aligned to multiples of their whole size, except for *3 types, which should be aligned as though they were *4. Rust treats them differently, which is why I had to use the SIMD hackery. (Honestly, as far as Rust hackery goes, that's some of the least offensive).

Yeah definitely. Just to clarify, and please correct me if I'm wrong, I don't think you needed to use repr(simd) and the extra spacer field because of anything to do with differences between vec3 and vec4s. I believe that ended up working for you simply because when you put that vector within another struct you were having problems getting the fields to lay out in such a way that OpenCL would read. As I mentioned, and I'm sure you might agree now, putting those vector types in the structs is really the issue here, not anything about the types themselves.

bfops commented 8 years ago

I was/am using the repr(simd) types to enforce that both float3 and float4 are properly aligned to 16 bytes. pub struct ClFloat4(pub f32, pub f32, pub f32, pub f32); doesn't force 16-byte alignment, which can be an issue for arrays as well as structs, no? Unless arrays are required to be pointer-size-aligned?
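The alignment difference can be observed directly with std::mem::align_of. The sketch below uses #[repr(align(16))], which is stable in current Rust, in place of the repr(simd) approach mentioned in the thread. Plain4 and Aligned4 are names invented for this example, and the 4-byte result assumes a platform where f32 has 4-byte alignment.

```rust
use std::mem::{align_of, size_of};

/// Plain tuple struct shaped like the one quoted above: aligned like `f32`.
#[repr(C)]
#[derive(Clone, Copy)]
struct Plain4(f32, f32, f32, f32);

/// Hypothetical variant that requests 16-byte alignment explicitly.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Aligned4(f32, f32, f32, f32);

fn main() {
    // Same size, different alignment requirement.
    assert_eq!(size_of::<Plain4>(), 16);
    assert_eq!(size_of::<Aligned4>(), 16);
    assert_eq!(align_of::<Plain4>(), 4);
    assert_eq!(align_of::<Aligned4>(), 16);

    // A Vec of the aligned type must start on a 16-byte boundary.
    let v = vec![Aligned4(0.0, 0.0, 0.0, 0.0); 4];
    assert_eq!(v.as_ptr() as usize % 16, 0);
    println!("alignment checks passed");
}
```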

As I mentioned, and I'm sure you might agree now, putting those vector types in the structs is really the issue here, not anything about the types themselves.

It's definitely not optimal, and I might change it going forward, but I don't think it's a bug. AFAIK the behavior is entirely well-defined as long as the well-defined alignment constraints are met, and they're not respected by ClFloat3 or ClFloat4 right now.

c0gent commented 8 years ago

I was/am using the repr(simd) types to enforce that both float3 and float4 are properly aligned to 16 bytes. pub struct ClFloat4(pub f32, pub f32, pub f32, pub f32); doesn't force 16-byte alignment, which can be an issue for arrays as well as structs, no? Unless arrays are required to be pointer-size-aligned?

Well the alignment is based on the position of the beginning of the buffer (array), not the actual physical memory. In a buffer containing identical elements which are power of two sizes, they are, by definition, aligned regardless of anything else.

As I mentioned, and I'm sure you might agree now, putting those vector types in the structs is really the issue here, not anything about the types themselves.

It's definitely not optimal, and I might change it going forward, but I don't think it's a bug. AFAIK the behavior is entirely well-defined as long as the well-defined alignment constraints are met, and they're not respected by ClFloat3 or ClFloat4 right now.

Absolutely. I didn't at all mean to imply you should never use these irregular structs, nor that it's a bug. If you're careful and you consider all of the implications of doing it that way you're just fine. There are plenty of situations where the caveats I listed earlier don't really matter or apply.

bfops commented 8 years ago

Well the alignment is based on the position of the beginning of the buffer (array), not the actual physical memory. In a buffer containing identical elements which are power of two sizes, they are, by definition, aligned regardless of anything else.

For some reason I'm having the hardest time parsing this. [T] is only required to be aligned the same as T, right? In which case, arrays of ClFloat4s stand a chance of being misaligned, as well as structs.

Absolutely. I didn't at all mean to imply you should never use these irregular structs, nor that it's a bug. If you're careful and you consider all of the implications of doing it that way you're just fine. There are plenty of situations where the caveats I listed earlier don't really matter or apply.

Oops yeah sorry I not only miscommunicated, but I also got ahead of myself. First of all, yeah, you're right, the repr(simd) has nothing to do with 3-vs-4, but just with the alignment of T4 types (and consequently the T3 types).

And the part I just forgot to say, which I hope gives a little more context to the headspace I was in when I wrote my last comment: as of right now, my code using the new types is segfaulting, and I think it's because ClFloat4 doesn't have the right alignment constraints. It might be because of my sketchy transmutes (the constraints on OclPrm seem.. overzealous.. so I had to transmute my buffers to f32 buffers), but they're pretty straightforward.

c0gent commented 8 years ago

I'm enjoying this back and forth. It's helping clarify my understanding on a few of these things.

Well the alignment is based on the position of the beginning of the buffer (array), not the actual physical memory. In a buffer containing identical elements which are power of two sizes, they are, by definition, aligned regardless of anything else.

For some reason I'm having the hardest time parsing this.

First of all, I failed to mention that I was specifically referring to OpenCL buffers but forget about that for a minute. Let's take a step back. You're right, this is a complicated and confusing topic. Let me try to break this down a little.

So, the whole alignment issue matters because somewhere, some processor is grabbing a whole chunk of something at once, and processing it all at once, without breaking it up or manipulating it beforehand. If we're just doing normal, non-SIMD (aka. SISD) serial computing on the CPU, the data layout doesn't matter. The CPU can read from memory with byte level precision for every single piece of data one at a time. The data can be spaced out any damn way it pleases.

As soon as you start doing things in parallel using something like SIMD or OpenCL, things change. The individual element is now multiple elements so data must be packed together in a precise fashion and must often be in a precise place. You undoubtedly already understand all of this.

At the risk of boring you further, let me also quickly point out that certain memory layout things are handled differently depending on whether you're using a CPU or GPU as the OpenCL "device." Although there is always a "host" which runs on the CPU, the actual computing "device" can be either CPU or GPU. I'll note one or two of the differences in how CPUs and GPUs lay out memory in a moment but just keep in mind the distinction between "host" (CPU only) and "device" (either or both).

Ok so what did I mean when I said:

... the alignment is based on the position of the beginning of the buffer (array), not the actual physical memory.

In OpenCL buffers are always created so that they start in an "aligned" position. The position of index 0 is always aligned correctly, no matter how you create it. This is true on either a GPU or CPU "device". Buffer creation is simple on GPUs since the runtime manages allocation of memory there and it ensures that buffers are always created in the right spot.

It is likewise impossible to create an OpenCL buffer in a non-aligned spot on a CPU "device" because the only requirement for alignment on a CPU is word-alignment. Since any non-trivial sized array (yes I'm excluding the silly exception of weird, tiny sized arrays here) will always be allocated in a word-aligned position, it will always be a valid and aligned spot for an OpenCL buffer.

[T] is only required to be aligned the same as T, right? In which case, arrays of ClFloat4s stand a chance of being misaligned, as well as structs.

To address this question, let's consider the difference between what I'll call "global" alignment and "micro" alignment. Since we've seen that in OpenCL you can always assume that your global alignment is correct, the only other issue is the spacing of your component pieces, and potentially their component pieces.

It's easy to see how any array of scalars such as floats, ints, chars, etc. will always be tightly packed. Vector types, such as long16, uint8, or float4 will also always be tightly packed regardless of whether they are implemented as arrays or structs. Unless you specifically try to create a weirdly spaced vector type, (using Rust notation here) a [f32; 4] will always be the exact same thing as a (f32, f32, f32, f32) in memory. Any contiguous array (i.e.: Vec) of these types then, will therefore also be tightly packed.
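The tight-packing claim can be spot-checked on the host. This is only a sketch: Rust guarantees the layout of arrays but does not formally specify tuple layout, so the tuple assertion reflects what current compilers do rather than a language guarantee. element_offsets is a helper invented for this example.

```rust
use std::mem::size_of;

/// Byte offset of every element from the start of the slice.
/// Hypothetical helper used only to show that elements are contiguous.
fn element_offsets(v: &[[f32; 4]]) -> Vec<usize> {
    let base = v.as_ptr() as usize;
    v.iter()
        .map(|elem| elem as *const [f32; 4] as usize - base)
        .collect()
}

fn main() {
    // Arrays have a guaranteed layout: four f32 values, no padding.
    assert_eq!(size_of::<[f32; 4]>(), 16);
    // Not formally guaranteed for tuples, but holds in practice.
    assert_eq!(size_of::<(f32, f32, f32, f32)>(), 16);

    // Elements of a Vec<[f32; 4]> sit back to back, 16 bytes apart.
    let v: Vec<[f32; 4]> = vec![[0.0; 4]; 4];
    assert_eq!(element_offsets(&v), vec![0, 16, 32, 48]);
    println!("{:?}", element_offsets(&v));
}
```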

Ok so in every case so far we've seen that we don't need to worry about alignment. Regardless of how I create a buffer and which scalar or vector types I fill it with, it will be aligned without any effort on my part.

That leaves one last case: an array (or buffer) of irregular structs. This is where all of the alignment promises we could take for granted go out of the window. I don't think I need to go any further... I think you can see how this story ends. Irregular structs now cause us to have to worry about alignment issues in addition to the various performance problems they inevitably cause.

I'll just mention one last thing about OpenCL on CPUs. Modern OpenCL implementations on Intel or AMD platforms actually use SIMD instructions whenever possible when actually executing on the CPU. They do some cool stuff like auto-vectorizing your kernels for you when you compile them too. Sadly they can't do either of those things if your data is stored in an irregular struct, and must revert to the standard serial instructions.

I didn't mean to hammer on arrays of structs or anything, heh... I just couldn't help myself I guess. Hope that clarifies some stuff though.

c0gent commented 8 years ago

And the part I just forgot to say, which I hope gives a little more context to the headspace I was in when I wrote my last comment: as of right now, my code using the new types is segfaulting, and I think it's because ClFloat4 doesn't have the right alignment constraints. It might be because of my sketchy transmutes (the constraints on OclPrm seem.. overzealous.. so I had to transmute my buffers to f32 buffers), but they're pretty straightforward.

Yeah I forgot to address this... I'm happy to ease or remove constraints on OclPrm. They are mostly arbitrary. My initial instinct was to be highly restrictive and only allow scalars and vectors since that's all I ever use, figuring I could ease those restrictions if necessary. It's probably not the library's role to try to dictate what people put in their buffers though. I don't know which way to lean here.

bfops commented 8 years ago

In OpenCL buffers are always created so that they start in an "aligned" position. The position of index 0 is always aligned correctly, no matter how you create it. This is true on either a GPU or CPU "device". Buffer creation is simple on GPUs since the runtime manages allocation of memory there and it ensures that buffers are always created in the right spot.

It is likewise impossible to create an OpenCL buffer in a non-aligned spot on a CPU "device" because the only requirement for alignment on a CPU is word-alignment. Since any non-trivial sized array (yes I'm excluding the silly exception of weird, tiny sized arrays here) will always be allocated in a word-aligned position, it will always be a valid and aligned spot for an OpenCL buffer.

Thanks for clarifying! So it sounds like, except on systems with word sizes smaller than 16 bytes, arrays/buffers of primitive types will always be aligned properly.

That said, even though it's almost definitely ill-advised to use arrays-of-structs with OpenCL, I don't see a strong reason not to enforce the alignment constraints on the CPU side? It's strictly more correct, even though it effectively never matters. It's not a huge amount of work, and it would keep this and all future alignment quibbles out of your hair.

Yeah I forgot to address this... I'm happy to ease or remove constraints on OclPrm. They are mostly arbitrary. My initial instinct was to be highly restrictive and only allow scalars and vectors since that's all I ever use, figuring I could ease those restrictions if necessary. It's probably not the library's role to try to dictate what people put in their buffers though. I don't know which way to lean here.

The Copy constraint makes sense to me from the point of view that it implies the data is correct if it's bitwise-copied, but I dislike the implication that this copy is trivial. But it's only nontrivial with big structs, so I'd argue to leave it there.

I personally disagree with the ord, eq, etc constraints: the library doesn't know what the bits mean, or what they're being used for. Maybe the user is explicitly using them to describe a type for which those operations don't make sense (e.g. I don't have an intuitive definition for what a partial ordering on colors would represent), even though of course everything is ultimately a number.

I'm writing on my phone, so I'll try to remember to come back later and address anything I've missed. Cheers!

c0gent commented 8 years ago

... except on systems with word sizes smaller than 16 bytes, arrays/buffers of primitive types will always be aligned properly.

Not quite. So again, let's keep in mind the distinction between the alignment of the start of an array/buffer (I called it global) and the alignment of individual elements within it (micro). There is no 16 byte cutoff for word sizes or anything when we talk about global alignment. Don't get stuck on that number. Again, just talking about CPUs here (GPUs often have big, vectorized word sizes and are a whole different animal). A CPU's word size is generally either 4 bytes (32 bit) or 8 bytes (64 bit). So what I was referring to when I said:

It is likewise impossible to create an OpenCL buffer in a non-aligned spot on a CPU "device" because the only requirement for alignment on a CPU is word-alignment.

... is that for CPUs, regardless of the size of their words (they could be 2-byte (16-bit) words or 32-byte (256-bit) words), when creating arrays, the start of that array will virtually always be considered "aligned". If I create an array on a 32-bit machine (with a 4-byte word size), it will by default have a starting address of some multiple of four (0x00, 0x04, 0x08 etc.). If I create an array on a 64-bit machine (with an 8-byte word size), it will by default have a starting address of some multiple of eight (0x00, 0x08, 0x10 etc.). It will have this because when you allocate memory, whether on the stack or the heap, it will always be word aligned automatically (again, ignoring weird cases where you purposely try to allocate it oddly).

Because allocation automatically word-aligns, and because the only requirement for the position of a start of an array is word-alignment, you can just assume that the start of every array is "aligned" no matter what the size of the elements in the array or what the word size of the machine is.

Now let's talk about what I called micro alignment, the positioning within the array. This is where the alignment can no longer be assumed to be correct for any element other than the first one (at index 0). Since we know that the start position (index 0) of any array is in an "aligned" spot, we only now need to worry about the position of everything else inside (again this is for any data type and any machine word size). In other words, it's possible to have element 0 aligned, but then somehow elements 1, 2, 3, etc. get out of alignment.

This is where the size of the type itself starts to matter and this is where you're getting the 16-byte thing from. Don't get stuck on that number. If we were talking about float8s or short16s we would be dealing with a 32-byte alignment.
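The "micro" alignment idea can be worked through with plain arithmetic: given an aligned start, element i sits at byte offset i * stride, and it is aligned only if that offset is a multiple of the required alignment. The sketch below lists which indices fall off the boundary. misaligned_indices is a name made up for this example.

```rust
/// Indices whose byte offset (`i * stride`) from an aligned start
/// is not a multiple of `align`. Hypothetical helper for illustration.
fn misaligned_indices(stride: usize, align: usize, count: usize) -> Vec<usize> {
    (0..count).filter(|i| (i * stride) % align != 0).collect()
}

fn main() {
    // float4 elements: 16-byte stride, 16-byte alignment. Nothing drifts.
    assert!(misaligned_indices(16, 16, 8).is_empty());
    // float8 elements: 32-byte stride, 32-byte alignment. Same story.
    assert!(misaligned_indices(32, 32, 8).is_empty());
    // Three floats packed at a 12-byte stride but needing 16-byte alignment:
    // only every fourth element lands on a boundary.
    assert_eq!(misaligned_indices(12, 16, 8), vec![1, 2, 3, 5, 6, 7]);
    println!("{:?}", misaligned_indices(12, 16, 8));
}
```

The last case is the float3 situation from the start of the thread: element 0 is fine, but a 12-byte stride pushes most later elements off the 16-byte boundary.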

Pictures here will help because I don't think words can quite convey this so I'll just grab some images from google that I think might help you picture this:

[Images: byte alignment; mis-alignment; SIMD-specific]

Anyway I hope that helps a little, I'm sure my explanation isn't the greatest.

c0gent commented 8 years ago

The Copy constraint makes sense to me from the point of view that it implies the data is correct if it's bitwise-copied, but I dislike the implication that this copy is trivial. But it's only nontrivial with big structs, so I'd argue to leave it there.

I personally disagree with the ord, eq, etc constraints: the library doesn't know what the bits mean, or what they're being used for. Maybe the user is explicitly using them to describe a type for which those operations don't make sense (e.g. I don't have an intuitive definition for what a partial ordering on colors would represent), even though of course everything is ultimately a number.

Which trait is giving you trouble?

pub unsafe trait OclPrm: PartialEq + Copy + Clone + Default + Debug {}

I can get rid of PartialEq, Copy, and/or Default without much trouble but honestly if you're putting a type in there that can't simply #[derive(PartialEq, Clone, Copy, Default)] then you should probably consider redesigning that type because those are pretty basic traits.
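For a sense of what satisfying those bounds involves, here is a minimal sketch using a local stand-in trait with the same supertraits as the line quoted above. Prm, Color and zeroed_buffer are all invented for this example. The real OclPrm lives in the library and may carry safety requirements beyond what is shown here.

```rust
use std::fmt::Debug;

/// Local stand-in mirroring the bounds quoted above; not the library's trait.
unsafe trait Prm: PartialEq + Copy + Clone + Default + Debug {}

/// Hypothetical user type. Every required trait can simply be derived.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct Color {
    r: f32,
    g: f32,
    b: f32,
    a: f32,
}

// The implementor vouches that the type is safe to copy bit for bit.
unsafe impl Prm for Color {}

/// Uses the `Default` and `Clone` bounds to build host-side data.
fn zeroed_buffer<T: Prm>(len: usize) -> Vec<T> {
    vec![T::default(); len]
}

fn main() {
    let buf: Vec<Color> = zeroed_buffer(4);
    assert_eq!(buf.len(), 4);
    assert_eq!(buf[0], Color::default());
    println!("{:?}", buf[0]);
}
```

Note that PartialEq here only means bitwise-style equality of the fields. It does not require an ordering, which was the concern raised about colors.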

c0gent commented 8 years ago

I'll close this for now. Let me know what issues you're having or if you would like to request any changes to the library in a new issue.

bfops commented 8 years ago

Oh I'm with you now. Sorry, my brain's been fried for the last few days. I totally did not get bits and bytes confused for a minute there haha.

Sounds good, thanks for being patient with me!

c0gent commented 8 years ago

Any time :)

c0gent commented 8 years ago

Let me know how things in your ray tracer come along... I'd like to try it out.