Open · JoeyBF opened this issue 3 years ago
Maybe it would make sense to do this for larger matrices? My guess is that up to a certain size, the overhead of setting up the GPU to do anything would exceed the benefit. Of course, if you implement this you can benchmark it on matrices of varying sizes to see.
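For concreteness, a standalone timing harness along those lines could look like the sketch below. It uses a toy packed F_2 matrix and plain Gaussian elimination rather than the fp crate's actual types, so the numbers would only be a rough guide to where a cutoff might sit.

```rust
use std::time::Instant;

/// Dense matrix over F_2 with each row packed into 64-bit limbs.
struct MatrixF2 {
    rows: usize,
    limbs: usize,
    data: Vec<u64>,
}

impl MatrixF2 {
    /// Deterministic pseudo-random fill (xorshift); good enough for timing.
    fn random(rows: usize, cols: usize) -> Self {
        let limbs = (cols + 63) / 64;
        let mut x = 0x9E3779B97F4A7C15u64;
        let data: Vec<u64> = (0..rows * limbs)
            .map(|_| {
                x ^= x << 13;
                x ^= x >> 7;
                x ^= x << 17;
                x
            })
            .collect();
        MatrixF2 { rows, limbs, data }
    }

    /// Plain Gaussian elimination over F_2; no M4RI tricks, no parallelism.
    fn row_reduce(&mut self) {
        let mut pivot = 0;
        for col in 0..self.limbs * 64 {
            let (limb, bit) = (col / 64, col % 64);
            // Find a row at or below `pivot` with a 1 in this column.
            let Some(r) = (pivot..self.rows)
                .find(|&r| (self.data[r * self.limbs + limb] >> bit) & 1 == 1)
            else {
                continue;
            };
            for j in 0..self.limbs {
                self.data.swap(r * self.limbs + j, pivot * self.limbs + j);
            }
            // Clear this column in every other row.
            for i in 0..self.rows {
                if i != pivot && (self.data[i * self.limbs + limb] >> bit) & 1 == 1 {
                    for j in 0..self.limbs {
                        let v = self.data[pivot * self.limbs + j];
                        self.data[i * self.limbs + j] ^= v;
                    }
                }
            }
            pivot += 1;
            if pivot == self.rows {
                break;
            }
        }
    }
}

fn main() {
    // Time the CPU baseline at a few sizes to see where a GPU cutoff might sit.
    for n in [128usize, 256, 512, 1024, 2048] {
        let mut m = MatrixF2::random(n, n);
        let start = Instant::now();
        m.row_reduce();
        println!("{n} x {n}: {:?}", start.elapsed());
    }
}
```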
> Maybe it would make sense to do this for larger matrices?
I agree; something around 500 x 500 would probably be a reasonable cutoff to start with. We would also have to keep in mind how to manage the memory, because the matrices quickly become too big to store entirely in a single block's cache. Having to constantly fetch from main memory would negate any potential benefit. Even so, the existence of something like cuBLAS shows that this can be worked around.
So do you think it's something worth working on? I think increasing scalability is important if we want to carry out big computations on something like an HPC grid.
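As a rough sanity check on the cache point above: assuming something like 48 KiB of per-workgroup shared memory (a typical figure for NVIDIA parts, but an assumption here), even a bit-packed F_2 matrix stops fitting not far above the proposed cutoff, so some tiling scheme would be unavoidable.

```rust
fn main() {
    // How big an F_2 matrix fits in ~48 KiB of per-workgroup shared memory?
    // (48 KiB is an assumption; the real figure depends on the card.)
    let shared_mem: usize = 48 * 1024;
    for n in [256usize, 512, 724, 1024, 2048] {
        let bytes = n * n / 8; // entries packed one bit each
        let verdict = if bytes <= shared_mem { "fits" } else { "needs tiling" };
        println!("{n} x {n}: {bytes} bytes, {verdict}");
    }
}
```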
I've looked into the Rust-GPU story every now and then, but it doesn't seem very promising at the moment unless we use C bindings.
On the other hand, if we only want to do this for a few functions, it might not be reasonable to write just that part in C.
So would you recommend using Rust-GPU anyway? I wouldn't mind writing some CUDA code and calling it from the fp crate. Or maybe we would rather use OpenCL or OpenMP to keep it completely open-source. I just want to discuss how to attack this and whether or not it's worth pursuing before diving into the code.
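For what it's worth, if we went the CUDA-plus-FFI route, the Rust side would be small. A hypothetical binding might look like the following; none of these names exist in fp, the signature is made up, and it obviously won't link without a corresponding CUDA/C object file built via something like build.rs.

```rust
// Hypothetical binding to a kernel written in CUDA C and compiled/linked
// separately (e.g. via build.rs); neither the name nor the signature is
// real fp API.
extern "C" {
    /// Row-reduces a dense row-major matrix over F_p in place on the GPU.
    /// Returns the rank, or a negative value on error.
    fn fp_gpu_row_reduce(data: *mut u32, rows: usize, cols: usize, p: u32) -> i64;
}

/// Safe wrapper around the hypothetical GPU entry point.
pub fn row_reduce_gpu(
    data: &mut [u32],
    rows: usize,
    cols: usize,
    p: u32,
) -> Result<usize, String> {
    assert_eq!(data.len(), rows * cols);
    // SAFETY: the slice is valid for rows * cols elements, and the C side is
    // assumed not to keep the pointer after returning.
    let rank = unsafe { fp_gpu_row_reduce(data.as_mut_ptr(), rows, cols, p) };
    if rank < 0 {
        Err(format!("GPU row reduction failed with code {rank}"))
    } else {
        Ok(rank as usize)
    }
}
```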
> whether or not it's worth pursuing
In my opinion it's worth trying. It's really hard to know how much effort it will take or how much of a performance boost we'll get without someone trying. I think it'd be cool for someone to work out how to do this. Presumably you think it would be fun / interesting or else you wouldn't have suggested it. I personally would be interested to find out how it works out.
I spent a good amount of time writing webgl code for the chart rendering, which I would say was objectively much more of a waste of time than this would be.
I think OpenCL is preferable to CUDA; a lot of graphics cards aren't made by NVIDIA.
> Presumably you think it would be fun / interesting or else you wouldn't have suggested it.
I do think it would be a fun challenge, and I think it would really help us push the computations if it pans out.
> I think OpenCL is preferable to CUDA; a lot of graphics cards aren't made by NVIDIA.
That's true, but I worry that OpenCL might be too generic, if that makes sense. The way I see it, most people who will want to run an ext computation, at least in the range where using a GPU becomes important, are probably going to be running the code on some kind of HPC cluster. As far as I know, most of the GPUs there are NVIDIA (e.g. Wayne has K20Xs, P100s and V100s).
Actually, now that I'm looking into it more, something like Vulkan might be the right thing to use. Unlike OpenCL, which covers GPUs as well as FPGAs, DSPs and more exotic hardware, Vulkan is GPU-only, so we can write at a lower level. It also has some Rust support through crates like gfx-rs and vulkano.
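Whichever API we end up with, the kernel itself should be simple. Here is a plain-Rust reference of what a single compute-shader invocation might do during elimination over F_p, one invocation per target row, which is where the parallelism would come from. The data layout and names are made up for illustration.

```rust
/// Reference for what one compute-shader invocation would do: eliminate the
/// pivot column from a single target row over F_p.  On the GPU, `row` and
/// `pivot_row` would live in device memory and each invocation would pick its
/// own `row` via its global id.
fn eliminate_row(row: &mut [u32], pivot_row: &[u32], pivot_col: usize, p: u32) {
    // Assumes the pivot row was already scaled so pivot_row[pivot_col] == 1.
    let factor = row[pivot_col] % p;
    if factor == 0 {
        return;
    }
    for j in 0..row.len() {
        // row[j] -= factor * pivot_row[j]  (mod p), kept in [0, p)
        let sub = (factor as u64 * pivot_row[j] as u64) % p as u64;
        row[j] = ((row[j] as u64 + p as u64 - sub) % p as u64) as u32;
    }
}

fn main() {
    // Tiny check over F_5: eliminate column 0 from the second row.
    let pivot = vec![1u32, 2, 3];
    let mut row = vec![4u32, 1, 0];
    eliminate_row(&mut row, &pivot, 0, 5);
    assert_eq!(row, vec![0u32, 3, 3]); // 4-4*1, 1-4*2, 0-4*3 (mod 5)
}
```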
> something like Vulkan might be the right thing to use
Yeah, Vulkan is supposed to be the new right way to do these things. Since it's pretty new, there will probably be less support / how-to guides / example code, and it's supposed to be extremely verbose.
I do wonder whether we aren't already limited more by memory access speeds than by CPU speed though: it's not clear that the basic operations involved are complicated enough to benefit from graphics card acceleration.
I'm a fan of being able to test things locally and in CI, and CUDA would be problematic on that front.
> I'm a fan of being able to test things locally and in CI, and CUDA would be problematic on that front.
That's a good point. I think Vulkan would be the way to go then.
> I do wonder whether we aren't already limited more by memory access speeds than by CPU speed though: it's not clear that the basic operations involved are complicated enough to benefit from graphics card acceleration.
What I'm thinking is that we could copy the entire matrix to the GPU's RAM and then call some compute kernels that act on it. The copy is the real bottleneck; once the data is resident, running the kernels should be pretty fast. One immediate benefit is that the bandwidth between the GPU cores and GPU memory is substantially higher, so even if we use the same algorithm it might be faster. (The figure I saw was about 100 GB/s for a CPU vs. 380 GB/s for a GPU.)
I was also thinking of using something like the "Method of the Four Russians" that M4RI uses for row reduction. Presumably it is parallelizable, since the M4RI library can use up to 4 cores.
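To sketch the table trick in plain Rust (a toy F_2 version for illustration, not M4RI's actual implementation):

```rust
/// Eliminate `k` pivot columns from every row of `targets` using the
/// "Four Russians" table trick, over F_2 with rows packed into u64 limbs.
/// `pivots[i]` is assumed to have a 1 in column `pivot_cols[i]` and 0 in the
/// other pivot columns (i.e. the k x k pivot block is already reduced).
fn eliminate_block(pivots: &[Vec<u64>], pivot_cols: &[usize], targets: &mut [Vec<u64>]) {
    let k = pivots.len();
    let limbs = pivots[0].len();

    // Precompute all 2^k linear combinations of the pivot rows, each built
    // from a previous entry with a single extra XOR.
    let mut table = vec![vec![0u64; limbs]; 1 << k];
    for mask in 1usize..(1 << k) {
        let i = mask.trailing_zeros() as usize;
        let rest = mask & (mask - 1); // mask with its lowest set bit cleared
        let combo: Vec<u64> = (0..limbs).map(|j| table[rest][j] ^ pivots[i][j]).collect();
        table[mask] = combo;
    }

    // Each target row now needs one table lookup and one row-XOR,
    // instead of up to k separate row operations.
    for row in targets.iter_mut() {
        let mut mask = 0usize;
        for (i, &c) in pivot_cols.iter().enumerate() {
            if (row[c / 64] >> (c % 64)) & 1 == 1 {
                mask |= 1 << i;
            }
        }
        for j in 0..limbs {
            row[j] ^= table[mask][j];
        }
    }
}

fn main() {
    // Toy example: two pivot rows with pivots in columns 0 and 1, 128 columns.
    let pivots = vec![vec![0b01u64, 0], vec![0b10u64, 0]];
    let mut targets = vec![vec![0b11u64, 0xFF]];
    eliminate_block(&pivots, &[0, 1], &mut targets);
    assert_eq!(targets[0], vec![0u64, 0xFF]);
}
```

The block size k is normally chosen small enough that the 2^k-row table fits in cache; on a GPU, the same table could presumably sit in a workgroup's shared memory, with one thread per target row, so the second loop is exactly the part that parallelizes.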
Sounds generally reasonable to me.
Since we make such heavy use of finite field linear algebra, would it make sense to write some compute kernels to offload some computations to a GPU? Maybe something like an F_p-cuBLAS? I think some routines are parallel enough to make it worthwhile. I see there's ticket #19 for making use of M4RI, but I wonder if there's even more performance to be squeezed out of a GPU.