axsaucedo opened 4 years ago
Hello, I am Dmitrii, the creator of VkFFT and the Vulkan version of Spirit. I saw your comment and I believe your project is the way to go if Vulkan wants to become popular in compute or, specifically, in the scientific field. There has to be some kind of layer that moves people as far away as possible from the way I developed Vulkan Spirit. There are some important things that have to be clarified at the very beginning, related to architectural problems and how this layer should be designed. These points are based on my experience and I have great faith in them, though some people may disagree, and they have the right to do so. I. Target audience.
I have also created a post on Vulkan reddit, people there may be interested too. https://www.reddit.com/r/vulkan/comments/iods0i/how_an_abstraction_layer_for_a_vulkan_compute/
These are very useful insights @DTolm, thank you very much for sharing your thoughts! And thanks for extending the discussion into the Vulkan subreddit - I didn't know that sub existed, but it looks very useful.
In regards to your points, here are my thoughts (numbered):
The user who is not willing to do research will stick to CUDA, I can almost guarantee you.
I totally see what you mean, and I agree - I don't think it would make sense for this framework to target people who don't want to use the optimizations that Vulkan provides. The initial motivation for this framework came primarily from seeing quite a few people writing a lot of similar code to abstract specialized non-NVIDIA GPU hardware (such as mobile) for advanced data processing such as ML.
These people will aim at the best possible performance, so they will always doubt this layer and consider switching to pure Vulkan if it doesn't perform in the best possible way (this is how I would feel, and why I haven't chosen something like MLIR). Luckily, I know how this can be avoided
I totally agree with you - that is exactly why I wanted to drive forward with the BYOV (bring your own Vulkan) principle, where it should augment the capabilities of Vulkan developers through powerful abstractions without limiting lower-level access to the Vulkan APIs. I would be very keen on exploring the best way to ensure Kompute doesn't get in the way, and provides a baseline for people to work from efficiently (increasing developer workflow efficiency).
a) Create a collection of simple Vulkan shader primitives (reduce, scan) and just publish the code
I could not agree more - this is one of the main motivations for the concept of `kp::Operations` in Kompute. I was also able to create a set of tooling that converts the SPIR-V IR into C++ header files that are compiled with the binary, and the objective is to identify a set of baseline `kp::Operations` such as `kp::OpAlgoMult`, `OpAlgoSum`, etc. (here is the code for the generated shader header file) that provide baseline capabilities for these types of use cases, whilst still providing the interface for users to build their own (both dynamic and static).
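As a rough illustration of the shader-to-header idea, a minimal Python sketch (the actual Kompute tooling works differently; file and array names here are made up) only needs to read the compiled SPIR-V and emit a C++ array:

```python
# Toy sketch of SPIR-V-to-header tooling: read a compiled .spv file and
# emit a C++ header that embeds it as a byte array, so no shader files
# need to ship alongside the binary.
def spirv_to_header(spv_path: str, array_name: str) -> str:
    with open(spv_path, "rb") as f:
        data = f.read()
    body = ",".join(str(b) for b in data)
    return (
        "#pragma once\n"
        f"static const unsigned char {array_name}[] = {{{body}}};\n"
        f"static const unsigned int {array_name}_len = {len(data)};\n"
    )

# e.g. for a shader compiled beforehand with:
#   glslangValidator -V op_mult.comp -o op_mult.spv
print(spirv_to_header("op_mult.spv", "OP_ALGO_MULT_SPIRV"))
```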
b) Create a C/C++ to GLSL command converter script
I see what you mean - I have been researching whether there are any tools that can be used to write shaders via C++. However, there is potentially an opportunity to provide these types of abstractions at a higher level; that is, once users are able to build a large number of `kp::Operations`, these could be abstracted through higher-level languages, such as the Python bindings. That's probably still something to explore - I did think of implementing `kp::Sequence` as an AST instead of a linear sequence of operations, but that's something that could be explored at some point.
These libraries should still be accessible according to 1a, though their code may be self-contained and not copied like the collection in 1a (but it can still be modified inside the library). Take a look at VkFFT.
I totally agree, and I am very curious to dive into the VkFFT codebase, as that does sound quite interesting. This is something that is currently being explored in Kompute by exposing the ability to "pre-record" `kp::Operations` using `kp::Sequence` once on program startup, for example, and then call them dynamically just through `sq->eval()` without having to re-record commands.
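To make the pre-record pattern concrete, here is a minimal sketch using the kp Python bindings (names follow the current bindings and may differ from the API at the time of this discussion; `spirv` is assumed to hold a pre-compiled compute shader):

```python
import kp
import numpy as np

mgr = kp.Manager()
tensor = mgr.tensor(np.array([1.0, 2.0, 3.0], dtype=np.float32))
algo = mgr.algorithm([tensor], spirv)  # spirv: pre-compiled SPIR-V bytes (assumed)

# Record the operations once on startup...
seq = (mgr.sequence()
    .record(kp.OpTensorSyncDevice([tensor]))
    .record(kp.OpAlgoDispatch(algo))
    .record(kp.OpTensorSyncLocal([tensor])))

# ...then submit repeatedly without re-recording command buffers.
for _ in range(100):
    seq.eval()
```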
Aim at zero GPU-on-CPU dependency during execution. This is done in Vulkan Spirit, as it doesn't use the CPU after command buffer creation for anything but asynchronous data saves from the GPU.
This sounds really interesting - I'm not sure I fully understand though: what do you mean by asynchronous data saves from the GPU? Is this specifically into host-visible memory? If it refers to "recreating command buffers", I definitely know what you mean, and I would be keen to hear your thoughts on this. Currently this is achieved through operations like `kp::OpSyncDevice` and `kp::OpSyncLocal`, which create staging buffers only once when created; every time the sequence is evaluated (which is equivalent to a command buffer queue submit), the staging and device buffers are re-used, allowing command buffer re-creation to be avoided.
Thank you very much for taking the time to share your thoughts @DTolm - these are very interesting points, and I would be keen to hear any further thoughts!
Hi Alejandro - congratulations on EthicalML's work with Kompute. It really does simplify the use of Vulkan. One suggestion that I think could help you take it to the next level is to try to implement it as a low-level backend to one of the main deep learning libraries (TensorFlow and PyTorch), similar to what Apple recently did with TensorFlow for macOS. This would enable a much larger share of ML-interested folks to harness the power of their GPUs while benefitting from the existing, highly developed ecosystems around these libraries. Another alternative route is to do the same, but with probabilistic programming packages such as PyMC3 and others, which could really benefit from GPU acceleration.
Anyway, just some thoughts, along with my continued encouragement.
@dkgaraujo I hugely appreciate your suggestions, and I could not agree more! The initial motivations (https://github.com/EthicalML/vulkan-kompute#motivations) that led to the creation of this project were exactly those - it would be an absolutely fantastic milestone to explore integrating Kompute as the backend of one of the existing main deep learning libraries. If this is something you have knowledge of, I would be keen to get some pointers on which library would be best to start with. At this point PyTorch does seem to be growing in popularity, so it could be a good place to start. Do you have experience with the C++ backend of PyTorch by any chance? If not, I can open an issue for now and start documenting initial investigations there.
Many thanks for the positive feedback, @axsaucedo. PyTorch does indeed seem like a good place to start, although of course TensorFlow would also come with an ecosystem of functionalities. Now, while unfortunately my C++ skills are almost nil for practical purposes, looking at the source code of PyTorch, TensorFlow, and RStudio's implementation of PyTorch in R (mlverse/torch), my subjective impression is that perhaps the latter could be a good place to start, given that its source code appears to be more streamlined (again, my subjective impression, and probably correlated with the fact that R torch is not a wrapper around PyTorch but a new implementation altogether).
Another possibility, if the team wants to test the waters before embarking on a more ambitious project, could be to implement Vulkan Kompute as the backend of a more streamlined neural network library - an example that recently crossed my path is iperov/litenn. It basically uses NumPy together with an OpenCL backend, so it could be more amenable to a first try at using Kompute as a neural network backend, and could help scout out any design issues or bugs in the process, thus laying the ground for using it as a backend to the major libraries.
Another option (something I am planning to do) is to write a backend for JAX.
JAX re-implements NumPy with additional features like gradient calculation and just-in-time compilation of functions. The compiled functions are basically a list of primitives like `dot`, `conv`, etc., which are forwarded to the default backend, XLA. One would only need to implement those primitives in Vulkan. Then one could use a NN library like Flax or Elegy (which I am also contributing to), which are based on JAX.
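For illustration, JAX makes that primitive list easy to inspect (a quick sketch, separate from any Kompute integration):

```python
import jax
import jax.numpy as jnp

def f(x, w):
    return jnp.dot(x, w) + 1.0

# Prints the jaxpr, i.e. the primitives (dot_general, add, ...) that a
# Vulkan backend would have to implement for this function.
print(jax.make_jaxpr(f)(jnp.ones((2, 3)), jnp.ones((3, 4))))
```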
I have not really started yet, only done some basic tests. Will start soon, stay tuned.
@alexander-g that sounds quite exciting - I would be very keen to get your thoughts on what may be required to achieve the integration as a backend for JAX, mainly because previous integrations, like the Android JNI and Godot Module ones, required further features to be in place. One of the things still outstanding is to extend feature completeness across Vulkan features, such as enabling shader types beyond buffers (image2d, image3d, etc.), data types beyond floats (int, int32, uint, etc.), and further support for native operations (currently I have only implemented op_mult; op_sum, op_log, etc. are still missing). Please do let me know if you run into any blockers, and I would certainly be interested in your findings as well.
Separate to this, I will be doing a talk on Vulkan & Kompute at the upcoming FOSDEM 2021 (https://fosdem.org/2021/) in the HPC / Data Science track, and would be very keen to showcase some of these findings then if there is any progress - there's still quite a bit of time until then, so it would be great to explore further until then, and of course also afterwards.
@dkgaraujo thank you for the pointers to the other implementations - I agree that other, smaller libraries could be an interesting route as well. I will also have a look at this, and potentially take the initial use case with JAX that Alexander is looking at as a first starting point to explore the features and requirements in the roadmap that would enable these types of use cases. Speed/efficiency will also be key, so optimizations that ensure the best performance will be an important component, especially with the Python bindings.
Right now, one of Vulkan Kompute's really great advantages is that it's lightweight and easy to install, but as it gains more features and capabilities it might eventually grow in size, such that some features end up not being used by everyone but still have to be included. So, how about creating a section, possibly another repository (or repositories), that people can choose extensions from if they need them? That way the core of Vulkan Kompute will remain lightweight & simple, while still having lots of features available.
@aliPMPAINT good point - I think we came across this issue when @alexander-g started exploring adding the GLSL shader compilation. Having said that, that is less of an actual extension and more like utility functions. This would however make a lot of sense for things like Operations. I would be very keen if people are interested in contributing operations (such as an FFT, or a parallel sum aggregate (#27)). At this point I would still be happy to add these operations to the main repo, but I do think at some point it would make sense to have them in a separate `Operations` repo. I'll add an issue for this.
One thing I would be interested in asking everyone is for feedback / thoughts on the talks for the FOSDEM conference I mentioned earlier. I have finished recording both talks around Kompute, and would be very keen to get thoughts / ideas on these videos or potentially other material (as well as ways to share it with the community).
One thing to mention is that in the videos the intro / motivations / Vulkan overview are almost the same, so I'll add a timestamp you'll be able to skip to for the Kompute content.
Video (skip to 11:00 for Kompute section): https://www.youtube.com/watch?v=AJRyZ09IUdg
Video (skip to 13:33): https://www.youtube.com/watch?v=Xz4fiQNmGSA
By the way, the FOSDEM conference is free to attend, so I certainly recommend checking out other tracks / talks if there is interest!
Just watched the video, and I actually think it's thoughtful and comprehensive. One note though: what got me really interested in this project is the fact that it's cross-vendor and cross-platform (as mentioned), and this can really attract AMD GPU users like myself. I recently bought a laptop (with an AMD Radeon RX 5500M - big mistake) and wanted to train some ML models written in a library (TF/PyTorch) on the GPU, but CUDA is Nvidia-only, ROCm doesn't support the Navi series yet even though a year has passed (though they said they will add support this year), and DirectML is Windows-only and just supports TensorFlow 1.15 for now, which is inconvenient. So this project is a kind of life saver for me (even though no ML library has been implemented on it yet), and it was my main motivation for participating: to put an end to this issue once and for all. Just thought I'd give some thoughts on what might get people more interested, as I think this was only mentioned at the end (which actually makes sense, as there is no support for any library yet). Thank you very much for making this happen.
@aliPMPAINT that's great positive feedback, thank you for sharing the key reasons why you found interest in the Kompute framework. These are also the principles that drive our motivation to continue furthering the features and functionality of this framework - namely 1) integrate the Kompute framework into a popular ML / scientific toolkit to enable cross-vendor (and mobile) ML, and 2) contribute to the ongoing discussion around the Vulkan SDK and the topic of open-source, cross-vendor general-purpose GPU computing. Looking forward to continuing to work with this great community to expand and further these and the rest of Kompute's core principles!
An update from my side: I've gone public with my vkJAX project, a JAX interpreter for Vulkan. As of now it covers only an incomplete subset of all JAX/XLA primitives, but it's already enough for ResNet50 inference with the Elegy framework. Moreover, it is very slow - even slower than JAX's CPU backend (which is based on BLAS/LAPACK and thus very optimized). The current development focus is on compatibility; later I will optimize for speed.
This is absolutely awesome @alexander-g! I will have a proper look today and share further thoughts, but this is amazing, especially the ResNet50 example, that looks EPIC!
In regards to the point on speed, that makes absolute sense. I think there are several optimizations that can be explored in the library, as well as in Vulkan Kompute, to ensure we achieve performance as close to optimal as possible. Really keen to dive further into this - I have also identified some interesting areas of optimization for how the Tensors are used, which may be useful to explore further.
Following up on this thread, I want to request further thoughts on the road towards 1.0 - we have now been able to extend Kompute to broader Vulkan capabilities, making this discussion more tangible. I have added an issue to capture the current discussions, as well as a project where the issues will be tracked - it would be great to hear people's thoughts / ideas:
Hello, I got fed up with ROCm a while ago and started to look for Vulkan alternatives, and discovered this project. First I want to say that I really like the possibility of having a cross-OS/GPU-vendor project for GPGPU, and I have started to learn a bit of C++ (and Vulkan later on) to be able to contribute in the future. I was thinking: should the project implement or use an existing BLAS/LAPACK library, and some other libraries for dedicated GPGPU tasks? It could bring Vulkan Kompute on par with CUDA or HIP (at least in usability). What do you think?
Hi,
I am just exploring this project for the cross-platform GPU computing I need for a project. I like how `mgr.Sequence()` makes it really easy to run code on the GPU. However, I think there are not enough options for synchronization.
Right now the only synchronization options (that I can see) are running `eval()` synchronously or using `eval_await()` asynchronously. Both cause the thread to stop, which translates to a loss of time when it could be sending the next batch to the queue. Vulkan 1.2 has the Timeline Semaphores API, which seems like a good solution if we can integrate it into the Kompute API.
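For reference, the current model looks roughly like this (a sketch assuming the kp Python bindings' `eval_async`/`eval_await`; `mgr`, `params`, and `algo` are placeholders):

```python
# Current model: submit asynchronously, but eventually block on a fence.
seq = mgr.sequence()
seq.record(kp.OpTensorSyncDevice(params))
seq.record(kp.OpAlgoDispatch(algo))
seq.eval_async()   # submit work without blocking
# ... CPU is free here, but cannot chain further GPU work on the result ...
seq.eval_await()   # block until the submitted work completes
```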
For example, suppose I have algorithm A using tensors a, algorithm B using tensors b, and algorithm C using tensors a, b, c. A and B are independent, but C is dependent on the results of A and B. We only need the result from C, not the intermediate results from A and B. This is how I wish the code would look in Python (I am not sure if my understanding of Timeline Semaphores is correct - it is kind of confusing):
```python
timeline_a = kp.TimelineSemaphore()
timeline_b = kp.TimelineSemaphore()
timeline_c = kp.TimelineSemaphore()

(sequence
    .record(kp.OpTensorSyncDevice(params_a))
    .eval_async(timeline_a(wait=0, signal=1))  # copy params_a to device asap
    .record(kp.OpAlgoDispatch(algo_a))
    .eval_async(timeline_a(wait=1, signal=2))  # run algo_a after params_a is copied to device
    .record(kp.OpTensorSyncDevice(params_b))
    .eval_async(timeline_b(wait=0, signal=1))  # copy params_b to device asap
    .record(kp.OpAlgoDispatch(algo_b))
    .eval_async(timeline_b(wait=1, signal=2))  # run algo_b after params_b is copied to device
    .record(kp.OpTensorSyncDevice(params_c))
    .eval_async(timeline_c(wait=0, signal=1))  # copy params_c to device asap
    .record(kp.OpAlgoDispatch(algo_c))
    .eval_async(
        timeline_a(wait=2, signal=4),
        timeline_b(wait=2, signal=4),
        timeline_c(wait=1, signal=2))          # run algo_c after algo_a and algo_b finish, and params_c is copied
    .record(kp.OpTensorSyncLocal(params_c))
    .eval_async(timeline_c(wait=2, signal=3))  # copy params_c to host after algo_c is done
    .eval_await(timeline_c(wait=3, signal=4))) # wait for params_c to be copied to host

# now we can use the result from C on host
print([param.data() for param in params_c])
```
There is a (partial) workaround by creating multiple threads and `Sequence` objects, so one thread/Sequence can move data around while the other is waiting. However, this still does not solve the dependency issue, I think. I am not an expert in Vulkan or C++, so what I wrote may be wrong. Maybe there is a better way I do not know of - if you know, please let me know.
Thanks.
Hello, I got fed up with ROCm a while ago and started to look for Vulkan alternatives, and discovered this project. First I want to say that I really like the possibility of having a cross-OS/GPU-vendor project for GPGPU, and I have started to learn a bit of C++ (and Vulkan later on) to be able to contribute in the future. I was thinking: should the project implement or use an existing BLAS/LAPACK library, and some other libraries for dedicated GPGPU tasks? It could bring Vulkan Kompute on par with CUDA or HIP (at least in usability). What do you think?
I think that's a good idea. Currently we're exploring creating a library of "kernels" as operations that can be reused, but the idea would be that higher-level SDKs can be developed on top of Kompute, or use it as a backend to provide more advanced use-case-specific interfaces.
@ChenKuo I think that's a really good idea actually. I am thinking there may be a way to provide a higher-level abstraction, but that would be a good principle to set it on. To be more specific, we already support memory barriers, which enable control within the GPU itself, and, as you pointed out, we currently support fences for host synchronization, namely through eval_async / eval_await. In this case I think adding the semaphore functionality would make complete sense - I will open an issue to continue the discussion there.
@ChenKuo I have just opened #238 to continue the discussion - it would be great if you could provide further thoughts there, as well as some insight into whether the current `OpMemoryBarrier` could help you address the current work without the need for timeline semaphores. You can see an example of this here:
https://github.com/KomputeProject/kompute/blob/master/test/TestMultipleAlgoExecutions.cpp#L99-L115
```cpp
std::shared_ptr<kp::OpMemoryBarrier> shaderBarrier{
    new kp::OpMemoryBarrier({ tensorA },
                            vk::AccessFlagBits::eTransferRead,
                            vk::AccessFlagBits::eShaderWrite,
                            vk::PipelineStageFlagBits::eComputeShader,
                            vk::PipelineStageFlagBits::eComputeShader)
};

mgr.sequence()
  ->record<kp::OpTensorSyncDevice>({ tensorA })
  ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
  ->record(shaderBarrier)
  ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
  ->record(shaderBarrier)
  ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
  ->record<kp::OpTensorSyncLocal>({ tensorA })
  ->eval();
```
@axsaucedo Thanks for your response. I see how I can use `OpMemoryBarrier` to implement dependencies. This way it can also submit everything in one batch, so it should be more efficient than coarse-grained synchronization using semaphores. Where I think semaphores would be useful is synchronizing across different queues, so we can use the result of one queue in another queue. For example, we could run algo_a in queue1 and algo_b in queue2, then use the results from algo_a and algo_b to run algo_c in queue3.
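A rough sketch of the multi-queue setup that would be involved (assuming the Python bindings mirror the C++ `kp::Manager` device/queue parameters - exact names may differ - and `algo_a/b/c` are placeholders; the final cross-queue hand-off is precisely what timeline semaphores would have to provide):

```python
# Request two queue families on device 0 and bind one sequence to each
# (queue family indices here are illustrative and device-specific).
mgr = kp.Manager(0, [0, 2])

seq_a = mgr.sequence(0)   # sequence on the first requested queue
seq_b = mgr.sequence(1)   # sequence on the second requested queue

seq_a.record(kp.OpAlgoDispatch(algo_a)).eval_async()
seq_b.record(kp.OpAlgoDispatch(algo_b)).eval_async()

# Today the only way to make algo_c wait on both is to block the host:
seq_a.eval_await()
seq_b.eval_await()
mgr.sequence().record(kp.OpAlgoDispatch(algo_c)).eval()
```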
I think that's a good idea. Currently we're exploring creating a library of "kernels" as operations that can be reused, but the idea would be that higher-level SDKs can be developed on top of Kompute, or use it as a backend to provide more advanced use-case-specific interfaces.
In my opinion, trying to create a library of premade "kernels" will not be useful for developing "higher-level SDKs" or "use-case-specific interfaces", or at least not beyond the prototyping phase, for the following reasons:
Based on (*), I think the direction you should go in is to make writing a custom shader the primary method for creating an `Algorithm`, and to try to make the process of doing so as easy as possible.
Create a C/C++ to GLSL command converter script, but, in my opinion, this is next to impossible to do in a way that doesn't limit pure shader functionality. But it is more user-friendly.
I also think this is beyond the scope of this project. The responsibility of writing shaders and making sure they work should rest with the users. However, we can add utilities to help the user write reusable and composable shader code, while still giving the user full control of their shader code. Some ideas I have in mind are shader factory methods, shader templates, shader function imports, basic validation, and an integration helper for `kp::Algorithm`. The user can create their custom templates, but we can provide some basic templates as well, as in the sketch below.
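For instance, a toy version of the shader-template idea could be as simple as string substitution over GLSL source (purely illustrative; `dtype` is the only templated parameter here):

```python
# Toy shader template: generate GLSL variants for different data types,
# which can then be compiled to SPIR-V and passed to a kp::Algorithm.
SQUARE_SHADER_TEMPLATE = """
#version 450
layout (local_size_x = 1) in;
layout (set = 0, binding = 0) buffer buf_in  {{ {dtype} a[]; }};
layout (set = 0, binding = 1) buffer buf_out {{ {dtype} b[]; }};
void main() {{
    uint i = gl_GlobalInvocationID.x;
    b[i] = a[i] * a[i];
}}
"""

glsl_float = SQUARE_SHADER_TEMPLATE.format(dtype="float")
glsl_int = SQUARE_SHADER_TEMPLATE.format(dtype="int")
```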
There are several advantages to this approach:
If you want, I can make a more detailed proposal later on, if we think this approach is worth exploring.
I am still learning C++ - not even Vulkan, GLSL, or GPGPU yet - so I guess I was looking at a full SDK. To explain myself: I was thinking of doing something like cuBLAS or cuSOLVER or their ROCm equivalents, but I am not sure how they work or how to use them in GPGPU. I think I can see why your idea is best.
Thank you for sharing your thoughts @ChenKuo - I do agree in large part with your sentiment, and I do feel there should be a core focus on making Kompute serve as a flexible backend for higher-level frameworks. I do feel there would still be value in providing two things:
@axsaucedo I do not understand the use case of OperationAlgoFactory very well. I think some code examples (tentative is fine) would help us understand how it is going to be used. Does it generate shaders dynamically for different types? Or do we pre-generate all possible variations of each shader and load them on demand? From your code in the method `rebuildAlgorithmFromFactory` I only see code for multiplication on the `uint32` type. I am guessing that when you switch tensor type you are also rebuilding the entire `Algorithm` object. I still do not know Vulkan well enough to tell what the best way is, but I think it should let us keep multiple versions of the shader in memory (saved in a cache, maybe) and switch whenever necessary, only rebuilding when there is a cache miss. I need to learn more about Vulkan and dig deeper into your code first.
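To illustrate the cache idea (a hypothetical sketch; `compile_glsl` stands in for whatever GLSL-to-SPIR-V compiler is available, and is not a Kompute function):

```python
# Keep compiled SPIR-V per data type; rebuild only on a cache miss.
spirv_cache = {}

def get_spirv(dtype: str) -> bytes:
    if dtype not in spirv_cache:
        glsl = SQUARE_SHADER_TEMPLATE.format(dtype=dtype)  # template from the earlier sketch
        spirv_cache[dtype] = compile_glsl(glsl)            # hypothetical compiler helper
    return spirv_cache[dtype]

# Switching tensor types then reuses cached shaders instead of rebuilding:
spirv_f = get_spirv("float")    # compiles on first use
spirv_f2 = get_spirv("float")   # cache hit, no rebuild
```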