KomputeProject / kompute

General purpose GPU compute framework built on Vulkan to support 1000s of cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing use cases. Backed by the Linux Foundation.
http://kompute.cc/
Apache License 2.0

Explore / discuss for potential ideas or improvements #52

Open axsaucedo opened 4 years ago

axsaucedo commented 4 years ago

Open issue to openly discuss potential ideas or improvements, whether on documentation, interfaces, examples, bug fixes, etc.

DTolm commented 4 years ago

Hello, I am Dmitrii, the creator of VkFFT and of the Vulkan version of Spirit. I saw your comment and I believe your project is the way to go if Vulkan wants to become popular in compute or, specifically, in the scientific field. There has to be some kind of layer that moves users as far as possible from the way I developed Vulkan Spirit. There are some important things that have to be clarified at the very beginning, related to architectural problems and how this layer should be designed. These things are based on my experience and I have big faith in them, though some people may disagree and they have the right to do so.

I. Target audience.

  1. Unless the user understands how the GPU works, the code will most certainly be bad, no matter whether it is Vulkan or CUDA. The amount of information involved (threads, blocks, memory strides, shared memory bank conflicts, register spilling and how to mitigate their negative effects) is enormous. A user who is not willing to do research will stick to CUDA, I can almost guarantee you. Thus aiming at them is pointless; their code will be slow and unusable anyway. This doesn't apply to users who are only willing to use third-party frameworks based on good code, like TensorFlow. These people can, however, also be omitted, as they don't really interact with either Vulkan or CUDA directly.
  2. People who understand the GPU or are willing to learn. The gap between the two groups can be huge, but people with considerable CUDA experience can easily switch to Vulkan. These people will aim at the best possible performance, so they will always doubt this layer and consider switching to pure Vulkan if it doesn't perform in the best possible way (this is how I would feel, and why I haven't chosen something like MLIR). Luckily, I know how this can be avoided.

II. How I believe the layer should work.
  1. You can't prevent people from writing shader code. The amount of memory transferred from VRAM to the chip during an iteration/command buffer execution is the main limiting factor in 99% of GPU compute tasks (in VkFFT, memory transfers take 80% of compute time). So your solution of binding one shader to one C command will never be perfect, as the command can most likely be merged with the previous one and performed while the data is still on chip. There are two solutions to this. a) Create a collection of simple Vulkan shader primitives (reduce, scan) and just publish the code, so people can use and modify them as they want. If the code is only 1-2 shaders long, they should be able to copy it into their own shaders, as this will reduce memory transfers. This is hard on the user, but will yield the best possible performance in the end. Take a look at how VP, LBFGS and other solvers are created in Vulkan Spirit (see Vulkan_Compute.hpp): each reduce command there is merged with the last compute call, and all commands are merged between reduce calls. The boilerplate of creating them, though, is the thing the layer should remove in some form. b) Create a C/C++ to GLSL command converter script; in my opinion, however, this is next to impossible to do in a way that doesn't limit pure shader functionality. It is more user-friendly, though.
  2. If any big libraries are created, their inclusion should be done as an apply command, which adds pipeline dispatches to the user's command buffer (a sketch of this pattern appears at the end of this comment). The difference between a library and a primitive is that the user doesn't have to understand the details of how the library works, while the primitive is fairly simple (see: reduce, scan). These libraries should still be accessible according to 1a, though their code may be self-contained and not copied like the collection in 1a (but it can still be modified inside the library). Take a look at VkFFT: you call VkFFTAppend and it adds VkFFT to your command buffer, which is almost the same way cuFFT works. Adding native zero-padding and being able to access every axis and the command space between them is what makes it unique, and why Vulkan Spirit is up to 3x faster than mumax3. Making the interface an apply will also reduce command buffer creation/call overhead.
  3. Aim at zero GPU-on-CPU dependency during execution. This is done in Vulkan Spirit: it doesn't use the CPU for anything after command buffer creation, except for asynchronous data saves from the GPU. The GPU is connected through a PCI-E 3/4 bus, which has only 15/30 GB/s of bandwidth. This is at least 20 times lower than GPU to on-chip memory transfers, and at least another 20 times lower than on-chip data transfers. Stopping and recreating command buffers takes time and should be avoided; Vulkan Spirit doesn't do that at all during the iteration process.

These are the ideas that come to mind right away; I hope they will help, and I will expand on them in the future. I am happy to answer questions. Taking a look at VkFFT and Vulkan Spirit will clarify most of the points I made, even though the code is not that clean. Best regards, Tolmachev Dmitrii
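
To illustrate the apply pattern from point II.2, here is a rough sketch of a library that records its dispatches into a caller-owned sequence instead of submitting work itself. This is written against the Kompute Python API that appears later in this thread; the append_fft helper and spirv_passes argument are made-up names, not a real VkFFT or Kompute interface:

    import kp

    def append_fft(mgr, seq, tensors, spirv_passes):
        # Record one pipeline dispatch per FFT pass into the caller's
        # sequence/command buffer; the library never submits work itself.
        for spirv in spirv_passes:
            seq.record(kp.OpAlgoDispatch(mgr.algorithm(tensors, spirv)))
        return seq

    # The caller owns submission, so the library's dispatches share a
    # command buffer with the caller's own work and no CPU round trip
    # is forced between them:
    # seq = mgr.sequence()
    # append_fft(mgr, seq, [tensor], spirv_passes).eval()
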
DTolm commented 4 years ago

I have also created a post on Vulkan reddit, people there may be interested too. https://www.reddit.com/r/vulkan/comments/iods0i/how_an_abstraction_layer_for_a_vulkan_compute/

axsaucedo commented 4 years ago

These are very useful insights @DTolm, thank you very much for sharing your thoughts! And thanks for extending the discussion into the Vulkan subreddit; I didn't know that sub existed, but it looks very useful.

In regards to your points, here are my thoughts (numbered):

1. Users looking for a GPU optimization framework

The user who is not willing to do research will stick to CUDA, I can almost guarantee you.

I totally see what you mean, and I agree: I don't think it would make sense for this framework to target people who don't want to use the optimizations Vulkan provides. The initial motivation for this framework came primarily from seeing quite a few people writing a lot of similar code to abstract specialized non-NVIDIA GPU hardware (such as mobile) for advanced data processing such as ML.

2. Catering for Vulkan developers

These people will aim at the best possible performance, so they will always doubt this layer and consider switching to pure Vulkan if it doesn't perform in the best possible way (this is how I would feel, and why I haven't chosen something like MLIR). Luckily, I know how this can be avoided.

I totally agree with you; that is exactly why I wanted to drive forward with the BYOV (bring your own Vulkan) principle, where the framework should augment the capabilities of Vulkan developers through powerful abstractions without limiting lower-level access to the Vulkan APIs. I would be very keen on exploring the best way to ensure Kompute doesn't get in the way, and provides a baseline for people to work from efficiently (increasing developer workflow efficiency).

3. Shader code

a) Create a collection of simple Vulkan shader primitives (reduce, scan) and just publish the code

I could not agree more; this is one of the main motivations for the concept of kp::Operations in Kompute. I was also able to create a set of tooling that converts the SPIR-V IR into C++ header files that are compiled into the binary, and the objective is to identify a set of baseline kp::Operations, such as kp::OpAlgoMult, OpAlgoSum, etc. (here is the code for the generated shader header file), that provide baseline capabilities for these types of use cases, whilst still providing the interface for users to build their own (both dynamic and static).
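
As a rough illustration of what such a conversion step can look like, here is a minimal sketch (hypothetical file and symbol names; not the actual Kompute tooling):

    import struct

    def spirv_to_header(spv_path, header_path, name):
        # SPIR-V modules are a stream of 32-bit little-endian words
        with open(spv_path, "rb") as f:
            data = f.read()
        words = struct.unpack(f"<{len(data) // 4}I", data)
        body = ",\n    ".join(f"0x{w:08x}" for w in words)
        with open(header_path, "w") as f:
            f.write("#pragma once\n#include <cstdint>\n\n"
                    f"static const uint32_t {name}[] = {{\n    {body}\n}};\n")

    spirv_to_header("op_mult.comp.spv", "ShaderOpMult.hpp", "shader_op_mult")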

b) Create a C/C++ to GLSL command converter script

I see what you mean. I have been researching whether there are any tools that can be used to write shaders via C++; however, there is potentially an opportunity to provide these types of abstractions at a higher level. That is, once users are able to build a large number of kp::Operations, these could be abstracted with higher-level languages, such as through the Python bindings. That's probably still something to explore; I did consider implementing kp::Sequence as an AST instead of a linear sequence of operations, which could also be explored at some point.

4. Library primitives

These libraries should still be accessible according to 1a, though their code may be self-contained and not copied like the collection in 1a (but it can still be modified inside the library). Take a look at VkFFT.

I totally agree, and I am very curious to dive into the VkFFT codebase, as that does sound quite interesting. This is something that is currently being explored in Kompute, by exposing the ability to "pre-record" kp::Operations using kp::Sequence once (on program startup, for example), and then call them dynamically just through sq->eval() without having to re-record commands.
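
A rough sketch of that pre-record-once / evaluate-many pattern, assuming the 0.8-style Python bindings used elsewhere in this thread and a pre-compiled SPIR-V module on disk (the file name is illustrative):

    import kp
    import numpy as np

    mgr = kp.Manager()
    tensor = mgr.tensor(np.zeros(1024, dtype=np.float32))
    spirv = open("step.comp.spv", "rb").read()  # pre-compiled compute shader

    # Record the whole iteration once, e.g. on program startup
    seq = (mgr.sequence()
           .record(kp.OpTensorSyncDevice([tensor]))
           .record(kp.OpAlgoDispatch(mgr.algorithm([tensor], spirv)))
           .record(kp.OpTensorSyncLocal([tensor])))

    # Re-submit the same command buffer; nothing is re-recorded per iteration
    for _ in range(1000):
        seq.eval()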

5. Zero GPU on CPU dependency

Aim at zero GPU-on-CPU dependency during execution. This is done in Vulkan Spirit: it doesn't use the CPU for anything after command buffer creation, except for asynchronous data saves from the GPU.

This sounds really interesting, though I'm not sure I fully understand: what do you mean by asynchronous data saves? Is this specifically into host-visible memory? If it refers to "recreating command buffers", I definitely know what you mean, and I would be keen to hear your thoughts on this. Currently this is achieved through operations like kp::OpSyncDevice and kp::OpSyncLocal, which create staging buffers only on creation; every time the sequence is evaluated (which is equivalent to a command buffer queue submit), the staging and device buffers are re-used, allowing command buffer re-creation to be avoided.

---

Thank you very much for taking the time to share your thoughts @DTolm - these are very interesting points, and I would be keen to hear any further thoughts!

dkgaraujo commented 3 years ago

Hi Alejandro - congratulations on EthicalML's work with Kompute. It really does simplify the use of Vulkan. One suggestion that I think could help you take it to the next level is to try to implement it as a low-level backend to one of the main deep learning libraries (TensorFlow and PyTorch), similar to what Apple recently did with TensorFlow for macOS. This would enable an incredibly larger share of ML-interested folks to harness the power of their GPUs while benefitting from the existing, highly developed ecosystems around these libraries. Another alternative route is to do the same, but with probabilistic programming packages such as PyMC3 and others, which could really benefit from GPU acceleration.

Anyway, just some thoughts, along with my continued encouragement.

axsaucedo commented 3 years ago

@dkgaraujo I hugely appreciate your suggestions, and I could not agree more! The initial motivations (https://github.com/EthicalML/vulkan-kompute#motivations) that led to the creation of this project were exactly those - it would be an absolutely fantastic milestone to explore integrating Kompute as the backend of one of the existing main deep learning libraries. If this is something that you have knowledge of, I would be keen to get some pointers on what would be the best library to start with. At this point PyTorch does seem to be growing in popularity, so it could be a good place to start. Do you have experience with the C++ backend of PyTorch by any chance? If not, I can open an issue for now and start documenting initial investigations there.

dkgaraujo commented 3 years ago

Many thanks for the positive feedback, @axsaucedo. PyTorch does indeed seem like a good place to start, although of course TensorFlow also comes with an ecosystem of functionalities. Now, while unfortunately my C++ skills are almost nil for practical purposes, looking at the source code of PyTorch, TensorFlow, and RStudio's implementation of PyTorch in R (mlverse/torch), my subjective impression is that perhaps the latter could be a good place to start, given that its source code appears to be more streamlined (again, my subjective impression, and probably correlated with the fact that R torch is not a wrapper around PyTorch, but a new implementation altogether).

Another possibility, if the team wants to test the waters before embarking on a more ambitious project, could be to implement Vulkan Kompute as the backend of a more streamlined neural network library; an example that recently crossed my path is iperov/litenn. It basically uses numpy together with an OpenCL backend, so it could be more amenable to a first try at using Kompute as a neural network backend, and could possibly help scout out any design issues or bugs in the process, thus laying the ground for using it as a backend to the major libraries.

alexander-g commented 3 years ago

Another option (something I am planning to do) is to write a backend for JAX. JAX re-implements NumPy with additional features like gradient calculation and just-in-time compilation of functions. The compiled functions are basically a list of primitives like dot, conv, etc., which are forwarded to the default backend, XLA. One would only need to implement those primitives in Vulkan. Then one could use a NN library like Flax or Elegy (which I am also contributing to), which are based on JAX. I have not really started yet, only done some basic tests. Will start soon, stay tuned.
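
For a concrete view of those primitives, JAX's standard make_jaxpr tracer prints the primitive graph a backend would have to cover; a small example:

    import jax
    import jax.numpy as jnp

    def predict(x, w):
        return jax.nn.relu(jnp.dot(x, w))

    # Prints the jaxpr: the list of primitives (dot_general, max, ...)
    # that a Vulkan backend would need to implement to run this function
    print(jax.make_jaxpr(predict)(jnp.ones((2, 3)), jnp.ones((3, 4))))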

axsaucedo commented 3 years ago

@alexander-g that sounds quite exciting, I would be very keen to get your thoughts on what may be required in order to achieve the integration as a backend for JAX, mainly because previous integrations, like the Android JNI and Godot Module ones, required further features to be in place. One of the things still outstanding is to extend feature completeness on the Vulkan side, such as enabling shader types beyond buffers (image2d, image3d, etc), data types beyond floats (int, int32, uint, etc), or even further support for native operations (currently I have only implemented op_mult; e.g. op_sum, op_log, etc. are still missing). Please do let me know if you run into any blockers; I would certainly be interested in your findings as well.

Separate to this, I will be doing a talk on Vulkan & Kompute during the upcoming FOSDEM 2021 (https://fosdem.org/2021/) in the HPC / Data Science track, and would be very keen to showcase some of these findings then if there is any progress - there's still quite a bit of time until then, so it would be great to explore further until then, and of course also after.

@dkgaraujo thank you for the pointers to the other implementations; I agree that other smaller libraries could be an interesting route as well. I will also have a look at this, and potentially take the initial use case with JAX that Alexander is looking at as a starting point to explore the features and requirements in the roadmap that would enable these types of use cases. Speed/efficiency will also be key, so optimizations that ensure the best performance will be a key component, especially with the Python bindings.

unexploredtest commented 3 years ago

Right now, one of Vulkan Kompute's really great advantages is that it's lightweight and easy to install, but as it gains more features and capabilities it might eventually grow in size, and some features might end up not being used by everyone but still have to be included. So, how about creating a section, possibly another repository (or repositories), that people can choose extensions from if they need them? That way the core of Vulkan Kompute will remain lightweight & simple, while still having lots of features available.

axsaucedo commented 3 years ago

@aliPMPAINT good point, I think we came across this issue when @alexander-g started exploring adding the GLSL shader compilation. Having said that, that is less of an actual extension and more like utility functions. This would however make a lot of sense for things like Operations. I would be very keen if people are interested in contributing operations (such as an FFT, or a parallel sum aggregate (#27)). At this point I would still be happy to add these operations to the main repo, but I do think at some point it would make sense to have them in a separate Operations repo. I'll add an issue for this.

axsaucedo commented 3 years ago

One thing that I would be interested to ask everyone is for feedback / thoughts on the talks for the FOSDEM conference I mentioned earlier. I have finished recording both talks around Kompute, and would be very keen to get thoughts / ideas on these videos or potentially other material (as well as ways to share it with the community).

One thing to mention is that in both videos the intro / motivations and Vulkan overview are almost the same, so I'll add a timestamp you'll be able to skip to for the Kompute content.

Talk 1

Track: FOSDEM Python

Talk Title: "Beyond CUDA: GPU Accelerated Python on Cross-Vendor Graphics Cards with Vulkan & Kompute"

Video (skip to 11:00 for Kompute section): https://www.youtube.com/watch?v=AJRyZ09IUdg

Talk 2

Track: FOSDEM HPC Track

Talk Title: GPU computing using Vulkan & Kompute for Cross-vendor Graphic Cards (AMD, Qualcomm, NVIDIA & friends)

Video (skip to 13:33 for Kompute section): https://www.youtube.com/watch?v=Xz4fiQNmGSA

By the way, the FOSDEM conference is free to attend, so I certainly recommend checking out other tracks / talks if there is interest!

unexploredtest commented 3 years ago

Just watched the video, and I actually think it's thoughtful and comprehensive. One note though: what got me really interested in this project is the fact that it's cross-vendor and cross-platform (as mentioned), and this can really attract AMD GPU users like myself. I recently bought a laptop (with an AMD Radeon RX 5500M, big mistake) and wanted to train some ML models written in a library (TF/PyTorch) on the GPU, but CUDA is Nvidia-only, ROCm doesn't support the Navi series yet even though a year has passed (though they said they will add the support this year), and DirectML is Windows-only and just supports TensorFlow 1.15 for now, which is inconvenient. So this project is a kind of lifesaver for me (even though there hasn't been an ML library implemented yet), and it was my main motivation for participating: to put an end to this issue once and for all. I just thought about giving thoughts on what might get people more interested, as I think this was only mentioned at the end (which actually makes sense, as there is no support for any library yet). Thank you very much for making this happen.

axsaucedo commented 3 years ago

@aliPMPAINT that's great positive feedback, thank you for sharing the key reasons why you found interest in the Kompute framework. These are also the principles that drive our motivation to continue furthering the features and functionality of this framework - namely to 1) integrate the Kompute framework into a popular ML / scientific toolkit to enable cross-vendor (and mobile) ML, and 2) contribute to the ongoing discussion around the Vulkan SDK and the topic of open-source, cross-vendor general-purpose GPU computing. Looking forward to continuing to work with this great community to expand and further these and the rest of the core principles of Kompute!

alexander-g commented 3 years ago

An update from my side: I've gone public with my vkJAX project, a JAX interpreter for Vulkan. As of now it only covers an incomplete subset of all JAX/XLA primitives, but it's already enough for ResNet50 inference with the Elegy framework. Moreover, it is very slow, even slower than JAX's CPU backend (which is based on BLAS/LAPACK and thus very optimized). The current development focus lies on compatibility; later I will optimize for speed.

axsaucedo commented 3 years ago

This is absolutely awesome @alexander-g! I will have a proper look today and share further thoughts, but this is amazing, especially the ResNet50 example, that looks EPIC!

In regards to the point on speed, that makes absolute sense. I think there are several optimizations that can be explored in the library, as well as in Vulkan Kompute, to ensure we achieve the best performance possible. Really keen to dive further into this - I have also identified some interesting areas of optimization for how the Tensors are used, which may be useful to explore further.

axsaucedo commented 3 years ago

Following up on this thread, I want to request further thoughts on the road towards 1.0 - we have now been able to extend the framework to broader Vulkan capabilities, making this discussion more tangible. I have added an issue to capture the current discussions, as well as a project where the issues will be tracked - it would be great to hear people's thoughts / ideas:

mauvray commented 3 years ago

Hello, I got fed up with ROCm a while ago and started to look for Vulkan alternatives, and discovered this project. First I want to say that I really like the possibility of having a cross-OS/GPU-vendor project for GPGPU, and I have started to learn a bit of C++ (and Vulkan later on) to be able to contribute in the future. I was thinking: should the project implement or use an existing BLAS/LAPACK lib, and some other libs for dedicated GPGPU tasks? It could bring Vulkan Kompute on par with CUDA or HIP (at least in usability). What do you think?

ChenKuo commented 3 years ago

Hi, I am just exploring this project for the cross-platform GPU computing I need for a project. I like how mgr.sequence() makes it really easy to run code on the GPU. However, I think there are not enough options for synchronization.

Right now the only synchronization options (that I can see) are running eval() synchronously or using eval_await() after eval_async(). Both cause the thread to stop, which translates to a loss of time when it could be sending the next batch to the queue. Vulkan 1.2 has the Timeline Semaphores API, which seems like a good solution if we can integrate it into the Kompute API.

For example, suppose I have algorithm A using tensors a, algorithm B using tensors b, and algorithm C using tensors a, b, c. A and B are independent, but C is dependent on the results of A and B. We only need the result from C, not the intermediate results from A and B. This is how I wish the code would look in Python (I am not sure if my understanding of Timeline Semaphores is correct; it is kind of confusing):

timeline_a = kp.TimelineSemaphore()
timeline_b = kp.TimelineSemaphore()
timeline_c = kp.TimelineSemaphore()

(sequence
    .record(kp.OpTensorSyncDevice(params_a))
    .eval_async(timeline_a(wait=0, signal=1))   # copy params_a to device asap
    .record(kp.OpAlgoDispatch(algo_a))
    .eval_async(timeline_a(wait=1, signal=2))   # run algo_a after params_a is copied to device
    .record(kp.OpTensorSyncDevice(params_b))
    .eval_async(timeline_b(wait=0, signal=1))   # copy params_b to device asap
    .record(kp.OpAlgoDispatch(algo_b))
    .eval_async(timeline_b(wait=1, signal=2))   # run algo_b after params_b is copied to device
    .record(kp.OpTensorSyncDevice(params_c))
    .eval_async(timeline_c(wait=0, signal=1))   # copy params_c to device asap
    .record(kp.OpAlgoDispatch(algo_c))
    .eval_async(
        timeline_a(wait=2, signal=4),
        timeline_b(wait=2, signal=4),
        timeline_c(wait=1, signal=2))           # run algo_c after algo_a and algo_b finish, and params_c is copied
    .record(kp.OpTensorSyncLocal(params_c))
    .eval_async(timeline_c(wait=2, signal=3))   # copy params_c to host after algo_c is done
    .eval_await(timeline_c(wait=3, signal=4)))  # wait for params_c to be copied to host

# now we can use the result from C on host
print([param.data() for param in params_c])

There is a (partial) workaround: creating multiple threads and Sequence objects, so one thread/Sequence can move data around while another is waiting (see the sketch below). However, I think this still does not solve the dependency issue. I am not an expert in Vulkan or C++, so what I wrote may be wrong; maybe there is a better way I do not know of. If you know, please let me know. Thanks.
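
For reference, a rough sketch of that multi-Sequence workaround, reusing the names from the example above and assuming the kp Python API:

    # One sequence runs algo_a on the GPU while a second sequence
    # uploads params_b from the host at the same time.
    seq_a = mgr.sequence()
    seq_b = mgr.sequence()

    seq_a.record(kp.OpAlgoDispatch(algo_a)).eval_async()  # GPU starts algo_a
    seq_b.record(kp.OpTensorSyncDevice(params_b)).eval()  # host copy overlaps
    seq_a.eval_await()  # block only when algo_a's result is actually needed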

axsaucedo commented 3 years ago

Hello, I got fed up with ROCm a while ago and started to look for Vulkan alternatives, and discovered this project. First I want to say that I really like the possibility of having a cross-OS/GPU-vendor project for GPGPU, and I have started to learn a bit of C++ (and Vulkan later on) to be able to contribute in the future. I was thinking: should the project implement or use an existing BLAS/LAPACK lib, and some other libs for dedicated GPGPU tasks? It could bring Vulkan Kompute on par with CUDA or HIP (at least in usability). What do you think?

I think that's a good idea. Currently we're exploring creating a library of "kernels" as operations that can be reused, but the idea would be that higher-level SDKs can be developed on top of Kompute, or use it as a backend, to provide more advanced use-case-specific interfaces.

axsaucedo commented 3 years ago

@ChenKuo I think that's a really good idea actually. I am thinking there may be a way to provide a higher-level abstraction, but that would be a good principle to build it on. To be more specific, we already support memory barriers, which enable control on the GPU itself, and, as you pointed out, we currently support fences to allow for host synchronization, namely through eval_async / eval_await. In this case I think adding the semaphore functionality would make complete sense; I will open an issue to continue the discussion there.

axsaucedo commented 3 years ago

@ChenKuo I have just opened #238 to continue the discussion. It would be great if you could provide further thoughts there, and also some insight into whether the current OpMemoryBarrier could actually help you address the current work without the need for timeline semaphores. You can see an example of this here:

https://github.com/KomputeProject/kompute/blob/master/test/TestMultipleAlgoExecutions.cpp#L99-L115

        std::shared_ptr<kp::OpMemoryBarrier> shaderBarrier{
                  new kp::OpMemoryBarrier({ tensorA },
                  vk::AccessFlagBits::eTransferRead,
                  vk::AccessFlagBits::eShaderWrite,
                  vk::PipelineStageFlagBits::eComputeShader,
                  vk::PipelineStageFlagBits::eComputeShader)
        };

        mgr.sequence()
          ->record<kp::OpTensorSyncDevice>({ tensorA })
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record(shaderBarrier)
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record(shaderBarrier)
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpTensorSyncLocal>({ tensorA })
          ->eval();

ChenKuo commented 3 years ago

@axsaucedo Thanks for your response. I see how I can use OpMemoryBarrier to implement the dependency. This way it can also submit everything in one batch, so it should be more efficient than coarse-grained synchronization using semaphores. Where I think semaphores would be useful is synchronizing across different queues, so we can use the result of one queue in another queue. For example, we could run algo_a in queue1 and algo_b in queue2, then use the results from algo_a and algo_b to run algo_c in queue3.

ChenKuo commented 3 years ago

I think that's a good idea. Currently we're exploring creating a library of "kernels" as operations that can be reused, but the idea would be that higher-level SDKs can be developed on top of Kompute, or use it as a backend, to provide more advanced use-case-specific interfaces.

In my opinion, trying to create a library of premade "kernels" will not be useful for developing "higher level SDKs" or "use-case specific interfaces", or at least not beyond the prototyping phase, for the following reasons:

  1. You can never create enough premade kernels for every use case. Even if you only aim to support the most basic operations, a slight variation will require a different kernel (e.g. float vs. int vs. uint8), and with each combination the number increases exponentially. (If I understand correctly, your "kernel" means a SPIR-V shader.)
  2. The premade kernels would probably not be optimized for each different use case (unless they can be combined and compiled down to a single shader, but I think that is not what you mean). And there isn't anything the user could do to optimize either, besides remaking the kernel. I can't speak for everyone, but I think the users of this project are mainly developers trying to optimize, and this goes against their intent.
  3. To expand on point 2, it is not easy to implement a system for building complex operations from a base set of kernels in a user-friendly manner. There are notable examples of such systems (TensorFlow, PyTorch), but from what I understand, your goal is not to create a machine learning framework, but to create a low-level framework for building the GPU backend for these sorts of higher-level frameworks (*).

Based on (*), I think the direction you should go in is to make writing a custom shader the primary method for creating an Algorithm, and to try to make the process of doing so as easy as possible.

Create a C/C++ to GLSL command converter script; in my opinion, however, this is next to impossible to do in a way that doesn't limit pure shader functionality. It is more user-friendly, though.

I also think this is beyond the scope of this project. The responsibility of writing shaders and making sure they work should rest on the users. However, we can add utilities to help the user write reusable and composable shader code, while still giving the user full control of their shader code. Some ideas I have in mind are shader factory methods, shader templates, shader function imports, basic validation, and an integration helper for kp::Algorithm. The user can create their custom templates, but we can provide some basic templates as well (a toy sketch follows the list below). There are several advantages to this approach:

  1. The utility developer requires only a little knowledge of basic shader syntax, and does not need to implement any shader operations. It is also very easy to implement, as it is just a string manipulation method in essence; I can probably build a prototype in a weekend.
  2. The user can build a shader using a factory method at a high level with existing templates, without even seeing any shader code. Or they can create new templates with as low a level of control as they need to go.
  3. Templates are highly composable and reusable. Simple templates can be combined to any degree of complexity. We can build a library of templates.
  4. We do not need to make any changes to the Kompute API, since the shader compilation does not need to happen at runtime. There is no performance cost at all.
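
As a toy sketch of the idea (plain string substitution; nothing here is an existing Kompute API):

    # One template, many dtype variants; "op" is the per-element expression.
    SHADER_TEMPLATE = """
    #version 450
    layout(local_size_x = {local_size}) in;
    layout(binding = 0) buffer A {{ {dtype} a[]; }};
    layout(binding = 1) buffer B {{ {dtype} b[]; }};
    void main() {{
        uint i = gl_GlobalInvocationID.x;
        b[i] = {op};
    }}
    """

    def make_shader(dtype="float", op="a[i] * a[i]", local_size=256):
        return SHADER_TEMPLATE.format(dtype=dtype, op=op, local_size=local_size)

    float_src = make_shader()            # float variant
    int_src = make_shader(dtype="int")   # int variant of the same template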

If you want, I can make a more detailed proposal later on, if we think this approach is worth exploring.

mauvray commented 3 years ago

I am still learning C++, and haven't started on Vulkan, GLSL or GPGPU yet, so I guess I was looking for a full SDK. To explain myself: I was thinking of doing something like cuBLAS or cuSOLVER or their ROCm equivalents, but I am not sure how they work or how to use them for GPGPU. I think I can see why your idea is best.

axsaucedo commented 3 years ago

Thank you for sharing your thoughts @ChenKuo. I do agree in large part with your sentiment, and I do feel there should be a core focus on making Kompute serve as a flexible backend for higher-level frameworks. I do feel there would still be value in providing two things:

  1. A library of prebuilt kernels that implement various different applications, together with respective benchmarks (potentially as a separate repo), that could serve as a base for people to extend to fit their use cases, as I completely agree that most of the default kernels / operations would require some further optimization to work for relevant use cases.
  2. As suggested, we started work on an OperationAlgoFactory that aims to provide templating logic to make it simpler to support multiple variations of a single operation, such as allowing multiple types to be supported as inputs. However, the ideal implementation has not been identified yet, so any thoughts / suggestions would be very much appreciated - this is the PR: https://github.com/KomputeProject/kompute/pull/173

ChenKuo commented 3 years ago

@axsaucedo I do not understand the use case of the OperationAlgoFactory very well; I think some code examples (tentative is fine) would help us understand how it is going to be used. Does it generate shaders dynamically for different types, or do we pre-generate all possible variations of each shader and load them on demand? From your code in the rebuildAlgorithmFromFactory method I only see code for multiplication on the uint32 type. I am guessing that when you switch tensor type you are also rebuilding the entire Algorithm object. I still do not know Vulkan well enough to tell what the best way is, but I think it should let us keep multiple versions of the shader in memory (saved in a cache, maybe) and switch whenever necessary, only rebuilding when there is a cache miss. I need to learn more about Vulkan and dig deeper into your code first.
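
A minimal sketch of that cache-and-switch idea (hypothetical helpers; compile_shader stands in for whatever GLSL-to-SPIR-V compiler is used):

    from functools import lru_cache

    def compile_shader(source):
        # Stand-in for a real GLSL -> SPIR-V compile step (e.g. glslang)
        raise NotImplementedError

    @lru_cache(maxsize=None)
    def get_spirv(op_name, dtype):
        # One compiled module is kept per (operation, dtype) pair;
        # lru_cache rebuilds only on a cache miss, as suggested above.
        source = f"// {op_name} shader specialised for {dtype}\n"  # template expansion would go here
        return compile_shader(source)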