SJTU-IPADS / reef

REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU scheduling.
Apache License 2.0

REEF for NVIDIA GPUs #7

Open anakli opened 1 year ago

anakli commented 1 year ago

Really interesting work :) Would it be possible to have access to the version of REEF for NVIDIA GPUs that you mention in the paper? Do you plan to make the NVIDIA GPU version open source or is it possible for researchers to get access to a separate repository with that version of REEF?

Thank you!

francis0407 commented 1 year ago

Hi @anakli, Thank you for your interest in our work.

However, I have to clarify that the NVIDIA version of REEF implements only the task preemption mechanism based on queue cleaning and does not include all of the techniques in REEF. As such, it is not currently fully functional, and we do not plan to make it open source or provide access to a separate repository.

That being said, we will soon open-source a preemption library extracted from REEF-N that works on NVIDIA GPUs with CUDA. Once it is available, developers will be able to add preemption capabilities, similar to what REEF-N offers, to other inference systems.
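In the meantime, the host-side core of queue cleaning is small enough to sketch directly in CUDA C++. The snippet below is a toy illustration under invented names (`SoftwareQueue` and `preempt_best_effort` are not the library's actual interface): best-effort kernels are buffered in a host-side queue instead of being launched eagerly, so preemption reduces to discarding that buffer and draining the few kernels already on the stream.

```cpp
#include <cuda_runtime.h>

#include <functional>
#include <list>

// Best-effort kernels are not launched eagerly; each one is buffered as
// a deferred launch in a host-side software queue.
using SoftwareQueue = std::list<std::function<void(cudaStream_t)>>;

// Queue cleaning: no kernel is killed mid-flight. Launches still
// buffered on the host are simply dropped, and the call returns once
// the few kernels already submitted to the stream have drained.
void preempt_best_effort(SoftwareQueue& pending, cudaStream_t stream) {
    pending.clear();                  // clean the software queue
    cudaStreamSynchronize(stream);    // drain the in-flight kernels
}
```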

anakli commented 1 year ago

Thank you for the quick response!

Do you have an expected timeline for when you plan to release the preemption library extracted from REEF-N?

In the meantime, we can also prototype the approach described in Section 4.4 of the paper. I'm wondering about the following two parameters:

  • what size do you assume for the vHQ (how many kernel slots?)
  • what is the "fixed number" of kernels you submit at a time from the vHQ to the GPU runtime?

Thanks!

francis0407 commented 1 year ago

Do you have an expected timeline for when you plan to release the preemption library extracted from REEF-N?

We plan to release it within the next two months. We are currently finalizing some additional features and organizing the code and documentation.

In the meantime, we can also prototype the approach described in Section 4.4 of the paper. I'm wondering about the following two parameters:

  • what size do you assume for the vHQ (how many kernel slots?)

The vHQ is indeed implemented as a linked list, which means that there is no specific limitation on its size. Therefore, you can add as many kernel slots as you need.
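To make that concrete, here is a toy sketch of such a vHQ (the struct layout and names are invented for illustration, not REEF's actual code):

```cpp
#include <cuda_runtime.h>

#include <list>

// One pending kernel launch buffered in the virtual host queue (vHQ).
// Hypothetical layout; REEF's real slot carries more state.
struct KernelSlot {
    const void* func;     // kernel entry point, as passed to cudaLaunchKernel
    dim3 grid, block;     // launch configuration
    void** args;          // pointers to the kernel's arguments
    size_t shared_mem;    // dynamic shared memory, in bytes
};

// The vHQ itself: a linked list, so there is no fixed slot count and
// callers can buffer as many pending kernels as they need.
using VirtualHostQueue = std::list<KernelSlot>;
```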

  • what is the "fixed number" of kernels you submit at a time from the vHQ to the GPU runtime?

The number of kernels kept inside the GPU runtime should depend on the workload's characteristics: there is typically a trade-off between execution latency and preemption latency. We recommend keeping the number in the range of 4 to 16, which strikes a reasonable balance between the two.
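Continuing the hypothetical `KernelSlot`/`VirtualHostQueue` sketch above, the fixed-size submission window could look like this (`MAX_INFLIGHT` stands in for the fixed number; 8 sits inside the recommended 4-16 range):

```cpp
#include <cuda_runtime.h>

#include <deque>

constexpr size_t MAX_INFLIGHT = 8;  // within the recommended 4-16 range

// Forward kernels from the vHQ to the CUDA stream, keeping at most
// MAX_INFLIGHT of them submitted-but-unfinished at any moment.
void pump(VirtualHostQueue& vhq, cudaStream_t stream,
          std::deque<cudaEvent_t>& inflight) {
    // Retire completion events for kernels that have already finished.
    while (!inflight.empty() &&
           cudaEventQuery(inflight.front()) == cudaSuccess) {
        cudaEventDestroy(inflight.front());
        inflight.pop_front();
    }
    // Refill the window. A larger window amortizes launch overhead
    // (better execution latency) but lengthens the drain a preemption
    // must wait for (worse preemption latency).
    while (!vhq.empty() && inflight.size() < MAX_INFLIGHT) {
        KernelSlot s = vhq.front();
        vhq.pop_front();
        cudaLaunchKernel(s.func, s.grid, s.block, s.args,
                         s.shared_mem, stream);
        cudaEvent_t done;
        cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
        cudaEventRecord(done, stream);
        inflight.push_back(done);
    }
}
```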

anakli commented 1 year ago

Thank you!

ujay-zheng commented 11 months ago

@francis0407 I would like to ask whether the REEF-N mentioned above implements DKP. If not, is that because DKP cannot be implemented on NVIDIA? (I read through the paper and tried to implement DKP on NVIDIA, but I am not experienced enough to judge whether it is feasible there, and most of the work in the paper is based on AMD graphics cards, hence my doubt.) If DKP can be implemented on NVIDIA, I will work on implementing it; if not, I would like to know what problems you encountered during the implementation.

ujay-zheng commented 11 months ago

I had a very rough look at the LLVM User Guide for AMDGPU and the User Guide for the NVPTX Back-end. With my limited knowledge, I suspect it won't work on NVIDIA GPUs.

francis0407 commented 11 months ago

Hi @ujay-zheng ,

We didn't implement DKP in REEF-N on NVIDIA GPUs, mainly because many optimizations in DKP need to modify the binary or assembly code of the GPU kernel. For example, when "calling" the candidate kernel inside the proxy kernel, we use a "jump" instruction instead of a "call" to avoid register spilling.

Actually, DKP can be implemented on NVIDIA GPUs, but it would take a lot of engineering effort (i.e., hacking the CUDA SASS binary).
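For context, the naive source-level version of such a proxy kernel is easy to write (toy sketch below with invented names and a deliberately simplistic task-selection policy). The catch is the indirect invocation: in plain CUDA C++ it compiles to a real call, so the compiler must assume the callee clobbers registers and spill around the call site, which is exactly the cost DKP avoids by patching the call into a jump at the SASS level:

```cpp
// Signature that every candidate (best-effort) kernel body is assumed
// to share in this sketch.
typedef void (*CandidateFn)(void** args);

// Naive proxy kernel: each thread block picks a buffered candidate and
// runs its body through a device function pointer. This indirect call
// is what DKP rewrites into a jump in the kernel binary, something
// CUDA C++ source cannot express.
__global__ void proxy_kernel(CandidateFn* candidates, void*** arg_table,
                             int num_candidates) {
    int task = blockIdx.x % num_candidates;   // toy task-selection policy
    candidates[task](arg_table[task]);        // indirect call, not a jump
}
```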

ujay-zheng commented 11 months ago

OK, I got it. Thank you!

pokerfaceSad commented 10 months ago

@francis0407 How is REEF-N going? Already published?


atomicapple0 commented 7 months ago

Bump on this. I am interested in playing around with the device queue capacity restriction feature for NVIDIA GPUs. @francis0407

Alex4210987 commented 1 month ago

Hi! It's very interesting work, and I wonder if it can run on NVIDIA GPUs, or on AMD GPUs other than the AMD Radeon Instinct™ MI50?