eth-easl / orion

An interference-aware scheduler for fine-grained GPU sharing
MIT License

How to run on A100? #31

Open ZSL98 opened 4 months ago

ZSL98 commented 4 months ago

Is the Dockerfile in the latest_cuda_changes branch runnable on an A100? The container built from the Dockerfile in the main branch fails on my A100 machine when I run your AE fig7 script ('python run_orion.py'), reporting this error:

CUDA Runtime Error at: intercept_temp.cpp:453
Error 209, no kernel image is available for execution on the device

ZSL98 commented 4 months ago

I have tried that Dockerfile, but the torch patch does not seem to be compatible with PyTorch version 7bcf7da3a268b435777fe87c7794c382f444e86d.

ZSL98 commented 4 months ago

Can you provide the patch for a newer pytorch version? That would be helpful. Thanks!

fotstrt commented 4 months ago

Hello, thank you for your comment! No, the Dockerfile is not ready yet. We are working on open-sourcing a version of Orion compatible with A100 GPUs. The AE fig7 experiment was run on a V100 GPU. I expect the version for A100 GPUs (supporting CUDA versions >10.2) to be out in the next few weeks.
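For readers hitting the same error: CUDA error 209 (`cudaErrorNoKernelImageForDevice`) means the binary contains no kernel image that the current GPU can execute. The sketch below illustrates the general CUDA compatibility rule behind it. It is not Orion code, and the list of architectures the V100 build was compiled for (`["sm_70"]`) is an assumption made for illustration, not something stated in this thread. The helper names `parse_arch` and `can_run` are hypothetical.

```python
from typing import Iterable, Tuple

# Compute capabilities of the GPUs mentioned in this thread (public NVIDIA specs).
DEVICE_CAPABILITY = {
    "V100": (7, 0),
    "A100": (8, 0),
    "RTX 3070": (8, 6),
}


def parse_arch(arch: str) -> Tuple[str, int, int]:
    """Split e.g. 'sm_70' or 'compute_86' into (kind, major, minor).

    Architecture-specific suffixes such as 'sm_90a' are not handled here.
    """
    kind, digits = arch.split("_")
    return kind, int(digits[:-1]), int(digits[-1])


def can_run(compiled_archs: Iterable[str], device_cc: Tuple[int, int]) -> bool:
    """Return True if any compiled target is usable on the given device."""
    major, minor = device_cc
    for arch in compiled_archs:
        kind, a_major, a_minor = parse_arch(arch)
        if kind == "sm":
            # Binary (SASS) code runs only on the same major version,
            # with an equal or higher minor version.
            if a_major == major and a_minor <= minor:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any equal or newer capability.
            if (a_major, a_minor) <= (major, minor):
                return True
    return False


if __name__ == "__main__":
    assumed_build = ["sm_70"]  # ASSUMPTION: a V100-only build without PTX
    for name, cc in DEVICE_CAPABILITY.items():
        print(name, "can run assumed build:", can_run(assumed_build, cc))
```

Under that assumption, only the V100 passes the check, which would match the reports here for the A100 and the 3070. CUDA 10.2 cannot generate code for compute capability 8.x at all, so a newer toolkit is needed regardless of the build flags.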

kzos commented 3 months ago

I got the same error too, possibly on any GPU other than a V100.

I tried it on a 3070 and an A100 and got the same error on both (no kernel image available):

CUDA Runtime Error at: intercept_temp.cpp:453
Error 209, no kernel image is available for execution on the device
python3.8: intercept_temp.h:805: void check(T, const char, const char, int) [with T = cudaError]: Assertion `err == cudaSuccess' failed.
Aborted (core dumped)


@fotstrt, could you please advise whether the non-Docker path works on the 3070 and A100?

TIA.

fotstrt commented 3 months ago

This will be addressed in the following 2 weeks. Thank you!

jiashu-z commented 1 month ago

Hi, I wonder if this has already been addressed. I would like to try Orion on CUDA 12.1. Would you please point me to the correct branch? Is it fot/latest_cuda_changes? Thanks!

fotstrt commented 1 month ago

Hello, there has been a delay, sorry about that.

The branch fot/latest_cuda_changes contains a Dockerfile: https://github.com/eth-easl/orion/blob/fot/latest_cuda_changes/setup/Dockerfile_Cuda12 with which I have tested some basic Orion functionality, but I have not fully tested the system yet (that's why it is not merged).

I plan to do more tests and merge soon.

jzxycsjzy commented 1 month ago

Thanks for your reply. I tried this Dockerfile to build a CUDA 12.1 image, but it reports many errors. I created the container following the guidance in INSTALL.md, but it does not work.

jzxycsjzy commented 3 weeks ago

Moreover, the reason for these errors is that the cuDNN library is not linked or installed correctly.
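One quick way to check whether a shared library such as cuDNN is visible to the dynamic linker inside the container is sketched below. It uses only the Python standard library and is not part of Orion. The helper names `locate_library` and `can_load` are hypothetical, and the library names probed in the usage loop are assumptions about what a CUDA image typically ships.

```python
import ctypes
import ctypes.util
from typing import Optional


def locate_library(name: str) -> Optional[str]:
    """Return what the system linker resolves for `name`, or None if not found."""
    return ctypes.util.find_library(name)


def can_load(name: str) -> bool:
    """Return True only if the library is found and can actually be loaded."""
    found = locate_library(name)
    if found is None:
        return False
    try:
        ctypes.CDLL(found)
    except OSError:
        # Found by name, but loading failed (e.g. a missing dependency).
        return False
    return True


if __name__ == "__main__":
    # ASSUMPTION: these are the short names of the libraries to probe.
    for lib in ("cudnn", "cudart"):
        print(lib, "resolved to:", locate_library(lib), "| loadable:", can_load(lib))
```

If `cudnn` resolves to `None` here, the library is either not installed in the image or its directory is not on the linker search path, which would be consistent with the build errors described above.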