Open ZSL98 opened 4 months ago
I have tried that Dockerfile, but the torch patch does not seem to be compatible with PyTorch version 7bcf7da3a268b435777fe87c7794c382f444e86d.
Can you provide the patch for a newer pytorch version? That would be helpful. Thanks!
Hello, thank you for your comment! No, the dockerfile is not ready yet. We are working on open-sourcing a version of Orion compatible with A100 GPUs. The AE fig7 was run on a V100 GPU. I expect the version for A100 GPUs (supporting cuda versions >10.2) will be out in the next few weeks.
I got the same error too, possibly on any GPU other than a V100.
I tried it on a 3070 and an A100, and both give the same error (no kernel image available):
CUDA Runtime Error at: intercept_temp.cpp:453
Error 209, no kernel image is available for execution on the device
python3.8: intercept_temp.h:805: void check(T, const char, const char, int) [with T = cudaError]: Assertion `err == cudaSuccess' failed.
Aborted (core dumped)
@fotstrt, could you please advise whether the non-Docker path works on the 3070 and A100?
TIA.
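For readers hitting the same message: "no kernel image is available" generally means that none of the GPU architectures a binary was compiled for can run on the installed device. The sketch below illustrates the usual CUDA compatibility rule: an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for any device of capability X.Y or newer. The function names are hypothetical, and the idea that the failing kernels were built only for `sm_70` (V100) is an assumption, not something confirmed in this thread.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_70' or 'compute_80' into (kind, major, minor)."""
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)", tag)
    if m is None:
        raise ValueError(f"unrecognized architecture tag: {tag!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def has_kernel_image(device_cc, built_archs):
    """Return True if any compiled architecture can run on the device.

    device_cc   -- (major, minor) compute capability of the GPU
    built_archs -- tags the binary was compiled for, e.g. ['sm_70']
    """
    dev_major, dev_minor = device_cc
    for tag in built_archs:
        kind, major, minor = parse_arch(tag)
        if kind == "sm":
            # Binary code: same major version, device minor >= built minor.
            if major == dev_major and minor <= dev_minor:
                return True
        else:
            # PTX: can be JIT-compiled for any equal or newer capability.
            if (major, minor) <= (dev_major, dev_minor):
                return True
    return False


if __name__ == "__main__":
    # V100 is 7.0, A100 is 8.0, RTX 3070 is 8.6.
    for name, cc in [("V100", (7, 0)), ("A100", (8, 0)), ("3070", (8, 6))]:
        print(name, has_kernel_image(cc, ["sm_70"]))
```

With PyTorch installed, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` return values in the shapes this function expects, although they describe the PyTorch build rather than Orion's own kernels.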
This will be addressed in the following 2 weeks. Thank you!
Hi, I wonder if this is already addressed. I would like to try Orion on CUDA 12.1. Would you please point me to the correct branch? Is it fot/latest_cuda_changes? Thanks!
Hello, there has been a delay, sorry about that.
The branch fot/latest_cuda_changes contains a Dockerfile: https://github.com/eth-easl/orion/blob/fot/latest_cuda_changes/setup/Dockerfile_Cuda12 with which I have tested some basic Orion functionality, but I have not fully tested the system yet (that's why it is not merged).
I plan to do more tests and merge soon.
Thanks for your reply. I tried this Dockerfile to build a CUDA 12.1 image, but it reports many errors. I created the container following the guidance in INSTALL.md, but it does not work.
Moreover, the reason for these errors is that the cuDNN library is not linked or installed correctly.
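One rough way to check this from inside the container is to ask Python whether a cuDNN shared library can be located and loaded. This is only a sketch using the standard `ctypes` module; the helper names are hypothetical, and a successful load does not prove that Orion or PyTorch was linked against that particular copy.

```python
import ctypes
import ctypes.util


def find_cudnn():
    """Return the cuDNN library name located by ctypes.util.find_library, or None."""
    return ctypes.util.find_library("cudnn")


def cudnn_loadable():
    """Return True if a cuDNN shared library is found and can be loaded."""
    name = find_cudnn()
    if name is None:
        return False
    try:
        ctypes.CDLL(name)
    except OSError:
        # The library was found but could not be loaded, for example
        # because one of its own dependencies is missing.
        return False
    return True


if __name__ == "__main__":
    print("cuDNN found as:", find_cudnn())
    print("cuDNN loadable:", cudnn_loadable())
```

If PyTorch is installed, `torch.backends.cudnn.is_available()` gives a second view of the same question from PyTorch's side.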
Is the Dockerfile in the latest_cuda_changes branch runnable on an A100? The container built with the Dockerfile in the main branch seems to have problems running your AE fig7 script 'python run_orion.py'. It reports this error when I run it on my A100 machine: