eth-easl / orion

An interference-aware scheduler for fine-grained GPU sharing
MIT License

Error with newer CUDA version (11.8) #32

Open ZSL98 opened 4 months ago

ZSL98 commented 4 months ago

I get the following error when using CUDA 11.8:

OSError: /root/orion/src/scheduler/scheduler_eval.so: undefined symbol: cudnnSetStream

How can I fix this?
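For readers hitting the same OSError, here is a minimal diagnostic sketch (not part of Orion) that tries to load a shared library with ctypes and extracts the unresolved symbol name from the loader message. The helper names `undefined_symbol` and `try_load` are hypothetical, and the message format is assumed to match the glibc-style text quoted above.

```python
import ctypes
import re

# Matches the tail of a dynamic-loader message such as
# "...scheduler_eval.so: undefined symbol: cudnnSetStream".
_UNDEFINED_SYMBOL = re.compile(r"undefined symbol:\s*(\w+)")


def undefined_symbol(message):
    """Return the symbol name from a loader error message, or None."""
    match = _UNDEFINED_SYMBOL.search(message)
    return match.group(1) if match else None


def try_load(path):
    """Try to load a shared library; return (loaded, missing_symbol)."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # Covers both "file not found" and "undefined symbol" failures;
        # missing_symbol is None unless the message names a symbol.
        return False, undefined_symbol(str(exc))
    return True, None


if __name__ == "__main__":
    # Path taken from the error message in this thread.
    print(try_load("/root/orion/src/scheduler/scheduler_eval.so"))
```

A reported symbol with a `cudnn` prefix suggests the library was built without linking against cuDNN.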

ZSL98 commented 4 months ago

Ah, it seems that the '-lcudnn' flag is required at the compile stage. However, there are still other errors.

When I run:

LD_PRELOAD="/root/orion/src/cuda_capture/libinttemp.so" python3.10 benchmarking/launch_jobs.py --algo orion --config_file /root/orion/artifact_evaluation/example/config.json

I get this error:

python3.10: intercept_cudnn.cpp:177: cudnnStatus_t cudnnBatchNormalizationForwardInference(cudnnHandle_t, cudnnBatchNormMode_t, const void*, const void*, cudnnTensorDescriptor_t, const void*, cudnnTensorDescriptor_t, void*, cudnnTensorDescriptor_t, const void*, const void*, const void*, const void*, double): Assertion 'cudnn_bnorm_infer_func != NULL' failed.
Aborted (core dumped)

ZSL98 commented 4 months ago

It seems that the API capture code only supports CUDA 10.2. Could you please share more about how to apply API capturing with newer CUDA versions? Or might there be other causes?
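The failed assertion on `cudnn_bnorm_infer_func` suggests that the capture library looks up the real cuDNN function at run time and gets a null pointer back. That reading is an assumption; the thread does not show the lookup code. As a rough way to check which of the functions named in this thread a given cuDNN library exports, the sketch below uses ctypes attribute lookup. The helper name `missing_symbols` is hypothetical.

```python
import ctypes
import ctypes.util


def missing_symbols(lib, names):
    """Return the names in `names` that `lib` does not export."""
    # ctypes resolves attributes through the dynamic loader; a failed
    # lookup raises AttributeError, which hasattr() reports as False.
    return [name for name in names if not hasattr(lib, name)]


if __name__ == "__main__":
    # Function names copied from the two error messages in this thread.
    wanted = ["cudnnSetStream", "cudnnBatchNormalizationForwardInference"]
    path = ctypes.util.find_library("cudnn")  # None if cuDNN is not found
    if path is None:
        print("cuDNN was not found on the loader search path")
    else:
        try:
            print("missing:", missing_symbols(ctypes.CDLL(path), wanted))
        except OSError as exc:
            print("could not load cuDNN:", exc)
```

An empty list means the library located by the loader exports both functions, in which case the cause likely lies in how or where the capture code performs its lookup.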

fotstrt commented 4 months ago

Hi, yes, the current open-source version supports only CUDA 10.2. Our next version will enable more up-to-date CUDA libraries. See also my comment in #31.

jelite commented 3 months ago

Can you notify me when this issue is resolved?