haitwang-cloud opened this issue 3 weeks ago
@elezar Could you PTAL?
I already tried the `echo get_default_active_thread_percentage | nvidia-cuda-mps-control` check from https://github.com/NVIDIA/k8s-device-plugin/issues/647; the output looks fine, but the hanging-process issue still happens.
```console
root@gpu-pod:/# echo get_default_active_thread_percentage | nvidia-cuda-mps-control
25.0
```
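For completeness, the control daemon can also report whether an MPS server is actually running; `get_server_list` is a standard `nvidia-cuda-mps-control` command:

```console
root@gpu-pod:/# echo get_server_list | nvidia-cuda-mps-control
```

An empty list here would mean no MPS server has been started, which on its own could explain client processes blocking at CUDA initialization.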
1. Quick Debug Information
2. Issue or feature description
To confirm that the MPS feature can meet our needs, I am running a comprehensive test suite for further analysis.

I've been testing the MPS feature offered by the recently released device plugin v0.15.0 on our K8s bare-metal node equipped with 3 V100s. However, only the standard `cuda-sample:vectoradd` test runs successfully; the rest of the test cases remain stuck in a hang state and never progress.

✅ Passed case: `cuda-sample:vectoradd` (see below)
❌ Failed case: the TensorFlow classification.ipynb end-to-end test (see below)
3. Information to attach (optional if deemed irrelevant)
I am using the following ConfigMap to enable MPS on our GPU bare-metal node, and the following commands to install the latest device plugin; a sketch of both follows.
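The actual ConfigMap and install commands were attached as collapsed snippets. For reference only, a minimal MPS sharing config for the v0.15.0 plugin follows the documented `sharing.mps` schema; `replicas: 4` is an assumption on my part, though it is consistent with the 25.0 default active thread percentage reported above (100 / 4):

```yaml
# Sketch of an MPS sharing config for nvidia-device-plugin v0.15.0.
# replicas: 4 is an assumption; it matches the 25.0% default active
# thread percentage seen in the nvidia-cuda-mps-control output above.
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

And an install sketch via the Helm chart; the `config.name` value pointing at a pre-created ConfigMap is my recollection of the chart's values and should be checked against the chart version you deploy:

```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# --version 0.15.0 pins the release that introduced MPS support;
# config.name is assumed to reference a pre-created ConfigMap.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.15.0 \
  --set config.name=nvidia-device-plugin-configs
```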
And the related logs can be found below.

Logs of the MPS control daemon and device-plugin:
- MPS control daemon logs
- NVIDIA device plugin logs
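For anyone trying to collect the same logs, a sketch of the commands; the namespace and DaemonSet names are assumptions based on a default Helm release called `nvdp` and will differ for other installs:

```shell
# Names assumed from a Helm release called "nvdp" in the
# nvidia-device-plugin namespace; adjust to your install.
kubectl logs -n nvidia-device-plugin ds/nvdp-nvidia-device-plugin --all-containers
kubectl logs -n nvidia-device-plugin ds/nvdp-nvidia-device-plugin-mps-control-daemon --all-containers
```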
✅ Passed case

Run the cuda-samples vectoradd test and check the nvidia-smi log; a sketch of the test pod follows.
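This is roughly the pod used for that test; the image tag is the one NVIDIA uses in its published examples and is an assumption on my part:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-vectoradd-test
spec:
  restartPolicy: Never
  containers:
  - name: vectoradd
    # Image tag assumed from NVIDIA's published examples.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
```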
❌ Failed case
I am using this classification.ipynb to run an end-to-end verification of MPS with TensorFlow, but the pod hangs for over 60 minutes without any response, and the logs do not provide further details (see the minimal repro sketch at the end of this section).
classification.ipynb logs
And here is the current MPS daemon log:
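Finally, a smaller repro than the full notebook might help isolate whether the hang happens already at TensorFlow's GPU initialization; this one-liner is a suggestion, not something taken from the notebook:

```shell
# If this also hangs, the problem is in CUDA context creation under MPS,
# not in the notebook's training code (assumes TensorFlow is in the image).
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```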