Open 0x6b756d6172 opened 4 years ago
Can you please post your pool configuration file?
Apologies for the delay. This was due to an out-of-date driver. A fix will be applied in the next release.
Hi @alfpark, I have a customer in production having this issue right now! They are a public health lab and they use the results for organ transplants, so truly a matter of life/death. Is there a workaround while the fix is put in place?
You can always override the default GPU driver via pool configuration options. As a workaround, you can temporarily modify your pool.yaml file to contain the following:
gpu:
  nvidia_driver:
    source: "https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run"
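For anyone unsure where that block goes, here is a rough sketch of how it might sit inside a full pool.yaml. It assumes the `gpu` block nests under `pool_specification`, which this thread does not confirm. The `id` and `vm_size` values are placeholders and not taken from this issue; only the `source` URL comes from the workaround above.

```yaml
# pool.yaml (sketch only; placement under pool_specification is an assumption)
pool_specification:
  id: pytorch-gpu-pool        # hypothetical pool name, replace with your own
  vm_size: STANDARD_NC6       # hypothetical GPU VM size, replace with your own
  gpu:
    nvidia_driver:
      # Driver installer URL suggested in the workaround above
      source: "https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run"
```

Since the original pool failed to allocate, it would presumably need to be deleted and recreated for the override to take effect. Treat that as an assumption and check it against your own deployment.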
Problem Description
Batch Shipyard fails to allocate the pool because the NVIDIA driver installation fails. The issue is reproducible with the provided PyTorch-GPU recipe.
Batch Shipyard Version
Docker, v3.9.1
Redacted Configuration
https://github.com/Azure/batch-shipyard/tree/master/recipes/PyTorch-GPU/config
Expected Results
NVIDIA driver installation completes correctly and the pool is created.
Actual Results
NVIDIA driver installation crashes and the pool is not created correctly.
Steps to Reproduce
shipyard stdout
startup/stderr.txt
Additional Comments
A similar issue, which appears to be marked closed: https://github.com/Azure/batch-shipyard/issues/348. Note that no modifications were made to the provided sample recipe beyond connection details.