Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

Pool Allocation Fails Due to Nvidia Installation #357

Open 0x6b756d6172 opened 4 years ago

0x6b756d6172 commented 4 years ago

Problem Description

Batch Shipyard fails to allocate the pool because the NVIDIA driver installation fails on the compute node. The issue is reproducible with the provided PyTorch-GPU recipe.

Batch Shipyard Version

Docker, v3.9.1

Redacted Configuration

https://github.com/Azure/batch-shipyard/tree/master/recipes/PyTorch-GPU/config
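For reference, the pool parameters that surface in the allocation log below correspond to roughly the following pool.yaml settings (a sketch reconstructed from the log output, not the actual recipe file; see the linked config for the full, exact configuration):

pool_specification:
  id: pytorch-gpu
  vm_size: standard_nc6
  vm_count:
    dedicated: 1
    low_priority: 0
  # other recipe settings (platform image, ssh, etc.) omitted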

Expected Results

The NVIDIA driver installation completes and the pool is created

Actual Results

The NVIDIA driver installation fails and the pool node ends up in the start task failed state, so the pool is never usable

Steps to Reproduce

# clone
git clone https://github.com/Azure/batch-shipyard.git

# mount the PyTorch-GPU recipe config folder into the container and start a bash shell
sudo docker run -it -u 1000 --entrypoint /bin/bash -v /home/user/Workspace/batch-shipyard/recipes/PyTorch-GPU/config:/srv mcr.microsoft.com/azure-batch/shipyard:latest-cli

# attempt to create pool
cd /srv
/opt/batch-shipyard/shipyard.py pool add
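
# Equivalent invocation without changing directories; the --configdir option is an
# assumption based on the documented Batch Shipyard CLI options, not part of the original report
/opt/batch-shipyard/shipyard.py pool add --configdir /srv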

shipyard stdout

2020-08-14 00:41:37.777 INFO - Attempting to create pool: pytorch-gpu
2020-08-14 00:41:38.254 INFO - Created pool: pytorch-gpu
2020-08-14 00:41:38.255 DEBUG - waiting for all nodes in pool pytorch-gpu to reach one of: frozenset({<ComputeNodeState.unusable: 'unusable'>, <ComputeNodeState.idle: 'idle'>, <ComputeNodeState.start_task_failed: 'starttaskfailed'>, <ComputeNodeState.preempted: 'preempted'>})
2020-08-14 00:41:59.795 DEBUG - waiting for 1 dedicated nodes and 0 low priority nodes of size standard_nc6 to reach desired state in pool pytorch-gpu [resize_timeout=0:15:00 allocation_state=resizing allocation_state_transition_time=2020-08-14 00:41:38.198513+00:00]
2020-08-14 00:42:21.262 DEBUG - waiting for 1 dedicated nodes and 0 low priority nodes of size standard_nc6 to reach desired state in pool pytorch-gpu [resize_timeout=0:15:00 allocation_state=resizing allocation_state_transition_time=2020-08-14 00:41:38.198513+00:00]
2020-08-14 00:42:42.683 DEBUG - waiting for 1 dedicated nodes and 0 low priority nodes of size standard_nc6 to reach desired state in pool pytorch-gpu [resize_timeout=0:15:00 allocation_state=resizing allocation_state_transition_time=2020-08-14 00:41:38.198513+00:00]
2020-08-14 00:43:04.160 DEBUG - waiting for 1 dedicated nodes and 0 low priority nodes of size standard_nc6 to reach desired state in pool pytorch-gpu [resize_timeout=0:15:00 allocation_state=steady allocation_state_transition_time=2020-08-14 00:42:45.162551+00:00]
2020-08-14 00:43:04.161 DEBUG - tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d: starting
2020-08-14 00:43:25.670 DEBUG - waiting for 1 dedicated nodes and 0 low priority nodes of size standard_nc6 to reach desired state in pool pytorch-gpu [resize_timeout=0:15:00 allocation_state=steady allocation_state_transition_time=2020-08-14 00:42:45.162551+00:00]
2020-08-14 00:43:25.670 DEBUG - tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d: waitingforstarttask
2020-08-14 00:43:47.182 DEBUG - waiting for 1 dedicated nodes and 0 low priority nodes of size standard_nc6 to reach desired state in pool pytorch-gpu [resize_timeout=0:15:00 allocation_state=steady allocation_state_transition_time=2020-08-14 00:42:45.162551+00:00]
2020-08-14 00:43:47.183 DEBUG - tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d: waitingforstarttask
2020-08-14 00:44:08.719 DEBUG - waiting for 1 dedicated nodes and 0 low priority nodes of size standard_nc6 to reach desired state in pool pytorch-gpu [resize_timeout=0:15:00 allocation_state=steady allocation_state_transition_time=2020-08-14 00:42:45.162551+00:00]
2020-08-14 00:44:08.720 DEBUG - tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d: waitingforstarttask
2020-08-14 00:44:24.104 DEBUG - listing nodes in start task failed state
2020-08-14 00:44:24.105 INFO - compute nodes for pool pytorch-gpu (filters: start_task_failed=True unusable=False)
* node id: tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d
  * state: starttaskfailed @ 2020-08-14 00:44:21.111448+00:00
  * allocation time: 2020-08-14 00:42:44.926558+00:00
  * last boot time: 2020-08-14 00:43:04.203110+00:00
  * scheduling state: enabled
  * agent:
    * version: 1.8.3
    * last update time: 2020-08-14 00:43:04.203110+00:00
  * no errors
  * start task:
    * failure info: usererror
      * FailureExitCode: The task exited with an exit code representing a failure
        * Message: The task process exited with an unexpected exit code
        * AdditionalErrorCode: FailureExitCode
  * vm size: standard_nc6
  * dedicated: True
  * ip address: 10.0.0.4
  * running tasks: 0
  * total tasks run: 0
  * total tasks succeeded: 0
2020-08-14 00:44:24.105 ERROR - Detected start task failure, attempting to retrieve files for error diagnosis from nodes
2020-08-14 00:44:24.146 DEBUG - downloading files to pytorch-gpu/tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d
2020-08-14 00:44:25.077 INFO - all files retrieved from pool=pytorch-gpu node=tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d include=startup/std*.txt
2020-08-14 00:44:25.077 DEBUG - downloading files to pytorch-gpu/tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d
2020-08-14 00:44:25.127 ERROR - no files found for pool pytorch-gpu node tvmps_ae2ee2771a9722b56f8e3ba7be50d22c40a16f3026653a69d0e40a514fb1a1aa_d include=startup/wd/cascade*.log
Traceback (most recent call last):
  File "/opt/batch-shipyard/shipyard.py", line 3136, in <module>
    cli()
  File "/usr/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/batch-shipyard/shipyard.py", line 1546, in pool_add
    ctx.table_client, ctx.keyvault_client, ctx.config, recreate, no_wait)
  File "/opt/batch-shipyard/convoy/fleet.py", line 3451, in action_pool_add
    batch_client, blob_client, keyvault_client, config, no_wait
  File "/opt/batch-shipyard/convoy/fleet.py", line 1876, in _add_pool
    nodes = batch.create_pool(batch_client, blob_client, config, pool, no_wait)
  File "/opt/batch-shipyard/convoy/batch.py", line 959, in create_pool
    return wait_for_pool_ready(batch_client, blob_client, config, pool.id)
  File "/opt/batch-shipyard/convoy/batch.py", line 893, in wait_for_pool_ready
    pool_id)
  File "/opt/batch-shipyard/convoy/batch.py", line 748, in _block_for_nodes_ready
    'prior to the resize operation.').format(pool.id))
RuntimeError: Please inspect both the node status above and files found within the pytorch-gpu/<nodes>/startup directory (in the current working directory) if available. If this error appears non-transient, please submit an issue on GitHub, if not you can delete these nodes with "pool nodes del --all-start-task-failed" first prior to the resize operation.

startup/stderr.txt

Warning: apt-key output should not be parsed (stdout is not a terminal)
Synchronizing state of docker.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable docker
WARNING: API is accessible on http://127.0.0.1:2375 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
WARNING: No swap limit support
rmmod: ERROR: Module nouveau is not currently loaded

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.

WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries. Your system may not be set up for 32-bit compatibility. 32-bit compatibility files will not be installed; if you wish to install them, re-run the installation and set a valid directory with the --compat32-libdir option.

ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 418.87.01 -k 5.3.0-1034-azure`: 
Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j6 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.3.0-1034-azure IGNORE_CC_MISMATCH='' modules.....(bad exit status: 2)
ERROR (dkms apport): binary package for nvidia: 418.87.01 not found
Error! Bad return status for module build on kernel: 5.3.0-1034-azure (x86_64)
Consult /var/lib/dkms/nvidia/418.87.01/build/make.log for more information.

ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.

ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
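The DKMS errors above reference build and installer logs on the node itself. If SSH remote login is configured for the pool, they can be inspected roughly as follows (the pool ssh subcommand is assumed from the documented CLI; the file paths are taken verbatim from the errors above):

# log in to the failed node (assumes SSH remote login is enabled in the pool configuration)
/opt/batch-shipyard/shipyard.py pool ssh

# on the node, inspect the DKMS build log and installer log referenced by the errors
sudo tail -n 100 /var/lib/dkms/nvidia/418.87.01/build/make.log
sudo tail -n 100 /var/log/nvidia-installer.log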

Additional Comments

A similar issue appears to be marked closed: https://github.com/Azure/batch-shipyard/issues/348. Note that no modifications were made to the provided sample recipe beyond the connection details.

alfpark commented 4 years ago

Can you please post your pool configuration file?

alfpark commented 3 years ago

Apologies for the delay. This was due to an out-of-date driver. A fix will be applied in the next release.

fayora commented 3 years ago

Hi @alfpark, I have a customer in production hitting this issue right now! They are a public health lab and they use the results for organ transplants, so this is truly a matter of life and death. Is there a workaround while the fix is put in place?

alfpark commented 3 years ago

You can always override the default GPU driver via pool configuration options. As a workaround, you can temporarily modify your pool.yaml file to contain the following:

  gpu:
    nvidia_driver:
      source: "https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run"
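
For context, this gpu block sits under pool_specification alongside the existing settings (a sketch of the placement, assuming the standard pool configuration layout; consult the pool configuration docs for the exact schema):

pool_specification:
  # ... existing settings such as id, vm_size, vm_count ...
  gpu:
    nvidia_driver:
      source: "https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run"

After updating the file, any failed nodes can be removed with the "pool nodes del --all-start-task-failed" command suggested in the error output before resizing or re-adding the pool.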