gazebo-tooling / release-tools

`nvml error: driver/library version mismatch` in Linux GPU agents #912

Closed Crola1702 closed 1 year ago

Crola1702 commented 1 year ago

Sometimes, builds on Linux GPU agents fail early because of unattended-upgrades.

Reference build: https://build.osrfoundation.org/job/ignition_gazebo-ci-gz-sim7-focal-amd64/91/console

Log output:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
Build step 'Execute shell' marked build as failure

The solution for now is rebooting the agent each time it happens (check this StackOverflow answer), but we should automate the response to this problem via:

  1. catch this error in script and send a sigil to the Jenkins post-build runner which
  2. marks the Jenkins host offline
  3. re-queues the job
  4. forces a reboot of the Jenkins agent
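
The first step above (catching the error) could be sketched as a log check. This is only a minimal illustration in Python, not the fix that was eventually merged: the pattern is copied from the log output quoted above, while the function names and the step labels returned by `plan_recovery` are hypothetical and do not correspond to any Jenkins API.

```python
import re
from typing import List

# Pattern copied from the failing build's console output quoted above.
NVML_MISMATCH = re.compile(r"nvml error: driver/library version mismatch")


def has_nvml_mismatch(console_log: str) -> bool:
    """Return True if the console log contains the NVML mismatch error."""
    return NVML_MISMATCH.search(console_log) is not None


def plan_recovery(console_log: str) -> List[str]:
    """Return the ordered recovery steps, or an empty list if the error is absent.

    The step names are hypothetical labels for steps 2-4 of the list above;
    they are placeholders, not real Jenkins calls.
    """
    if not has_nvml_mismatch(console_log):
        return []
    return ["mark_agent_offline", "requeue_job", "reboot_agent"]
```

A post-build hook could call `plan_recovery` on the console text and act only when the returned list is non-empty, so unrelated failures are left alone.
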
j-rivero commented 1 year ago

Linux GPU Agents make build fail early because of unattended-upgrades.

I'm assuming that we prefer to keep unattended-upgrades running on all packages rather than pinning/freezing the affected packages, which would avoid the problem until we decide to update the image for a good reason.

  1. catch this error in script and send a sigil to the Jenkins post-build runner which

I've been using the "Failure Cause Management" plugin (its configuration is stored in chef) for a while to show a message on the main build page when a regexp is found, but apart from this I don't think we can do much more with that plugin.

  2. marks the Jenkins host offline
  3. re-queues the job

The ros_buildfarm repo hosts Groovy scripts that could help you build a custom solution for this. For example, this one checks for a regexp in the log, and this one disables a node and logs the problem.
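
For comparison, taking a node offline can also be requested from outside Jenkins over HTTP. The sketch below only builds the request URL and sends nothing. It assumes the `computer/<name>/toggleOffline` endpoint with an `offlineMessage` parameter, as used by the python-jenkins client; the function name, the agent name and the message are hypothetical. Note that the endpoint toggles state, so calling it on a node that is already offline would bring it back online.

```python
from urllib.parse import quote


def toggle_offline_url(jenkins_url: str, agent_name: str, reason: str) -> str:
    """Build the URL that toggles a Jenkins node's offline state.

    Assumption: the server exposes computer/<name>/toggleOffline and accepts
    an offlineMessage query parameter. An authenticated POST is still needed
    to actually perform the toggle; that part is left out here.
    """
    base = jenkins_url.rstrip("/") + "/"
    # Percent-encode both values so spaces and slashes cannot break the URL.
    return "{}computer/{}/toggleOffline?offlineMessage={}".format(
        base, quote(agent_name, safe=""), quote(reason, safe="")
    )
```

For example, `toggle_offline_url("https://build.osrfoundation.org", "gpu-agent-1", "nvml mismatch")` yields a URL ending in `computer/gpu-agent-1/toggleOffline?offlineMessage=nvml%20mismatch`.
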

claraberendsen commented 1 year ago

@j-rivero I'm blocked on this task. Here are my approaches so far:

Would appreciate any insight you might have on how to do :

j-rivero commented 1 year ago

Fixed by #915