NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License

GPU is disassociating after running a playbook #1246

Closed. georgettica closed this issue 1 year ago

georgettica commented 1 year ago

I ran a playbook with

  ---
  - hosts: all
    gather_facts: False
    tasks:
    - name: Run Role
      ansible.builtin.include_role: ...

Some of the containers on a node (running via Slurm) are getting an NVML error.

I first suspected the files under /etc/ansible/facts.d/*, but that does not appear to be the cause.

The role I am running copies files to the host and then runs systemctl restart <SVC> && systemctl enable <SVC>, all from a playbook.
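
For readers trying to reproduce this, a role of that shape might look roughly like the sketch below. This is only an illustration and not the actual role: the service name my_service, the file paths, and the use of ansible.builtin.systemd (rather than calling systemctl through a command task) are all assumptions.

```yaml
---
# Hypothetical tasks/main.yml for a role of the kind described above.
# "my_service" and both paths are placeholders, not the real values.
- name: Copy the unit file to the host
  ansible.builtin.copy:
    src: my_service.service
    dest: /etc/systemd/system/my_service.service
    owner: root
    group: root
    mode: "0644"

# Rough equivalent of: systemctl restart <SVC> && systemctl enable <SVC>
# daemon_reload is an assumption; it is included only because a newly
# copied unit file is not picked up by systemd without a reload.
- name: Restart and enable the service
  ansible.builtin.systemd:
    name: my_service
    state: restarted
    enabled: true
    daemon_reload: true
```

Whether the real role reloads the systemd daemon or only restarts the unit is not stated here, so treat that line as a guess.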

If you would like additional logs, I would gladly sanitize them and send them to you.

ajdecon commented 1 year ago

Yeah, it's pretty tricky to understand what's going on here from just what you've included. It would help to know at least what service(s) you are restarting, and how you are running your containers (e.g., Enroot? Singularity? Docker?)

georgettica commented 1 year ago

Yeah, so to answer your questions:

The services are ufw and a custom one.

I am running DeepOps with the Slurm option, and inside that I am spinning up a Docker container.

The service itself:

[Unit]
Description=XXXX
Wants=network-online.target
After=netfilter-persistent.service

[Service]

ExecStart=sudo XXXX
ExecStop=pkill -9 XXX

[Install]
WantedBy=multi-user.target

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.