ansible-pull-script fails on gpu nodes

fgci-org / fgci-ansible

:microscope: Collection of the Finnish Grid and Cloud Infrastructure Ansible playbooks

MIT License

ansible-pull-script fails on gpu nodes #160

Closed jabl closed 8 years ago

jabl commented 8 years ago

ansible-pull-script.sh tries to update all the packages to the latest versions, but the task

TASK [ansible-role-yum : update all the things] ****

fails because the NVIDIA CUDA repo GPG key has changed. So the playbook never gets as far as running the cuda role, which would install the new key.

One can work around it by first running something like

ansible-playbook compute.yml --limit=production,gpu1,gpu2,gpuN --tags=cuda

but it would be nicer if it would work automagically somehow.
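
One way this might be made automatic, as a minimal sketch: import the new key in a pre_tasks block, which Ansible runs before any roles, so it would be in place before the update task. The group name gpu_nodes and the variable cuda_repo_gpg_key_url are hypothetical placeholders, not names from fgci-ansible, and whether importing the key alone is enough depends on how the repo file references it.

```yaml
# Sketch only: gpu_nodes and cuda_repo_gpg_key_url are hypothetical names.
- hosts: gpu_nodes
  become: true
  pre_tasks:
    # pre_tasks run before roles, so the key is imported before any yum update
    - name: import the current NVIDIA CUDA repo GPG key
      rpm_key:
        key: "{{ cuda_repo_gpg_key_url }}"
        state: present
```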

martbhell commented 8 years ago

Yeah, the CUDA 8 packages are signed with a different key.

Should be fixed in devel with https://github.com/CSC-IT-Center-for-Science/fgci-ansible/commit/d38050abbed067d329efc6d173a62063645f5b58

Master update on Monday includes a fix to NHC that should properly drain and reboot the gpu nodes too on a library update. Keeping this issue open; please test and let us know if it solves the issue.

jabl commented 8 years ago

No, https://github.com/CSC-IT-Center-for-Science/fgci-ansible/commit/d38050abbed067d329efc6d173a62063645f5b58 doesn't fix it. In fact, when I opened this issue we were already using https://github.com/CSC-IT-Center-for-Science/fgci-ansible/commit/a78f8a8f9f95d34dfee79d4771cf474507e5479a, a couple of commits ahead of https://github.com/CSC-IT-Center-for-Science/fgci-ansible/commit/d38050abbed067d329efc6d173a62063645f5b58.

I wonder, do we need the "ansible-role-yum : update all the things" task? Can't we just rely on yum-cron to keep everything updated?

martbhell commented 8 years ago

Yes I think we can disable that task by default. We use yum-cron on all the servers in all the playbooks.
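
A minimal sketch of what disabling the task by default could look like, assuming a boolean toggle. The variable name yum_update_all is hypothetical; the actual ansible-role-yum may use a different name or mechanism.

```yaml
# Hypothetical excerpt of the role's tasks; yum_update_all is an assumed name.
- name: update all the things
  yum:
    name: "*"
    state: latest
  # skipped unless a site explicitly opts in
  when: yum_update_all | default(false) | bool
```

With a guard like this, yum-cron remains the only thing applying updates unless a site sets the toggle to true.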

martbhell commented 8 years ago

Updated with new version of yum role - please let me know if it works better now.

jabl commented 8 years ago

ansible-pull-script.sh went through this time. I guess yum-cron will later take care of updating to cuda 8 and the latest nvidia.ko kernel module.

martbhell commented 8 years ago

Great. That's my hope. It should install CUDA8 locally (it seems to not remove any cuda7.5 packages, so reinstalled nodes will only have cuda8 locally). There's also a cuda in the modules ( /cvmfs/fgi.csc.fi/devel/cuda/cudatoolkit-7.5.18/bin ).

Less than 5 minutes later NHC will run and should hopefully notice that nvidia-smi reports a driver/module mismatch, then mark the node for "reboot" and set it to draining in slurm.

jabl commented 8 years ago

Ugh, it doesn't work entirely, actually. Our (the default?) yum-cron config has for yum-cron-hourly:

update_cmd = security
apply_updates = True

and for yum-cron.conf:

update_cmd = default
apply_updates = False

so as cuda 8 is not marked as a security update of cuda 7.5, it never gets automatically installed.

Edit: in group_vars, we should probably set daily_apply_updates to True.

martbhell commented 8 years ago

Yeah, I guess the update would be fetched during the daily run tonight, maybe you get an e-mail about it and then have the possibility to install the update or not.

FGCI recommendation has been to run with auto-update. One way to accomplish that for hourly updates is to set update_cmd to default on yum-cron-hourly, like we have in the examples:

hourly_update_level: "default"

The settings are, however, somewhat confusing. I would lean towards having autoupdate for default on daily, but only for security on hourly. One reason the settings are the way they are is the many reports of e-mail spam from the sites when yum-cron only fetched updates and did not apply them automatically.
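
The preference described above (auto-apply on daily, security-only on hourly) could be written in group_vars roughly as below. This is a sketch using only the two variable names mentioned in this thread, daily_apply_updates and hourly_update_level; the file path is an assumption, and the role may expect other variables or values.

```yaml
# group_vars/all (path is an assumption)
# daily run: apply the fetched updates automatically
daily_apply_updates: True
# hourly run: restrict to security updates
hourly_update_level: "security"
```

With settings along these lines, a non-security upgrade such as cuda 7.5 to cuda 8 would be picked up by the daily run rather than the hourly one.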

In general there are opinions for both having auto-update on and having it disabled.