Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

Adding new user fails with 'chmod: invalid mode' #1152

Open matt-chan opened 2 years ago

matt-chan commented 2 years ago

In what area(s)?

/area administration /area ansible /area autoscaling /area configuration /area cyclecloud /area documentation /area image /area job-scheduling /area monitoring /area ood /area remote-visualization /area user-management

Expected Behavior

When adding a new user, it should proceed without error.

Actual Behavior

In the step "Init User context for ", ansible fails with:

fatal: [ondemand]: FAILED! => {"msg": "Failed to set permissions on the temporary files Ansible needs to create when becoming an unprivileged user (rc: 1, err: chmod: invalid mode: 'A+user:<username>:rx:allow'\nTry 'chmod --help' for more information.\n}). For information on working around this, see https://docs.ansible.com/ansible-core/2.13/user_guide/become.html#risks-of-becoming-an-unprivileged-user"}

AD shows the user has been added. /anfhome does not show the user's home directory.

Steps to Reproduce the Problem

To reproduce this, add a new user to an existing cluster. Currently running bleeding edge from main branch: 1b4ddf6. Cluster was upgraded several times.

xpillons commented 2 years ago

I've seen this happen occasionally. What is the username value? Make sure there isn't already a home directory with that name, and try to sudo to this user from the ondemand machine.

matt-chan commented 2 years ago

Hi Xavier,

One of the username values is matttest. The other has a hyphen in it, but both are broken.

Thanks for the debugging pointers. I think I found the source of the errors. I upgraded the AD server from Windows Server 2016 to 2019 during the last upgrade. I found this message in /var/log/messages on ood: ondemand sssd[ldap_child[18463]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Client 'ONDEMAND$@HPC.AZURE' not found in Kerberos database. Unable to create GSSAPI-encrypted LDAP connection.
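For anyone checking whether another domain-joined VM is affected, a minimal sketch of a log check follows. The function name `count_keytab_failures` is hypothetical, the search string is copied from the message quoted above, and the default path assumes the same /var/log/messages location.

```shell
#!/bin/sh
# Hypothetical helper (not part of az-hop): count SSSD keytab failures
# in a syslog-style file. The default path is the one mentioned in this thread.
count_keytab_failures() {
    logfile="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 if none match).
    grep -c "Failed to initialize credentials using keytab" "$logfile"
}
```

Running `count_keytab_failures` as root on each VM and seeing a non-zero count would suggest that the machine account is no longer recognized by AD.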

Do you know how to reestablish the connection with AD after an upgrade? I even tried the "rejoin domain" task that was commented out in the domain_join playbook.

$ echo "<domain password>" | realm join -v -U hpcadmin hpc.azure
 * Resolving: _ldap._tcp.hpc.azure
 * Performing LDAP DSE lookup on: 10.2.3.4
 * Successfully discovered: hpc.azure
Password for hpcadmin:
 * Required files: /usr/sbin/oddjobd, /usr/libexec/oddjob/mkhomedir, /usr/sbin/sssd, /usr/bin/net
 * LANG=C LOGNAME=root /usr/bin/net -s /var/cache/realmd/realmd-smb-conf.YENRU1 -U hpcadmin ads join hpc.azure
Enter hpcadmin's password:DNS update failed: NT_STATUS_UNSUCCESSFUL

Using short domain name -- HPC
Joined 'ONDEMAND' to dns domain 'hpc.azure'
DNS Update for ondemand.<url>.jx.internal.cloudapp.net failed: ERROR_DNS_UPDATE_FAILED
 * LANG=C LOGNAME=root /usr/bin/net -s /var/cache/realmd/realmd-smb-conf.YENRU1 -U hpcadmin ads keytab create
Enter hpcadmin's password:
 * /usr/bin/systemctl enable sssd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/sssd.service to /usr/lib/systemd/system/sssd.service.
 * /usr/bin/systemctl restart sssd.service
 * /usr/bin/sh -c /usr/sbin/authconfig --update --enablesssd --enablesssdauth --enablemkhomedir --nostart && /usr/bin/systemctl enable oddjobd.service && /usr/bin/systemctl start oddjobd.service
 * Successfully enrolled machine in realm

SSSD still reports an error during joining though: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.

I wonder if the key was rotated on AD during the upgrade and now all the clients are left behind?

Thanks! Matt

xpillons commented 2 years ago

OK, that explains it. Rejoining the domain may work, but it will probably need some manual tasks. The best option would be to rebuild all domain-joined VMs from scratch after such an upgrade.

ltalirz commented 2 years ago

Hi @xpillons , would you be able to point us to the manual tasks we need to follow to properly rejoin the domain?

xpillons commented 2 years ago

You have to leave the domain and then rejoin it. The tasks for leaving the domain are in the domain_join role, commented out. You can run these commands manually on the ondemand, jumpbox, scheduler and grafana VMs.
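As a rough sketch of what a manual leave and rejoin could look like on one VM: the exact commands in the domain_join role are not shown in this thread, so the sequence below is an assumption built from the standard `realm leave` and `realm join` commands and the domain and admin names in the output above. The `run` wrapper and `DRY_RUN` variable are hypothetical and only print the commands by default. Note that a later comment in this thread reports that rejoining resets sssd.conf, so this alone may not be sufficient.

```shell
#!/bin/sh
# Sketch only, not the az-hop playbook. Values come from the log output in this thread.
DOMAIN="${DOMAIN:-hpc.azure}"
ADMIN_USER="${ADMIN_USER:-hpcadmin}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rejoin_domain() {
    run realm leave "$DOMAIN"
    # realm join prompts for the admin password when run for real.
    run realm join -v -U "$ADMIN_USER" "$DOMAIN"
    run systemctl restart sssd
}
```

Calling `rejoin_domain` with `DRY_RUN=1` lists the three commands without touching the machine; setting `DRY_RUN=0` and running as root would execute them.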

matt-chan commented 2 years ago

Hi Xavier,

I got it fixed. Thanks for the hint. Basically the issue was that when you rejoin the domain, it resets sssd.conf. So the real fix is to modify the playbook and re-run the whole thing, not simply copying that one step.
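One way to confirm that a rejoin has overwritten sssd.conf is to keep a copy before rejoining and compare afterwards. This is a minimal sketch; the function name `sssd_conf_changed` and the backup path in the usage note are hypothetical, and /etc/sssd/sssd.conf is assumed to be the config location.

```shell
#!/bin/sh
# Hypothetical helper: succeed (and print a unified diff) if the current
# sssd.conf differs from a backup taken earlier; fail if they are identical.
sssd_conf_changed() {
    backup="$1"
    current="${2:-/etc/sssd/sssd.conf}"
    if cmp -s "$backup" "$current"; then
        return 1
    fi
    # diff exits non-zero when files differ, so ignore its exit status.
    diff -u "$backup" "$current" || true
    return 0
}
```

For example, copy the file to /root/sssd.conf.bak before rejoining, then run `sssd_conf_changed /root/sssd.conf.bak` afterwards. Any diff output shows the settings that the playbook would need to reapply.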

Would it make sense to make 'rejoin domain' the default step in the domain_join playbook, or would that stress the AD server too much? Admittedly this is a rare case. We only ran into this after upgrading the AD server...

Thanks! Matt

matt-chan commented 2 years ago

Ah also, tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database. is a red herring. This error exists even on a machine with a good setup.

xpillons commented 2 years ago

@matt-chan Adding a rejoin step may be a good thing. It won't stress AD, as this is only a few machines. We will have to test all the various cases to make sure the playbook is successful.