Open matt-chan opened 2 years ago
I've seen this happen sometimes. What is the username value? Make sure there is not already a home directory with that name. Try to sudo to this user from the ondemand machine.
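The checks above can be sketched as a small shell helper. This is only a sketch: `check_user` and the `HOME_ROOT` variable are hypothetical names, and /anfhome is assumed to be the home directory root, as mentioned elsewhere in this issue.

```shell
# Hypothetical helper, not part of the playbooks.
# HOME_ROOT is an assumed variable; it defaults to /anfhome.
check_user() {
  cu_user="$1"
  cu_root="${HOME_ROOT:-/anfhome}"

  # 1. Can the system (through sssd/NSS) resolve the user at all?
  if id -u "$cu_user" >/dev/null 2>&1; then
    echo "resolved: $cu_user"
  else
    echo "not resolved: $cu_user"
  fi

  # 2. Is there already a home directory with that name?
  if [ -d "$cu_root/$cu_user" ]; then
    echo "home exists: $cu_root/$cu_user"
  else
    echo "home missing: $cu_root/$cu_user"
  fi
}
```

On the ondemand machine, running `check_user <username>` and then `sudo -u <username> -i true` should show whether the user resolves and whether a login shell can be opened for it.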
Hi Xavier,
One of the username values is matttest. The other has a hyphen in it, but both are broken.
Thanks for the pointers with debugging. I think I found the source of the errors. I upgraded the AD from Windows Server 2016 to 2019 during the last upgrade. I found this message in /var/log/messages on ood:
ondemand sssd[ldap_child[18463]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Client 'ONDEMAND$@HPC.AZURE' not found in Kerberos database. Unable to create GSSAPI-encrypted LDAP connection.
Do you know how to reestablish the connection with AD after an upgrade? I even tried the "rejoin domain" task that was commented out in the domain_join playbook.
$ echo "<domain password>" | realm join -v -U hpcadmin hpc.azure
* Resolving: _ldap._tcp.hpc.azure
* Performing LDAP DSE lookup on: 10.2.3.4
* Successfully discovered: hpc.azure
Password for hpcadmin:
* Required files: /usr/sbin/oddjobd, /usr/libexec/oddjob/mkhomedir, /usr/sbin/sssd, /usr/bin/net
* LANG=C LOGNAME=root /usr/bin/net -s /var/cache/realmd/realmd-smb-conf.YENRU1 -U hpcadmin ads join hpc.azure
Enter hpcadmin's password:DNS update failed: NT_STATUS_UNSUCCESSFUL
Using short domain name -- HPC
Joined 'ONDEMAND' to dns domain 'hpc.azure'
DNS Update for ondemand.<url>.jx.internal.cloudapp.net failed: ERROR_DNS_UPDATE_FAILED
* LANG=C LOGNAME=root /usr/bin/net -s /var/cache/realmd/realmd-smb-conf.YENRU1 -U hpcadmin ads keytab create
Enter hpcadmin's password:
* /usr/bin/systemctl enable sssd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/sssd.service to /usr/lib/systemd/system/sssd.service.
* /usr/bin/systemctl restart sssd.service
* /usr/bin/sh -c /usr/sbin/authconfig --update --enablesssd --enablesssdauth --enablemkhomedir --nostart && /usr/bin/systemctl enable oddjobd.service && /usr/bin/systemctl start oddjobd.service
* Successfully enrolled machine in realm
SSSD still reports an error during joining though: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.
I wonder if the key was rotated on AD during the upgrade and now all the clients are left behind?
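One way to check this outside of sssd might be to list the keytab and try to authenticate as the machine account directly. The sketch below assumes the MIT Kerberos client tools (`klist`, `kinit`) are installed and that the AD computer account is the upper-cased short hostname followed by `$`, which matches the `ONDEMAND$@HPC.AZURE` principal in the log message. `machine_principal` and `check_keytab` are hypothetical helper names.

```shell
# Hypothetical helpers; the principal format is an assumption based on the
# 'ONDEMAND$@HPC.AZURE' client name reported by sssd.

# Build the machine principal from a hostname and a realm,
# e.g. "ondemand" + "hpc.azure" -> ONDEMAND$@HPC.AZURE
machine_principal() {
  mp_short="${1%%.*}"   # strip any DNS suffix from the hostname
  printf '%s$@%s\n' "$mp_short" "$2" | tr '[:lower:]' '[:upper:]'
}

# List the keytab entries (with key version numbers and timestamps), then try
# to get a ticket with the machine key. Run as root; note that kinit writes
# the ticket into the caller's default credential cache.
check_keytab() {
  ck_keytab="${3:-/etc/krb5.keytab}"
  ck_principal="$(machine_principal "$1" "$2")"
  klist -kt "$ck_keytab"
  kinit -k -t "$ck_keytab" "$ck_principal"
}
```

For example, `check_keytab "$(hostname)" hpc.azure` run as root on ondemand. If `kinit` also fails with a "not found in Kerberos database" error, the problem is with the machine account itself rather than with the sssd configuration.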
Thanks! Matt
OK, that explains why. Yeah, rejoining the domain may work, but it probably needs manual tasks. The best option would be to rebuild all domain-joined VMs from scratch after such an upgrade.
Hi @xpillons , would you be able to point us to the manual tasks we need to follow to properly rejoin the domain?
You have to leave the domain, and then rejoin it. Leaving the domain is documented in the domain_join role, where it is commented out. You may run these commands manually on the VMs ondemand, jumpbox, scheduler and grafana.
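As a rough sketch of that sequence, assuming the `realm` CLI from realmd and the same admin user and domain as in the join output above. The exact commands in the domain_join role may differ. `run` and `rejoin_domain` are hypothetical helper names, and `DRY_RUN` is an assumed variable that defaults to only printing the commands.

```shell
# Hypothetical helpers. With DRY_RUN=1 (the default) commands are only printed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Leave the realm, then join it again. realm join prompts for the password.
rejoin_domain() {
  rj_domain="$1"
  rj_admin="$2"
  run realm leave -v "$rj_domain"
  run realm join -v -U "$rj_admin" "$rj_domain"
}
```

Running `rejoin_domain hpc.azure hpcadmin` prints the two commands. Setting `DRY_RUN=0` and running it as root on each of the VMs listed above would execute them.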
Hi Xavier,
I got it fixed. Thanks for the hint. Basically the issue was that when you rejoin the domain, it resets sssd.conf. So the real fix is to modify the playbook and re-run the whole thing, not simply copying that one step.
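For anyone who wants to see what the rejoin changed before re-running the playbook, a minimal sketch is to keep a copy of sssd.conf and diff it afterwards. `backup_conf` and `show_conf_changes` are hypothetical helper names, the `.before-rejoin` suffix is arbitrary, and /etc/sssd/sssd.conf is assumed to be the config path.

```shell
# Hypothetical helpers; /etc/sssd/sssd.conf is the assumed default path.

# Copy the config next to itself and print the backup path.
backup_conf() {
  bc_conf="${1:-/etc/sssd/sssd.conf}"
  cp -p "$bc_conf" "$bc_conf.before-rejoin" && echo "$bc_conf.before-rejoin"
}

# Show what changed since the backup. Exit status is diff's:
# 0 if identical, 1 if the file was modified.
show_conf_changes() {
  sc_conf="${1:-/etc/sssd/sssd.conf}"
  diff -u "$sc_conf.before-rejoin" "$sc_conf"
}
```

Run `backup_conf` as root before rejoining and `show_conf_changes` afterwards. Any lines the rejoin dropped are the ones the playbook would need to put back.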
Would it make sense to make 'rejoin domain' the default step in the domain_join playbook, or would that stress the AD server too much? Admittedly this is a rare case. We only ran into this after upgrading the AD server...
Thanks! Matt
Ah also, the error "tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database." is a red herring. This error exists even on a machine with a good setup.
@matt-chan adding rejoin may be a good thing; it won't stress AD, as this is only a few machines. I will have to test the various cases to make sure the playbook succeeds.
In what area(s)?
Expected Behavior
When adding a new user, it should proceed without error.
Actual Behavior
In the step "Init User context for", ansible fails with:
AD shows the user has been added. /anfhome does not show the user's home directory.
Steps to Reproduce the Problem
To reproduce this, add a new user to an existing cluster. Currently running bleeding edge from main branch: 1b4ddf6. Cluster was upgraded several times.