Open crowbar27 opened 1 month ago
I was able to attach a remote debugger to the scheduler before it crashed, and it seems that some obsolete code for FCM is running here:
[Obsolete("HAUtils is a utility class for failover cluster, do not use it anymore")]
[PermissionSet(SecurityAction.Demand, Name = "FullTrust")]
public class HAUtils
// ...
public unsafe static void SetGenericServiceRegistryCheckpoint(string serviceName, string RegPath)
{
//IL_0014: Expected I8, but got I
//IL_0026: Expected I8, but got I
//IL_0048: Expected I, but got I8
//IL_005c: Expected I, but got I8
//IL_00d5: Expected I, but got I8
//IL_0147: Expected I, but got I8
//IL_0164: Expected I, but got I8
//...
Is there any way to prevent this?
My understanding from the disassembly is that
public async Task InitKeyAndSalt()
{
if (useCache && encryptKey != null)
{
return;
}
CancellationToken token = HpcContext.Get().CancellationToken;
string text = await HpcContext.Get().Registry.GetValueAsync<string>("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", keyLocation, token);
if (string.IsNullOrEmpty(text))
{
using (CreateEncryptor())
{
}
await HpcContext.Get().Registry.SetValueAsync("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", keyLocation, Convert.ToBase64String(encryptKey), token);
await HpcContext.Get().Registry.SetValueAsync("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", initVectorLocation, Convert.ToBase64String(initVector), token);
}
else
{
encryptKey = Convert.FromBase64String(text);
initVector = Convert.FromBase64String(await HpcContext.Get().Registry.GetValueAsync<string>("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", initVectorLocation, token));
}
if (encryptKey == null)
{
throw new InvalidOperationException();
}
if (!HAUtils.IsHeadNodeHAClustered())
{
return;
}
try
{
HAUtils.SetGenericServiceRegistryCheckpoint("HpcScheduler", "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security");
}
catch (ApplicationException ex)
{
if (((Win32Exception)ex.InnerException).NativeErrorCode != -2147024713)
{
throw;
}
}
}
prepares a cryptographic key and initialisation vector that are stored in the registry and, provided the head node is in an FCM cluster, sets a registry checkpoint for the HpcScheduler service to make sure that all clustered nodes have the same registry data. However, this fails because HPC Pack does not use FCM anymore.
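For reference, the magic number in the catch block can be decoded with the standard HRESULT bit layout. The sketch below is only an illustration: the HResultDecoder class and its method names are made up for this comment and are not part of HPC Pack. It shows that -2147024713 is 0x800700B7, i.e. facility 7 (FACILITY_WIN32) with code 183 (ERROR_ALREADY_EXISTS), so the method apparently tolerates only the case where the checkpoint already exists and rethrows everything else.

```csharp
using System;

// Hypothetical helper, not part of HPC Pack: splits an HRESULT into its fields.
public static class HResultDecoder
{
    // HRESULT layout: bit 31 = severity, bits 16-26 = facility, bits 0-15 = code.
    public static bool IsFailure(int hr) => hr < 0;
    public static int Facility(int hr) => (hr >> 16) & 0x7FF;
    public static int Code(int hr) => hr & 0xFFFF;

    public static void Main()
    {
        const int hr = -2147024713; // constant compared against NativeErrorCode above

        Console.WriteLine($"0x{hr:X8}");    // prints 0x800700B7
        Console.WriteLine(IsFailure(hr));   // prints True
        Console.WriteLine(Facility(hr));    // prints 7   (FACILITY_WIN32)
        Console.WriteLine(Code(hr));        // prints 183 (ERROR_ALREADY_EXISTS)
    }
}
```

Note also that the cast of ex.InnerException in the decompiled catch block is unchecked. If the inner exception is missing or is not a Win32Exception, the handler itself would throw, which may be worth checking against the actual call stack.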
Problem Description
I upgraded an on-premises installation from HPC Pack 2016 to 2019. The installation succeeded, but I cannot connect with HPC Cluster Manager because the HPC services are constantly crashing.
Steps to Reproduce
Expected Results
The connection succeeds.
Actual Results
The connection times out because the HPC services are constantly restarting. Most importantly, it seems that the scheduler cannot start, which causes all other services to fail, too. In the event log, I find entries like
Note that the diagnostics service is just an example, there are similar entries for other services including the scheduler.
Immediately before that, it logs event 1000 with details:
and before that I get event 1026 from the .NET Runtime:
Something seems to work, though, because I can see changes being made to the database. Most notably, for the built-in HA install, data was written to HPCHAWitness.
Additional Logs
Using hpctrace, I found that the scheduler is in a loop of:
Additional Comments
As the call stack contains cryptography-related frames, I first suspected an issue with the certificate, but it works neither with one from our AD-integrated CA nor with one created using the script provided with the installer. Furthermore, GetKeyAndSalt points more towards an issue with a symmetric encryption algorithm, but I don't know of anything I can influence in this direction.