Azure / hpcpack

The repo to track public issues for Microsoft HPC Pack product.

New install fails to start services due to Win32Exception in Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint #52

Open crowbar27 opened 1 month ago

crowbar27 commented 1 month ago

Problem Description

I was upgrading an on-premises cluster from HPC Pack 2016 to 2019. The installation completed successfully, but I cannot connect with HPC Cluster Manager because the HPC services are constantly crashing.

Steps to Reproduce

Expected Results

The connection succeeds.

Actual Results

The connection times out because the HPC services are constantly restarting. Most importantly, the scheduler apparently cannot start, which causes all other services to fail, too. In the event log, I find entries like

The HPC Diagnostics Service service terminated unexpectedly. It has done this 118 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

Note that the diagnostics service is just an example; there are similar entries for other services, including the scheduler.

Immediately before that, it logs event 1000 with details:

Faulting application name: HpcDiagnostics.exe, version: 6.2.7756.0, time stamp: 0x65133346
Faulting module name: KERNELBASE.dll, version: 10.0.14393.7426, time stamp: 0x66f60177
Exception code: 0xe0434352
Fault offset: 0x0000000000026ea8
Faulting process id: 0x2354
Faulting application start time: 0x01db2181b5a6aaae
Faulting application path: C:\Program Files\Microsoft HPC Pack 2019\Bin\HpcDiagnostics.exe
Faulting module path: C:\windows\System32\KERNELBASE.dll
Report Id: cc5f5cc1-938b-4b0c-949a-e806b4a8cc6f
Faulting package full name:
Faulting package-relative application ID:

and before that I get event 1026 from the .NET Runtime:

Application: HpcDiagnostics.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ComponentModel.Win32Exception
   at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(System.String, System.String)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+d23.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+d18.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticsStore+d27.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.DiagnosticsSvc+d8.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<b2_1>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.MembershipDisabled+d0.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<>cDisplayClass2_0+<b0>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

Some parts of the setup do seem to work, though, because I can see changes being made to the database. Most notably, for the built-in HA install, data was written to HPCHAWitness.

Additional Logs

Using hpctrace, I found that the scheduler is in a loop of:

17:45:04.933 i HpcScheduler 8428 5548 The HPC job scheduler started.
10/15/2024 17:45:05.120 e HpcScheduler 8428 5548 [[ServiceCore].StartSvc] .Exception detail: System.ComponentModel.Win32Exception (0x80004005): Failed to set registry checkpoint on service (error: 0).. at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(String serviceName, String RegPath).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d29.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d24.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode, Boolean schedulerOnAzure, String clusterName, String sqlString, String builtInAdmin, String builtInAdminPass, Func2 azureUserPasswordDecryptor, Func2 azureUserPasswordEncryptor).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d19.MoveNext().Current stack: at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs).. at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args).. at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception e, String message, Object[] args).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d19.MoveNext().. at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[TStateMachine](TStateMachine& stateMachine).. at Microsoft.Hpc.Scheduler.SchedulerSvc.StartSvc(IHpcContext context).. at Microsoft.Hpc.Scheduler.SchedulerService.<b5_1>d.MoveNext().. at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[TStateMachine](TStateMachine& stateMachine).. at Microsoft.Hpc.Scheduler.SchedulerService.b__5_1().. at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness.<>cDisplayClass45_0.b0(Object _).. at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem().. at System.Threading.ThreadPoolWorkQueue.Dispatch()..
10/15/2024 17:45:05.136 e HpcTrace 8428 5548 Current Application Domain UnhandledException event invoked: System.ComponentModel.Win32Exception (0x80004005): Failed to set registry checkpoint on service (error: 0).. at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(String serviceName, String RegPath).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d29.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerCrypto.d24.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode, Boolean schedulerOnAzure, String clusterName, String sqlString, String builtInAdmin, String builtInAdminPass, Func2 azureUserPasswordDecryptor, Func2 azureUserPasswordEncryptor).. at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean restoreMode).. at Microsoft.Hpc.Scheduler.SchedulerSvc.d19.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.SchedulerService.<b5_1>d.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness.<>cDisplayClass45_0.b_0(Object ).. at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx).. at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem().. at System.Threading.ThreadPoolWorkQueue.Dispatch()
10/15/2024 17:45:05.152 i HpcTrace 8428 5548 Cosmos Logger is being closed
10/15/2024 17:46:38.471 w Microsoft.Hpc.HighAvailablity.Algorithm 12556 15320 [2024-10-15T17:46:38.3615810Z][Protocol][06018788-c18f-4aeb-a708-7d8e85d51f2e] Primary down
10/15/2024 17:46:40.549 i HpcScheduler.exe 12556 10052 [GetCertificateValidationCallback] Bypass certificate CN validation.
10/15/2024 17:46:40.549 i HpcScheduler.exe 12556 10052 [GetCertificateValidationCallback] Bypass certificate CN validation.

Additional Comments

As the call stack contains some crypto code, I first suspected an issue with the certificate, but the failure occurs both with a certificate from our AD-integrated CA and with one created using the script provided with the installer. Furthermore, GetKeyAndSalt rather suggests an issue with a symmetric encryption algorithm, but I am not aware of any setting I could change in this area.

crowbar27 commented 1 month ago

I was able to attach a remote debugger to the scheduler before it crashed, and it seems that some obsolete code for FCM is running here:

[Obsolete("HAUtils is a utility class for failover cluster, do not use it anymore")]
[PermissionSet(SecurityAction.Demand, Name = "FullTrust")]
public class HAUtils
{
    // ...

    public unsafe static void SetGenericServiceRegistryCheckpoint(string serviceName, string RegPath)
    {
        //IL_0014: Expected I8, but got I
        //IL_0026: Expected I8, but got I
        //IL_0048: Expected I, but got I8
        //IL_005c: Expected I, but got I8
        //IL_00d5: Expected I, but got I8
        //IL_0147: Expected I, but got I8
        //IL_0164: Expected I, but got I8
        // ...

Is there any way to prevent this?

crowbar27 commented 1 month ago

My understanding from the disassembly is that the following method

    public async Task InitKeyAndSalt()
    {
        if (useCache && encryptKey != null)
        {
            return;
        }
        CancellationToken token = HpcContext.Get().CancellationToken;
        string text = await HpcContext.Get().Registry.GetValueAsync<string>("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", keyLocation, token);
        if (string.IsNullOrEmpty(text))
        {
            using (CreateEncryptor())
            {
            }
            await HpcContext.Get().Registry.SetValueAsync("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", keyLocation, Convert.ToBase64String(encryptKey), token);
            await HpcContext.Get().Registry.SetValueAsync("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", initVectorLocation, Convert.ToBase64String(initVector), token);
        }
        else
        {
            encryptKey = Convert.FromBase64String(text);
            initVector = Convert.FromBase64String(await HpcContext.Get().Registry.GetValueAsync<string>("HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security", initVectorLocation, token));
        }
        if (encryptKey == null)
        {
            throw new InvalidOperationException();
        }
        if (!HAUtils.IsHeadNodeHAClustered())
        {
            return;
        }
        try
        {
            HAUtils.SetGenericServiceRegistryCheckpoint("HpcScheduler", "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\HpcScheduler\\Security");
        }
        catch (ApplicationException ex)
        {
            if (((Win32Exception)ex.InnerException).NativeErrorCode != -2147024713)
            {
                throw;
            }
        }
    }

prepares a cryptographic key that is stored in the registry and, provided the head node is part of an FCM cluster, sets a registry checkpoint to make sure that all clustered nodes have the same registry data. However, this fails because HPC Pack does not use FCM anymore.
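
One detail in the decompiled method may be worth checking. The catch clause handles ApplicationException, while the trace shows a bare System.ComponentModel.Win32Exception, which derives from ExternalException and not from ApplicationException. The following is a minimal sketch of why that catch block would not apply, assuming the real call throws what the log shows. It also decodes the constant -2147024713, which is HRESULT 0x800700B7, i.e. Win32 error 183 (ERROR_ALREADY_EXISTS). The class CheckpointCatchSketch and all of its methods are hypothetical and made up for illustration; only the constant, the message text and the exception types come from the listing and logs above.

```csharp
using System;
using System.ComponentModel;

public static class CheckpointCatchSketch
{
    // The magic number from the decompiled catch block.
    public const int ExpectedHResult = -2147024713;

    // The low 16 bits of an HRESULT carry the error code.
    public static int Win32CodeOf(int hresult)
    {
        return hresult & 0xFFFF;
    }

    // Bits 16..26 carry the facility; 7 is FACILITY_WIN32.
    public static int FacilityOf(int hresult)
    {
        return (hresult >> 16) & 0x7FF;
    }

    // Hypothetical stand-in for HAUtils.SetGenericServiceRegistryCheckpoint:
    // it throws a bare Win32Exception with native error 0, as the trace suggests.
    public static void ThrowLikeTheLog()
    {
        throw new Win32Exception(0, "Failed to set registry checkpoint on service (error: 0).");
    }

    // Returns true only if a catch (ApplicationException) clause handles the exception.
    public static bool IsCaughtAsApplicationException()
    {
        try
        {
            try
            {
                ThrowLikeTheLog();
            }
            catch (ApplicationException)
            {
                return true;
            }
        }
        catch (Win32Exception)
        {
            // The inner clause did not match, so the exception propagated to here.
            return false;
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine($"0x{ExpectedHResult:X8}");              // 0x800700B7
        Console.WriteLine(FacilityOf(ExpectedHResult));            // 7
        Console.WriteLine(Win32CodeOf(ExpectedHResult));           // 183
        Console.WriteLine(IsCaughtAsApplicationException());       // False
    }
}
```

If the real method does throw a bare Win32Exception, the existing catch block never gets to inspect the error code, which would be consistent with the unhandled exception that terminates the services.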