Azure / AKS-Edge

Welcome to the Azure Kubernetes Service (AKS) Edge repo.

WSS Agent failed to start with error message "Unable to read certificate" #153

Closed: sh0pyd3v closed this issue 2 months ago

sh0pyd3v commented 8 months ago

Describe scenario: I am trying to configure an offline cluster, which was also installed offline. The deployment command fails because the WSS agent cannot be started. The certificates were installed as described in the offline installation documentation.

Here are the logs of the wss agent:

Loading configuration version 'v0.10.17-alpha.1'...
Failed to get nodeagent certificate : Not Found [Store] unable to filter entity from store for type[CertificateInternal], FilterName[Name], FilterValue[NodeAgent]
github.com/microsoft/moc/pkg/errors.Wrapf
    /home/vsts/go/pkg/mod/github.com/microsoft/moc@v0.10.17-alpha.1/pkg/errors/errors.go:123
github.com/microsoft/moc-pkg/pkg/store.(ConfigStore).filterObjects
    /home/vsts/go/pkg/mod/github.com/microsoft/moc-pkg@v0.10.18-alpha.3/pkg/store/store.go:653
github.com/microsoft/moc-pkg/pkg/store.(ConfigStore).ListFilter
    /home/vsts/go/pkg/mod/github.com/microsoft/moc-pkg@v0.10.18-alpha.3/pkg/store/store.go:584
github.com/microsoft/wssdagent/pkg/nodeagent/services/security/certificate.(Client).getCertificateByName
    /home/vsts/work/1/s/pkg/nodeagent/services/security/certificate/client.go:294
github.com/microsoft/wssdagent/pkg/nodeagent/services/security/certificate.(Client).Get
    /home/vsts/work/1/s/pkg/nodeagent/services/security/certificate/client.go:183
github.com/microsoft/wssdagent/pkg/nodeagent/services/security/certificate.(CertificateProvider).Get
    /home/vsts/work/1/s/pkg/nodeagent/services/security/certificate/certificate.go:40
github.com/microsoft/wssdagent/pkg/nodeagent/services/security/certificate.(CertificateProvider).GetCertificateByName
    /home/vsts/work/1/s/pkg/nodeagent/services/security/certificate/certificate.go:75
github.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.(*Client).monitorCertificateHealth
    /home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/client.go:431
runtime.goexit
    /opt/hostedtoolcache/go/1.16.15/x64/src/runtime/asm_amd64.s:1371
panic: Unable to read certificate

goroutine 55 [running]:
github.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.(*Client).monitorCertificateHealth(0xc0002a2140)
    /home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/client.go:434 +0x374
created by github.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.NewClient
    /home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/client.go:102 +0x654

Question: Can someone help me or give me a hint about what the next steps to fix the problem would be?

Best Regards Sebastian

FooBar08 commented 8 months ago

I was running into the same issue. Are you trying to provision a cluster on a Windows machine that is not activated by any chance? After I activated Windows Server 2022 and tried an offline install again it worked.

FooBar08 commented 8 months ago

Regarding my previous comment: after some more testing, offline installation stopped working again. It seems that when the WSSD Agent generates a certificate, it tries to contact Google's DNS (8.8.8.8) for some reason. See the logging excerpt below from C:\ProgramData\wssdagent\log\agent-log-0.

{"name":"CreateCertificate","traceid":"3f56bf1e9fe862ee04236c1339404c9e","id":"e6f1ac13e060f5e0","parentid":"0000000000000000","duration":"0.21s","entity":"","Annotations":[{"Time":"2023-11-21T14:38:47.5850659+01:00","Message":"Certificate Generation Error: dial udp 8.8.8.8:80: connect: A socket operation was attempted to an unreachable network.","Attributes":{"CallerLocation":"ikv.go:57 "}}]}
{"name":"Save","traceid":"912d53466c0e880605951c2e9028a8c7","id":"d7886918c9f60df5","parentid":"0000000000000000","duration":"0.00s","entity":"","Annotations":null}
{"name":"Save","traceid":"d2e55f5f9cf72cced1a6f5c05586135a","id":"c81f261db28c2609","parentid":"0000000000000000","duration":"0.00s","entity":"","Annotations":null}
{"name":"Certificate Create NodeAgent","traceid":"27b5819bb817b5c10824e0952837a8bb","id":"132d7705259faba4","parentid":"2296ba003c099390","duration":"0.21s","entity":"","Annotations":[{"Time":"2023-11-21T14:38:47.3759748+01:00","Message":"*security.Certificate SetProvisionStatus [CREATING]","Attributes":{"CallerLocation":"internal.go:133 "}},{"Time":"2023-11-21T14:38:47.5850659+01:00","Message":"*security.Certificate SetProvisionStatus [PROVISION_FAILED]","Attributes":{"CallerLocation":"internal.go:133 "}},{"Time":"2023-11-21T14:38:47.5850659+01:00","Message":"*security.Certificate SetProvisionStatus [CREATE_FAILED]","Attributes":{"CallerLocation":"internal.go:133 "}}]}
{"name":"Bootstrap NodeAgent credentials Init...","traceid":"27b5819bb817b5c10824e0952837a8bb","id":"2296ba003c099390","parentid":"0000000000000000","duration":"0.21s","entity":"","Annotations":null}
{"name":"Updating TlsConfig","traceid":"462aa10bf2076c49f9342f7c4efebcde","id":"b9b6e2219b223f1d","parentid":"0000000000000000","duration":"0.00s","entity":"","Annotations":null}
{"name":"CredentialMonitor Init...","traceid":"055209834004f3700b9362776e62aac1","id":"7b06c9934c915dc2","parentid":"0000000000000000","duration":"0.22s","entity":"","Annotations":[{"Time":"2023-11-21T14:38:47.5850659+01:00","Message":"Error found during bootstrapNodeAgentCredentials : dial udp 8.8.8.8:80: connect: A socket operation was attempted to an unreachable network.\nCouldn't Create NodeAgent Certificate\ngithub.com/microsoft/moc/pkg/errors.Wrapf\n\t/home/vsts/go/pkg/mod/github.com/microsoft/moc@v0.10.17-alpha.1/pkg/errors/errors.go:123\ngithub.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.(*Client).bootstrapNodeAgentCredentials\n\t/home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/client.go:215\ngithub.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.NewClient\n\t/home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/client.go:82\ngithub.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.NewCredentialMonitorProvider\n\t/home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/credentialmonitor.go:20\ngithub.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.newCredentialMonitorProvider\n\t/home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/factory.go:16\nreflect.Value.call\n\t/opt/hostedtoolcache/go/1.16.15/x64/src/reflect/value.go:476\nreflect.Value.Call\n\t/opt/hostedtoolcache/go/1.16.15/x64/src/reflect/value.go:337\ngithub.com/microsoft/wssdagent/pkg/nodeagent/apis/providermanager.GetProviderWithErr\n\t/home/vsts/work/1/s/pkg/nodeagent/apis/providermanager/providermanager.go:58\ngithub.com/microsoft/wssdagent/pkg/nodeagent/apis/providermanager.GetProvider\n\t/home/vsts/work/1/s/pkg/nodeagent/apis/providermanager/providermanager.go:22\ngithub.com/microsoft/wssdagent/pkg/nodeagent/services/admin/credentialmonitor.GetCredentialMonitorProvider\n\t/home/vsts/work/1/s/pkg/nodeagent/services/admin/credentialmonitor/factory.go:12\ngithub.com/microsoft/wssdagent/pkg/nodeagent/server.CreateWssdAgentServer\n\t/home/vsts/work/1/s/pkg/nodeagent/server/nodeagent.go:85\nmain.(*myservice).Execute\n\t/home/vsts/work/1/s/cmd/nodeagent/serviceLoop_windows.go:64\ngolang.org/x/sys/windows/svc.serviceMain.func2\n\t/home/vsts/go/pkg/mod/golang.org/x/sys@v0.0.0-20211025201205-69cdffdb9359/windows/svc/service.go:222\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.16.15/x64/src/runtime/asm_amd64.s:1371","Attributes":{"CallerLocation":"client.go:84 "}},{"Time":"2023-11-21T14:38:47.5850659+01:00","Message":"No CloudAgent Certificate Found ... Starting NodeAgent Stand Alone : open : The system cannot find the file specified.","Attributes":{"CallerLocation":"client.go:106 "}}]}
{"name":"Certificate Get NodeAgent","traceid":"cf687c14974997ce320a33bd922e5f5d","id":"aa4d9f2684b85731","parentid":"0000000000000000","duration":"0.00s","entity":"","Annotations":null}
{"name":"Wssdagent Startup Span","traceid":"e2a77d03a232e83516b3851b57deaf89","id":"9be45b2b6d4e7045","parentid":"0000000000000000","duration":"0.00s","entity":"","Annotations":[{"Time":"2023-11-21T14:38:47.5850659+01:00","Message":"AgentConfiguration [&{{C:\\ProgramData\\wssdagent\\v0.10.17-alpha.1 C:\\ProgramData\\wssdagent\\v0.10.17-alpha.1 WssdAgent\\v0.10.17-alpha.1 C:\\ProgramData\\wssdagent\\log C:\\ProgramData\\wssdagent\\log\\span map[]} v0.10.17-alpha.1 0.0.0.0 45000 45001 45002   C:\\ProgramData\\wssdagent []  map[authentication:0xc0000e9260 certificate:0xc0000e92d0 identity:0xc0000e9340 keyvault:0xc0000e93b0 secret:0xc0000e9420] false registry AKSEE AKSEE vmms 1}]","Attributes":{"CallerLocation":"server.go:107 "}}]}

We have a use case in which we want to run AKS EE in an air-gapped environment, so some clarity on this subject would be highly appreciated.
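For anyone scanning the same log, the excerpt above is one JSON object per line, so the failing spans can be pulled out with a few lines of Python. This is only a sketch: the function name `failed_annotations` and the filter (messages containing "error" or "FAILED") are my own choices, not part of any AKS Edge tooling.

```python
import json

def failed_annotations(lines):
    """Yield (span name, time, first line of message) for error-like annotations.

    Assumes one JSON object per line, as in the excerpt above.
    Lines that are not valid JSON (for example wrapped lines) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            span = json.loads(line)
        except json.JSONDecodeError:
            continue
        # "Annotations" is null for many spans, so fall back to an empty list.
        for ann in span.get("Annotations") or []:
            msg = ann.get("Message", "")
            if "error" in msg.lower() or "FAILED" in msg:
                # Keep only the first line; the rest is usually a stack trace.
                yield span.get("name"), ann.get("Time"), msg.splitlines()[0]
```

To use it on the file mentioned above, open C:\ProgramData\wssdagent\log\agent-log-0 and pass the file object to `failed_annotations`, printing each tuple it yields.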

erwinkersten commented 8 months ago

Hope this information provides additional insights into the installation issue in air-gapped scenarios.

Behavior Description: In an air-gapped environment, the 'wssdagent' service (version 0.10.17-alpha.1) encounters startup issues, preventing the successful creation of an AKS Edge Essentials node.

image

It seems that the wssdagent is not able to initialise the required CertificateInternal and IdentityInternal values when started:

HKLM\SOFTWARE\Microsoft\WssdAgent\v0.10.17-alpha.1\CertificateInternal
HKLM\SOFTWARE\Microsoft\WssdAgent\v0.10.17-alpha.1\IdentityInternal

Using Process Monitor (Sysinternals), we can see that these values are initially created (RegSetValue) but are subsequently removed (RegDeleteValue); see screenshot:

image

Observations:

Version Information:
WssdAgent Version: v0.10.17-alpha.1
Tested on: Windows Server Standard 2022

erwinkersten commented 6 months ago

@sh0pyd3v and @FooBar08 - After collaborating with Microsoft on troubleshooting this issue, we discovered that it's essential to configure a default gateway on one of the network interfaces (NICs). Even if you have a single isolated subnet, a default gateway must be set (it can be an arbitrary, non-existent IP address).

jagadishmurugan commented 6 months ago

If the IP address is set statically, ensure the DefaultGateway is also set (the gateway does not need to exist). You might not observe the issue in a DHCP environment (even without connectivity), since the adapter automatically gets a DefaultGateway there.
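A possible explanation for why a non-existent gateway is enough: connecting a UDP socket to a public address such as 8.8.8.8:80 is a common way for a program to find out which local IP it would use for outbound traffic. No packet is sent, but the OS still has to pick a route, and without a default gateway that lookup fails with "A socket operation was attempted to an unreachable network". Whether wssdagent does exactly this is an assumption; the log only shows the failing `dial udp 8.8.8.8:80`. A minimal Python sketch of the idiom (the function name `local_ip_for` is hypothetical):

```python
import socket

def local_ip_for(target="8.8.8.8", port=80):
    """Return the local IPv4 address the OS would use to reach target.

    Returns None if the OS has no route to the target, which is the
    situation on a host with a static IP and no default gateway.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            # For UDP, connect() only selects a route and a source address;
            # nothing is sent on the wire.
            s.connect((target, port))
        except OSError:
            return None
        return s.getsockname()[0]
```

On an isolated host, `local_ip_for()` returning None before the gateway is set and an address afterwards would be consistent with the behaviour described in this thread.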

rcheeran commented 6 months ago

@sh0pyd3v and @FooBar08, please let us know whether the fix described above (configuring the network adapter with an IP address, subnet, and default gateway) resolves your problem.

FooBar08 commented 6 months ago

Just did a quick test. Setting a default gateway on the primary network adapter resolves the issue.