Closed osamabinsaleem closed 1 month ago
Thanks, I will add the step for adding the nginx repo to our to-dos.
Regarding your issue, can you try running kubectl describe certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial?
As described in the info box, it is possible that Let's Encrypt reached a certificate limit for the Azure region. In that case you either have to wait until the limit resets or use a CNAME entry of a custom domain that points to the Azure domain.
Thanks for reaching out. When cert-manager is installed, it creates a bunch of CRDs. Can you query those CRDs in your cluster?
Things like cluster issuer, certificate requests, challenges and so on? When you describe those resources, it might explain why it's unable to pull certificates.
I'm wondering if you have a ClusterIssuer resource
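The resources mentioned above can be listed in one go. This is a minimal sketch, assuming cert-manager's standard CRDs are installed; the function name show_cert_chain and the KUBECTL override variable are hypothetical helpers, not part of the tutorial.

```shell
# KUBECTL can be overridden, for example KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# List the cert-manager resources involved in issuing a certificate.
# $1: namespace of the bot deployment
show_cert_chain() {
  ns="$1"
  # ClusterIssuers are cluster-scoped, so no namespace is passed here.
  "$KUBECTL" get clusterissuer
  # Certificates, requests, ACME orders and challenges live in the namespace.
  "$KUBECTL" get certificate,certificaterequest,order,challenge --namespace "$ns"
}

# Usage: show_cert_chain recordingbottutorial
```

Any resource that is not Ready can then be inspected further with kubectl describe.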
@1fabi0 I think you're right. Let's Encrypt reached a limit for that region, but I'm not sure why. This is what I see when I run the command you gave:
  Issuer Ref:
    Group: cert-manager.io
    Kind: ClusterIssuer
    Name: recordingbottutorial-issuer
  Secret Name: ingress-tls-recordingbottutorial
  Usages:
    digital signature
    key encipherment
Status:
  Conditions:
    Last Transition Time: 2024-05-14T19:41:47Z
    Message: Issuing certificate as Secret does not exist
    Observed Generation: 1
    Reason: DoesNotExist
    Status: False
    Type: Ready
    Last Transition Time: 2024-05-15T02:41:53Z
    Message: The certificate request has failed to complete and will be retried: Failed to wait for order resource "ingress-tls-recordingbottutorial-1-3163291491" to become ready: order is in "errored" state: Failed to create Order: 429 urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for "eastus.cloudapp.azure.com". Retry after 2024-05-16T05:00:00Z: see https://letsencrypt.org/docs/rate-limits/
    Observed Generation: 1
    Reason: Failed
    Status: False
    Type: Issuing
  Failed Issuance Attempts: 4
  Last Failure Time: 2024-05-15T02:41:53Z
Events: <none>
I tried again after waiting as stated in the error message, but I'm still getting the same behavior. Should I expect it to work or behave differently if I deploy the bot to a different region?
If you have your own DNS domain, you can configure a CNAME entry that points to the Azure domain of the AKS cluster, then update the bot service with your DNS name and redeploy to the cluster with your own DNS name as host. You could also try to delete the unsigned certificate with kubectl delete certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial, which you might need to do anyway if you change the DNS name.
And yes, a different region would result in a different FQDN subdomain for the built-in domain of your IP. AFAWK, only eastus is currently experiencing rate-limit issues.
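To see when a rate-limited order may be retried, the timestamp can be pulled out of the status message shown earlier. A minimal sketch, assuming the message keeps the "Retry after <timestamp>" wording from the output above; retry_after is a hypothetical helper name.

```shell
# Print the "Retry after" timestamp found in a cert-manager status message
# read from stdin. Prints nothing if no such timestamp is present.
retry_after() {
  sed -n 's/.*Retry after \([0-9-]*T[0-9:]*Z\).*/\1/p'
}

# Usage:
#   kubectl describe certificate ingress-tls-recordingbottutorial \
#     --namespace recordingbottutorial | retry_after
```

If nothing is printed, the certificate is not currently reporting a rate limit.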
So before working on the custom domain, I deployed everything in a new region, i.e. westus, but I still see the same error, i.e. NET::ERR_CERT_AUTHORITY_INVALID. If I execute kubectl describe certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial, I can see that the certificate is created this time:
  Issuer Ref:
    Group: cert-manager.io
    Kind: ClusterIssuer
    Name: recordingbottutorial-issuer
  Secret Name: ingress-tls-recordingbottutorial
  Usages:
    digital signature
    key encipherment
Status:
  Conditions:
    Last Transition Time: 2024-05-21T16:23:55Z
    Message: Certificate is up to date and has not expired
    Observed Generation: 1
    Reason: Ready
    Status: True
    Type: Ready
  Not After: 2024-08-19T15:23:51Z
  Not Before: 2024-05-21T15:23:52Z
  Renewal Time: 2024-07-20T15:23:51Z
  Revision: 1
Events: <none>
Any idea why this could be happening now?
Hard to tell. If the certificate is there, nginx should also use it for the connection with the browser. Can you see the certificate details in your browser?
This is what I see
Interesting, you got a staging certificate from Let's Encrypt. Did you change anything in the chart, or did it just deliver the staging certificate to you? If you didn't change anything, can you please run kubectl describe clusterissuer recordingbottutorial-issuer
I tried to fix the limit issue earlier by using the staging certificate and forgot to revert the change. My bad. Can I rebuild and deploy again?
Yes, you can. You may need to delete the certificate with kubectl delete certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial
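A quick way to tell which Let's Encrypt environment a ClusterIssuer points at is to look at its ACME server URL. The sketch below only classifies the two public Let's Encrypt directory URLs; acme_env is a hypothetical helper name.

```shell
# Classify an ACME directory URL as Let's Encrypt production or staging.
acme_env() {
  case "$1" in
    https://acme-v02.api.letsencrypt.org/*)         echo production ;;
    https://acme-staging-v02.api.letsencrypt.org/*) echo staging ;;
    *)                                              echo unknown ;;
  esac
}

# Usage:
#   acme_env "$(kubectl get clusterissuer recordingbottutorial-issuer \
#     -o jsonpath='{.spec.acme.server}')"
```

Certificates issued by the staging environment are not trusted by browsers, which matches the NET::ERR_CERT_AUTHORITY_INVALID error reported here.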
I deleted the certificate like this:
kubectl delete certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial
certificate.cert-manager.io "ingress-tls-recordingbottutorial" deleted
And then I deployed it again. When I run kubectl describe clusterissuer recordingbottutorial-issuer, I get the following output:
Name: recordingbottutorial-issuer
Namespace:
Labels: app.kubernetes.io/managed-by=Helm
        helmAppVersion=1.3.1
        helmName=teams-recording-bot
        helmVersion=1.4.1
Annotations: meta.helm.sh/release-name: recordingbottutorial
             meta.helm.sh/release-namespace: recordingbottutorial
API Version: cert-manager.io/v1
Kind: ClusterIssuer
Metadata:
  Creation Timestamp: 2024-05-21T15:22:43Z
  Generation: 2
  Resource Version: 620248
  UID: 888e8e38-e91a-4419-8bcd-784320e69db0
Spec:
  Acme:
    Email: tls-security@lm-ag.de
    Private Key Secret Ref:
      Name: recordingbottutorial-issuer
    Server: https://acme-v02.api.letsencrypt.org/directory
    Solvers:
      http01:
        Ingress:
          Ingress Class Name: recordingbottutorial-ingress-nginx
Status:
  Acme:
    Last Private Key Hash: fr9Yby//pm+e/RxOAPFfqLlD+rNy5PS3BOgaIrDN5w8=
    Last Registered Email: tls-security@lm-ag.de
    Uri: https://acme-v02.api.letsencrypt.org/acme/acct/1739907162
  Conditions:
    Last Transition Time: 2024-05-21T15:22:44Z
    Message: The ACME account was registered with the ACME server
    Observed Generation: 2
    Reason: ACMEAccountRegistered
    Status: True
    Type: Ready
Events: <none>
And if I run kubectl describe certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial, I see this:
Name: ingress-tls-recordingbottutorial
Namespace: recordingbottutorial
Labels: app.kubernetes.io/managed-by=Helm
        helmAppVersion=1.3.1
        helmName=teams-recording-bot
        helmVersion=1.4.1
Annotations: <none>
API Version: cert-manager.io/v1
Kind: Certificate
Metadata:
  Creation Timestamp: 2024-05-22T09:38:31Z
  Generation: 1
  Owner References:
    API Version: networking.k8s.io/v1
    Block Owner Deletion: true
    Controller: true
    Kind: Ingress
    Name: recordingbottutorial
    UID: 22eb9988-b03e-42ab-bebc-72389ef0995a
  Resource Version: 618859
  UID: 8b5ef662-b7a4-4e04-8bc4-761107248cf3
Spec:
  Dns Names:
    ppppppppppppppppp
  Issuer Ref:
    Group: cert-manager.io
    Kind: ClusterIssuer
    Name: recordingbottutorial-issuer
  Secret Name: ingress-tls-recordingbottutorial
  Usages:
    digital signature
    key encipherment
Status:
  Conditions:
    Last Transition Time: 2024-05-22T09:38:32Z
    Message: Certificate is up to date and has not expired
    Observed Generation: 1
    Reason: Ready
    Status: True
    Type: Ready
  Not After: 2024-08-19T15:23:51Z
  Not Before: 2024-05-21T15:23:52Z
  Renewal Time: 2024-07-20T15:23:51Z
Events: <none>
But when I reload the page I get the same error, and if I look at the certificate, I still see the 'Staging' label.
That looks good. If it still doesn't work, you can try deleting the certificate again: you first deleted it and then deployed your change, so it might have been pulled from staging again in between. Regarding the Windows pods, you might need to restart them (e.g. by restarting the cluster) so they pull the correct certificate. The same could apply to the nginx pods, which may need to reload.
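One possible explanation for the staging certificate surviving the deletion: the Not Before and Not After values in the second certificate output are identical to those in the first, which suggests the recreated Certificate picked up the existing Secret instead of requesting a new certificate. By default cert-manager does not delete the Secret when its Certificate is deleted. A minimal sketch that removes both, assuming the Secret has the same name as the Certificate (as in the outputs above); force_reissue and the KUBECTL override are hypothetical helpers.

```shell
# KUBECTL can be overridden, for example KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Delete a Certificate and its backing Secret so that cert-manager has to
# request a fresh certificate from the issuer.
# $1: namespace, $2: name shared by the Certificate and the Secret
force_reissue() {
  ns="$1"
  name="$2"
  "$KUBECTL" delete certificate "$name" --namespace "$ns"
  "$KUBECTL" delete secret "$name" --namespace "$ns"
}

# Usage: force_reissue recordingbottutorial ingress-tls-recordingbottutorial
```

Pods that already imported the old certificate would still need a restart afterwards to load the new one.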
Also, thanks for working with our email address from the example 😂
Haha, I forgot to change that.
Also, while trying to restart the Windows pods, I noticed one thing:
kubectl get pods -n recordingbottutorial
NAME                                                             READY   STATUS             RESTARTS      AGE
recordingbottutorial-0                                           0/1     CrashLoopBackOff   8 (60s ago)   19m
recordingbottutorial-1                                           0/1     CrashLoopBackOff   8 (17s ago)   19m
recordingbottutorial-2                                           0/1     CrashLoopBackOff   8 (28s ago)   19m
recordingbottutorial-ingress-nginx-controller-5bc86dbdd5-xwngh   1/1     Running            0             17m
Is it expected? The pods are in CrashLoopBackOff state.
It is not expected (they should start as soon as they can load a certificate). Maybe you can describe a pod to see what the issue is during start, or try running kubectl logs for the pods. I could imagine that the pod does not have a certificate, that you now need to reopen the page in your browser, or that it also got the staging certificate now 😑
Thanks a lot for your continuous help. I really appreciate that.
This is what I see:
Setup: Starting VC_redist
Setup: Converting certificate
Setup: Installing certificate
Certificate "test.cloudapp.azure.com" added to store.
CertUtil: -importPFX command completed successfully.
Setup: Deleting bindings
Setup: Adding bindings
Setup: Done
---------------------
RecordingBot: booting
fail: RecordingBot.Console[0]
      Unhandled exception in Boot()
      Status Code: 0
      Microsoft.Graph.Communications.Core.Exceptions.ServiceException: Media platform failed to initialize
       ---> System.InvalidOperationException: MediaPlatform needs a system with at least 2 cores for creation
         at Microsoft.Skype.Internal.Bots.Media.InternalMediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configurationManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings)
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         --- End of inner exception stack trace ---
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         at RecordingBot.Services.Bot.BotService.InitializeClient() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 63
         at RecordingBot.Services.Bot.BotService.Initialize() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 51
         at RecordingBot.Services.ServiceSetup.AppHost.Boot(String[] args) in C:\src\RecordingBot.Services\ServiceSetup\AppHost.cs:line 75
This also has my bot domain details
I think you need to delete your certificate again and restart your cluster, as it seems you still have the staging certificate and your Windows pods have already imported it into their store. I don't believe the nodes are too weak, but the default replicaCount might be too big for the number of nodes deployed in the tutorial. So if deleting the certificate and restarting the cluster does not fix your problem, you could also try deploying only two replicas with the scale.replicaCount option:
helm upgrade recordingbottutorial .\deploy\teams-recording-bot\ `
  --install `
  --namespace recordingbottutorial `
  --set image.registry="recordingbotregistry.azurecr.io/recordingbottutorial" `
  --set image.name="application" `
  --set image.tag="latest" `
  --set public.ip="255.255.255.255" `
  --set host="recordingbottutorial.westeurope.cloudapp.azure.com" `
  --set ingress.tls.email="tls-security@lm-ag.de" `
  --set scale.replicaCount=2
I think the deletion is not permanent and the certificate is being created again automatically. I delete the certificate and immediately check again like this:
kubectl get certificates -n recordingbottutorial
NAME                               READY   SECRET                             AGE
ingress-tls-recordingbottutorial   False   ingress-tls-recordingbottutorial   9s
As you shared the domain, this now seems to be a valid certificate from Let's Encrypt 😉
Thanks @1fabi0, I need to remove that now :)
Also, I'm still seeing 503 Service Unavailable after waiting for quite a while. Is it because of the CrashLoopBackOff state of the pods?
Yes, exactly. If you scale down the replica count to 2 instances and it works, please let me know; then I'll include that in the tutorial.
I updated the Helm release with the flag --set scale.replicaCount=2, but two pods are still in the CrashLoopBackOff state. Do I need to delete them so that they are restarted?
You could try, what do you see if you describe the pods?
I see this warning at the end of the describe output:
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  certificate:
    Type: Secret (a volume populated by a Secret)
    SecretName: ingress-tls-recordingbottutorial
    Optional: false
  kube-api-access-xxt5b:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                           Age                     From     Message
  ----     ------                           ----                    ----     -------
  Warning  FailedToRetrieveImagePullSecret  2m3s (x624 over 132m)   kubelet  Unable to retrieve some image pull secrets (acr-secret); attempting to pull the image may not succeed.
Did you change the node size or are you using D2s_v3?
I'm using standard_d2s_v3
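Since the log complains about fewer than 2 cores while Standard_D2s_v3 nodes provide 2 vCPUs, it can help to confirm what the nodes themselves report. A minimal sketch; node_cpus and the KUBECTL override are hypothetical helpers, and the output shows node-level values only, not what a container actually sees.

```shell
# KUBECTL can be overridden, for example KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Print the CPU capacity and allocatable CPU of every node in the cluster.
node_cpus() {
  "$KUBECTL" get nodes \
    -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,ALLOCATABLE:.status.allocatable.cpu
}

# Usage: node_cpus
```

If the nodes report 2 CPUs, the limitation is more likely on the container side than in the node size.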
And what do the logs say right now (kubectl logs)?
Still the same:
kubectl logs recordingbottutorial-0 -n recordingbottutorial
Setup: Starting VC_redist
Setup: Converting certificate
Setup: Installing certificate
Certificate "test.cloudapp.azure.com" added to store.
CertUtil: -importPFX command completed successfully.
Setup: Deleting bindings
Setup: Adding bindings
Setup: Done
---------------------
RecordingBot: booting
fail: RecordingBot.Console[0]
      Unhandled exception in Boot()
      Status Code: 0
      Microsoft.Graph.Communications.Core.Exceptions.ServiceException: Media platform failed to initialize
       ---> System.InvalidOperationException: MediaPlatform needs a system with at least 2 cores for creation
         at Microsoft.Skype.Internal.Bots.Media.InternalMediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configurationManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings)
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         --- End of inner exception stack trace ---
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         at RecordingBot.Services.Bot.BotService.InitializeClient() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 63
         at RecordingBot.Services.Bot.BotService.Initialize() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 51
         at RecordingBot.Services.ServiceSetup.AppHost.Boot(String[] args) in C:\src\RecordingBot.Services\ServiceSetup\AppHost.cs:line 75
I think the problem is now that the containers do not recognize the two cores. Can you please open a new issue for that, as your certificate issue seems to be solved?
Even after waiting for a long time, I still see the screen that says 'Your connection is not private'.
I followed all the steps mentioned in the tutorial and the outputs were pretty much the same as given. The only difference was when I ran this command:
I got this error:
So then I ran this command successfully:
And after that my build was successful like this:
If you need the output of any more commands, please let me know. Also, I'm using Windows 11, so that shouldn't be an issue.