LM-Development / aks-sample

Community project providing an undeprecated Microsoft Teams bot sample that runs on Azure Kubernetes Service
https://github.com/LM-Development/aks-sample/tree/main/Samples/PublicSamples/RecordingBot
MIT License

ERR_CERT_AUTHORITY_INVALID certificate error #59

Closed osamabinsaleem closed 1 month ago

osamabinsaleem commented 1 month ago

Even after waiting a long time, I still see the screen that says 'Your connection is not private'.

I followed all the steps mentioned in the tutorial and the outputs were pretty much the same as given. The only difference was when I ran this command:

 helm dependency build .\deploy\teams-recording-bot\

I got this error:

Error: no repository definition for https://kubernetes.github.io/ingress-nginx. Please add the missing repos via 'helm repo add'

So then I ran this command successfully:

 helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
"ingress-nginx" has been added to your repositories

After that, my build was successful:


PS C:\Users\osama\Desktop\MS_Bot\aks-sample\Samples\PublicSamples\RecordingBot> helm dependency build .\deploy\teams-recording-bot\
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "jetstack" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 1 charts
Downloading ingress-nginx from repo https://kubernetes.github.io/ingress-nginx
Deleting outdated charts

If you need the output of any more commands, please let me know. Also, I'm using Windows 11, so that shouldn't be an issue.
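
The workaround described above can be sketched as a small shell function. This is only a sketch, assuming `helm` is on the PATH and a POSIX shell is used; the function name `prepare_chart_deps` and the `HELM` override variable are hypothetical helpers and not part of the tutorial. The repository URL is the one named in the error message.

```shell
# Hypothetical helper: register the ingress-nginx repo, then build the chart dependencies.
# HELM can be overridden (for example HELM=echo) to preview the commands without running them.
HELM="${HELM:-helm}"

prepare_chart_deps() {
  chart_dir="$1"
  # Repository URL taken from the helm error message in this thread.
  "$HELM" repo add ingress-nginx https://kubernetes.github.io/ingress-nginx &&
    "$HELM" repo update &&
    "$HELM" dependency build "$chart_dir"
}

# Usage:
#   prepare_chart_deps ./deploy/teams-recording-bot
```

In PowerShell, as used in this thread, the same three `helm` commands can be run one after another.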

1fabi0 commented 1 month ago

Thanks, I will add the step of adding the ingress-nginx repo to our to-dos.

Regarding your issue, can you try running kubectl describe certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial, as described in the info box? It is possible that Let's Encrypt reached a certificate limit for the Azure region. In that case you either have to wait until the limit resets or use a CNAME entry of a custom domain that points to the Azure domain.

InDieTasten commented 1 month ago

Thanks for reaching out. When cert-manager is installed, it creates a bunch of CRDs. Can you query those CRDs in your cluster?

Things like cluster issuers, certificate requests, challenges and so on? When you describe those resources, the output might explain why cert-manager is unable to obtain certificates.

I'm wondering if you have a ClusterIssuer resource
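
A quick way to list the cert-manager resources mentioned here is sketched below. It assumes `kubectl` is configured for the cluster and that cert-manager's CRDs are installed; `list_cert_resources` and the `KUBECTL` override variable are hypothetical helper names.

```shell
# Hypothetical helper: list the cert-manager resources involved in issuing a certificate.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

list_cert_resources() {
  ns="$1"
  # ClusterIssuers are cluster-scoped, so no namespace is passed.
  "$KUBECTL" get clusterissuers
  # Certificates, CertificateRequests, Orders and Challenges are namespaced.
  "$KUBECTL" get certificates,certificaterequests,orders,challenges --namespace "$ns"
}

# Usage:
#   list_cert_resources recordingbottutorial
```

Any resource that is not ready can then be inspected with `kubectl describe`.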

osamabinsaleem commented 1 month ago

@1fabi0 I think you're right. Let's Encrypt reached a limit for that region, but I'm not sure why. This is what I see when I execute the command you gave:

  Issuer Ref:
    Group:      cert-manager.io
    Kind:       ClusterIssuer
    Name:       recordingbottutorial-issuer
  Secret Name:  ingress-tls-recordingbottutorial
  Usages:
    digital signature
    key encipherment
Status:
  Conditions:
    Last Transition Time:    2024-05-14T19:41:47Z
    Message:                 Issuing certificate as Secret does not exist
    Observed Generation:     1
    Reason:                  DoesNotExist
    Status:                  False
    Type:                    Ready
    Last Transition Time:    2024-05-15T02:41:53Z
    Message:                 The certificate request has failed to complete and will be retried: Failed to wait for order resource "ingress-tls-recordingbottutorial-1-3163291491" to become ready: order is in "errored" state: Failed to create Order: 429 urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for "eastus.cloudapp.azure.com". Retry after 2024-05-16T05:00:00Z: see https://letsencrypt.org/docs/rate-limits/
    Observed Generation:     1
    Reason:                  Failed
    Status:                  False
    Type:                    Issuing
  Failed Issuance Attempts:  4
  Last Failure Time:         2024-05-15T02:41:53Z
Events:                      <none>
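
The retry time is buried in a long status message. As a sketch, assuming the message keeps the "Retry after <timestamp>" wording shown above, it can be extracted with `sed`; `retry_after` is a hypothetical helper name.

```shell
# Hypothetical helper: print the "Retry after" timestamp found in a
# cert-manager status message read from stdin (prints nothing if absent).
retry_after() {
  sed -n 's/.*Retry after \([0-9T:-]*Z\).*/\1/p'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get certificate ingress-tls-recordingbottutorial \
#     --namespace recordingbottutorial \
#     -o jsonpath='{.status.conditions[?(@.type=="Issuing")].message}' | retry_after
```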

osamabinsaleem commented 1 month ago

I tried again after waiting as stated in the error message, but I'm still getting the same behavior. Should I expect it to work or behave differently if I deploy the bot to a different region?

1fabi0 commented 1 month ago

If you have your own DNS domain, you can configure a CNAME entry that points to the Azure domain of the AKS cluster, then update the bot service with your DNS name and redeploy to the cluster with your own DNS name as host. You could also try deleting the unsigned certificate with kubectl delete certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial, which you might need to do anyway if you change the DNS name.

InDieTasten commented 1 month ago

And yes, a different region would result in a different FQDN subdomain for the built-in domain of the IP. As far as we know, only eastus is currently experiencing rate-limit issues.
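
The error message earlier in this thread counted certificates against "eastus.cloudapp.azure.com", that is, the host name without its first label. Assuming other regions follow the same pattern, the domain a host shares its limit with can be sketched as follows; `rate_limit_domain` is a hypothetical helper name.

```shell
# Hypothetical helper: drop the first DNS label of a host name.
# Assumption: the limit is counted per <region>.cloudapp.azure.com,
# as the error message for eastus suggests.
rate_limit_domain() {
  printf '%s\n' "${1#*.}"
}

# Usage:
#   rate_limit_domain recordingbottutorial.westeurope.cloudapp.azure.com
```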

osamabinsaleem commented 1 month ago

So before working on the custom domain, I deployed everything in a new region, i.e. westus, but I still see the same error, i.e. NET::ERR_CERT_AUTHORITY_INVALID. If I execute kubectl describe certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial, I can see that the certificate is created this time:

  Issuer Ref:
    Group:      cert-manager.io
    Kind:       ClusterIssuer
    Name:       recordingbottutorial-issuer
  Secret Name:  ingress-tls-recordingbottutorial
  Usages:
    digital signature
    key encipherment
Status:
  Conditions:
    Last Transition Time:  2024-05-21T16:23:55Z
    Message:               Certificate is up to date and has not expired
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2024-08-19T15:23:51Z
  Not Before:              2024-05-21T15:23:52Z
  Renewal Time:            2024-07-20T15:23:51Z
  Revision:                1
Events:                    <none>

Any idea why this could be happening now?

InDieTasten commented 1 month ago

Hard to tell. If the cert is there, then nginx should also use it for the connection with the browser. Can you see the certificate details in your browser?

osamabinsaleem commented 1 month ago

This is what I see

[Two screenshots: 2024-05-22 at 10:21:28 AM and 10:21:45 AM]

1fabi0 commented 1 month ago

Interesting, you got a staging certificate from Let's Encrypt. Did you change anything in the chart, or did it just deliver the staging certificate to you? If you didn't change anything, can you please run kubectl describe clusterissuer recordingbottutorial-issuer
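
Whether an issuer points at the staging or the production endpoint can be read from its ACME server URL. The sketch below classifies the two Let's Encrypt directory URLs; `acme_environment` is a hypothetical helper name, and any other URL is reported as unknown.

```shell
# Hypothetical helper: classify a Let's Encrypt ACME directory URL.
acme_environment() {
  case "$1" in
    https://acme-staging-v02.api.letsencrypt.org/*) echo staging ;;
    https://acme-v02.api.letsencrypt.org/*) echo production ;;
    *) echo unknown ;;
  esac
}

# Usage (assumes kubectl access to the cluster):
#   acme_environment "$(kubectl get clusterissuer recordingbottutorial-issuer \
#     -o jsonpath='{.spec.acme.server}')"
```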

osamabinsaleem commented 1 month ago

I tried to fix the limit issue earlier by switching to the staging certificate and forgot to revert the change. My bad. Can I rebuild and deploy again?

1fabi0 commented 1 month ago

Yes, you can. You may need to delete the certificate with kubectl delete certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial

osamabinsaleem commented 1 month ago

I deleted the certificate like this:

kubectl delete certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial
certificate.cert-manager.io "ingress-tls-recordingbottutorial" deleted

And then I deployed it again. When I run kubectl describe clusterissuer recordingbottutorial-issuer, I get the following output:

Name:         recordingbottutorial-issuer
Namespace:
Labels:       app.kubernetes.io/managed-by=Helm
              helmAppVersion=1.3.1
              helmName=teams-recording-bot
              helmVersion=1.4.1
Annotations:  meta.helm.sh/release-name: recordingbottutorial
              meta.helm.sh/release-namespace: recordingbottutorial
API Version:  cert-manager.io/v1
Kind:         ClusterIssuer
Metadata:
  Creation Timestamp:  2024-05-21T15:22:43Z
  Generation:          2
  Resource Version:    620248
  UID:                 888e8e38-e91a-4419-8bcd-784320e69db0
Spec:
  Acme:
    Email:  tls-security@lm-ag.de
    Private Key Secret Ref:
      Name:  recordingbottutorial-issuer
    Server:  https://acme-v02.api.letsencrypt.org/directory
    Solvers:
      http01:
        Ingress:
          Ingress Class Name:  recordingbottutorial-ingress-nginx
Status:
  Acme:
    Last Private Key Hash:  fr9Yby//pm+e/RxOAPFfqLlD+rNy5PS3BOgaIrDN5w8=
    Last Registered Email:  tls-security@lm-ag.de
    Uri:                    https://acme-v02.api.letsencrypt.org/acme/acct/1739907162
  Conditions:
    Last Transition Time:  2024-05-21T15:22:44Z
    Message:               The ACME account was registered with the ACME server
    Observed Generation:   2
    Reason:                ACMEAccountRegistered
    Status:                True
    Type:                  Ready
Events:                    <none>

osamabinsaleem commented 1 month ago

And if I run kubectl describe certificate ingress-tls-recordingbottutorial --namespace recordingbottutorial, I see this:

Name:         ingress-tls-recordingbottutorial
Namespace:    recordingbottutorial
Labels:       app.kubernetes.io/managed-by=Helm
              helmAppVersion=1.3.1
              helmName=teams-recording-bot
              helmVersion=1.4.1
Annotations:  <none>
API Version:  cert-manager.io/v1
Kind:         Certificate
Metadata:
  Creation Timestamp:  2024-05-22T09:38:31Z
  Generation:          1
  Owner References:
    API Version:           networking.k8s.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Ingress
    Name:                  recordingbottutorial
    UID:                   22eb9988-b03e-42ab-bebc-72389ef0995a
  Resource Version:        618859
  UID:                     8b5ef662-b7a4-4e04-8bc4-761107248cf3
Spec:
  Dns Names:
    ppppppppppppppppp
  Issuer Ref:
    Group:      cert-manager.io
    Kind:       ClusterIssuer
    Name:       recordingbottutorial-issuer
  Secret Name:  ingress-tls-recordingbottutorial
  Usages:
    digital signature
    key encipherment
Status:
  Conditions:
    Last Transition Time:  2024-05-22T09:38:32Z
    Message:               Certificate is up to date and has not expired
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2024-08-19T15:23:51Z
  Not Before:              2024-05-21T15:23:52Z
  Renewal Time:            2024-07-20T15:23:51Z
Events:                    <none>

But when I reload the page, I get the same error, and if I view the certificate, I still see the 'Staging' label.
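
One way to see which CA signed the certificate actually stored in the cluster is to decode the Secret that the Certificate writes to. The following is only a sketch, assuming `kubectl`, `base64 -d` and `openssl` are available in a POSIX shell; `secret_issuer` and `is_staging_issuer` are hypothetical helper names, and matching on `(STAGING)` assumes the staging CA names carry that marker. Note that Not Before and Not After in the output above are identical to the values reported before the Certificate was deleted, which may mean the existing Secret was reused rather than a new certificate being issued.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: print the issuer line of the first certificate in a TLS Secret.
secret_issuer() {
  name="$1"
  ns="$2"
  "$KUBECTL" get secret "$name" --namespace "$ns" -o jsonpath='{.data.tls\.crt}' |
    base64 -d |
    openssl x509 -noout -issuer
}

# Hypothetical helper: succeed when an issuer line carries the "(STAGING)" marker.
is_staging_issuer() {
  case "$1" in
    *"(STAGING)"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   issuer="$(secret_issuer ingress-tls-recordingbottutorial recordingbottutorial)"
#   if is_staging_issuer "$issuer"; then echo "still a staging certificate"; fi
```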

1fabi0 commented 1 month ago

That looks good. If it still doesn't work, you can try deleting the certificate again: you deleted it first and deployed your change afterwards, so it might have been pulled from staging again in between. You might also need to restart the Windows pods, e.g. by restarting the cluster, so they pull the correct certificate. (The same could apply to the nginx pods, which may need to reload.)

Also, thanks for working with our email address from the example 😂

osamabinsaleem commented 1 month ago

Haha, I forgot to change that.

Also, while trying to restart the Windows pods, I noticed one thing:

 kubectl get pods -n recordingbottutorial
NAME                                                             READY   STATUS             RESTARTS      AGE
recordingbottutorial-0                                           0/1     CrashLoopBackOff   8 (60s ago)   19m
recordingbottutorial-1                                           0/1     CrashLoopBackOff   8 (17s ago)   19m
recordingbottutorial-2                                           0/1     CrashLoopBackOff   8 (28s ago)   19m
recordingbottutorial-ingress-nginx-controller-5bc86dbdd5-xwngh   1/1     Running            0             17m

Is this expected? The pods are in the CrashLoopBackOff state.

1fabi0 commented 1 month ago

It is not expected (they should start as soon as they can load a certificate). Maybe you can describe a pod to see what the issue is during start, or even try running kubectl logs for the pods. I could imagine that it does not have a certificate yet (you may now need to reopen the page in your browser), or that it also got the staging certificate now 😑
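
The checks suggested here can be sketched as one helper. It assumes `kubectl` is configured for the cluster; `diagnose_pod` and the `KUBECTL` override variable are hypothetical names.

```shell
# Hypothetical helper: collect the usual diagnostics for a crash-looping pod.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

diagnose_pod() {
  pod="$1"
  ns="$2"
  # Events at the end of the describe output often name the cause.
  "$KUBECTL" describe pod "$pod" --namespace "$ns"
  # Logs of the current container instance.
  "$KUBECTL" logs "$pod" --namespace "$ns"
  # --previous shows the logs of the last terminated container instance.
  "$KUBECTL" logs "$pod" --namespace "$ns" --previous
}

# Usage:
#   diagnose_pod recordingbottutorial-0 recordingbottutorial
```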

osamabinsaleem commented 1 month ago

Thanks a lot for your continuous help. I really appreciate it.

This is what I see:

Setup: Starting VC_redist
Setup: Converting certificate
Setup: Installing certificate
Certificate "test.cloudapp.azure.com" added to store.

CertUtil: -importPFX command completed successfully.
Setup: Deleting bindings
Setup: Adding bindings
Setup: Done
---------------------
RecordingBot: booting
fail: RecordingBot.Console[0]
      Unhandled exception in Boot()
      Status Code: 0
      Microsoft.Graph.Communications.Core.Exceptions.ServiceException: Media platform failed to initialize
       ---> System.InvalidOperationException: MediaPlatform needs a system with at least 2 cores for creation
         at Microsoft.Skype.Internal.Bots.Media.InternalMediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configurationManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings)
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         --- End of inner exception stack trace ---
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         at RecordingBot.Services.Bot.BotService.InitializeClient() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 63
         at RecordingBot.Services.Bot.BotService.Initialize() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 51
         at RecordingBot.Services.ServiceSetup.AppHost.Boot(String[] args) in C:\src\RecordingBot.Services\ServiceSetup\AppHost.cs:line 75

This also has my bot domain details

1fabi0 commented 1 month ago

I think you need to delete your certificate again and restart your cluster, as it seems you still got the staging certificate and your Windows pods have already pulled that certificate into the store. I don't believe that the nodes are too weak, but the default replicaCount might be too big for the number of nodes deployed in the tutorial. So if deleting the certificate and restarting the cluster does not fix your problem, you could also try deploying only two replicas with the scale.replicaCount option:

helm upgrade recordingbottutorial .\deploy\teams-recording-bot\ `
    --install `
    --namespace recordingbottutorial `
    --set image.registry="recordingbotregistry.azurecr.io/recordingbottutorial" `
    --set image.name="application" `
    --set image.tag="latest" `
    --set public.ip="255.255.255.255" `
    --set host="recordingbottutorial.westeurope.cloudapp.azure.com" `
    --set ingress.tls.email="tls-security@lm-ag.de" `
    --set scale.replicaCount=2

osamabinsaleem commented 1 month ago

I think the deletion is not permanent; the certificate is being created again automatically. I deleted the certificate and immediately checked again:

 kubectl get certificates -n recordingbottutorial
NAME                               READY   SECRET                             AGE
ingress-tls-recordingbottutorial   False   ingress-tls-recordingbottutorial   9s

1fabi0 commented 1 month ago

Since you shared the domain: this now seems to be a valid certificate from Let's Encrypt 😉

osamabinsaleem commented 1 month ago

Thanks @1fabi0 I need to remove that now :)

Also, I'm still seeing 503 Service Unavailable after waiting for quite a while. Is it because of the CrashLoopBackOff state of the pods?

1fabi0 commented 1 month ago

Yes, exactly. If you scale down the replica count to 2 instances and it works, please let me know; then I'll include that in the tutorial.

osamabinsaleem commented 1 month ago

I updated the Helm release with the flag --set scale.replicaCount=2, but two pods are still in the CrashLoopBackOff state. Do I need to delete them so that they are restarted?

1fabi0 commented 1 month ago

You could try. What do you see if you describe the pods?

osamabinsaleem commented 1 month ago

I see this warning at the end of the describe output:

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  certificate:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-tls-recordingbottutorial
    Optional:    false
  kube-api-access-xxt5b:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                           Age                    From     Message
  ----     ------                           ----                   ----     -------
  Warning  FailedToRetrieveImagePullSecret  2m3s (x624 over 132m)  kubelet  Unable to retrieve some image pull secrets (acr-secret); attempting to pull the image may not succeed.

1fabi0 commented 1 month ago

Did you change the node size or are you using D2s_v3?

osamabinsaleem commented 1 month ago

I'm using standard_d2s_v3

1fabi0 commented 1 month ago

And what do the logs say right now (kubectl logs)?

osamabinsaleem commented 1 month ago

Still the same:

 kubectl logs recordingbottutorial-0 -n recordingbottutorial
Setup: Starting VC_redist
Setup: Converting certificate
Setup: Installing certificate
Certificate "test.cloudapp.azure.com" added to store.

CertUtil: -importPFX command completed successfully.
Setup: Deleting bindings
Setup: Adding bindings
Setup: Done
---------------------
RecordingBot: booting
fail: RecordingBot.Console[0]
      Unhandled exception in Boot()
      Status Code: 0
      Microsoft.Graph.Communications.Core.Exceptions.ServiceException: Media platform failed to initialize
       ---> System.InvalidOperationException: MediaPlatform needs a system with at least 2 cores for creation
         at Microsoft.Skype.Internal.Bots.Media.InternalMediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configurationManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings, IConfigurationManager configManager, Boolean isTest)
         at Microsoft.Skype.Bots.Media.MediaPlatform.Initialize(MediaPlatformSettings settings)
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         --- End of inner exception stack trace ---
         at Microsoft.Graph.Communications.Calls.Media.MediaCommunicationsClientBuilderExtensions.SetMediaPlatformSettings(ICommunicationsClientBuilder statefulClientBuilder, MediaPlatformSettings mediaSettings)
         at RecordingBot.Services.Bot.BotService.InitializeClient() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 63
         at RecordingBot.Services.Bot.BotService.Initialize() in C:\src\RecordingBot.Services\Bot\BotService.cs:line 51
         at RecordingBot.Services.ServiceSetup.AppHost.Boot(String[] args) in C:\src\RecordingBot.Services\ServiceSetup\AppHost.cs:line 75

1fabi0 commented 1 month ago

I think the problem now is that the containers do not recognize the two cores. Can you please open a new issue for that, as your certificate issue seems to be solved?
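
As a starting point for that follow-up, the CPU count the cluster reports for each node and the resource settings of a pod can be inspected as sketched below. It assumes `kubectl` is configured for the cluster; `show_cpu_info` and the `KUBECTL` override variable are hypothetical names. The describe output earlier in this thread reports QoS Class BestEffort, which Kubernetes assigns only when no container sets CPU or memory requests or limits, so a CPU limit on the pod is unlikely to be the cause.

```shell
# Hypothetical helper: show node CPU capacity and a pod's resource settings.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

show_cpu_info() {
  pod="$1"
  ns="$2"
  # CPU capacity and allocatable CPU reported by each node.
  "$KUBECTL" get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,ALLOCATABLE:.status.allocatable.cpu
  # Requests and limits of the pod's containers (empty when none are set).
  "$KUBECTL" get pod "$pod" --namespace "$ns" -o jsonpath='{.spec.containers[*].resources}'
}

# Usage:
#   show_cpu_info recordingbottutorial-0 recordingbottutorial
```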