GoogleCloudPlatform / microservices-demo

Sample cloud-first application with 10 microservices showcasing Kubernetes, Istio, and gRPC.
https://cymbal-shops.retail.cymbal.dev
Apache License 2.0

cartservice and loadgenerator crash at start #283

Closed Nusserdt closed 4 years ago

Nusserdt commented 4 years ago

I followed the installation steps from option 3. I already had a running Kubernetes cluster, so I only executed:

kubectl apply -f ./release/kubernetes-manifests.yaml

When I evaluate the result with kubectl get pods I get:

NAME                                     READY   STATUS             RESTARTS   AGE
adservice-55f9757757-9tb2h               1/1     Running            0          16m
cartservice-684bb46b44-b6dvk             0/1     CrashLoopBackOff   8          16m
checkoutservice-6fcc84467f-gm79f         1/1     Running            0          16m
currencyservice-6c7c479d45-qfnl6         1/1     Running            0          16m
emailservice-8dd9b76cc-7j6zj             1/1     Running            0          16m
frontend-7d8cfc75b5-dqnkc                1/1     Running            0          16m
loadgenerator-5db67d555-fq42k            0/1     CrashLoopBackOff   7          16m
paymentservice-84ffc75c55-mzwfx          1/1     Running            0          16m
productcatalogservice-d564bdf4c-bch2r    1/1     Running            0          16m
recommendationservice-76598d5889-p9qhm   1/1     Running            0          16m
redis-cart-5f59546cdd-rqqdf              1/1     Running            0          16m
shippingservice-b6db65f7f-t54ng          1/1     Running            0          16m

cartservice and loadgenerator are not able to start.


Logs

kubectl logs cartservice-684bb46b44-b6dvk

Started as process with id 1
Reading host address from LISTEN_ADDR environment variable
Reading cart service port from PORT environment variable
Reading redis cache address from environment variable REDIS_ADDR
Connecting to Redis: redis-cart:6379,ssl=false,allowAdmin=true,connectRetry=5
StackExchange.Redis.RedisConnectionException: It was not possible to connect to the redis server(s). UnableToConnect on redis-cart:6379/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 5s ago, last-write: 5s ago, keep-alive: 180s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 10s ago, v: 2.0.601.3402
   at StackExchange.Redis.ConnectionMultiplexer.ConnectImpl(Object configuration, TextWriter log) in C:\projects\stackexchange-redis\src\StackExchange.Redis\ConnectionMultiplexer.cs:line 955
   at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /app/cartstore/RedisCartStore.cs:line 80
   at cartservice.cartstore.RedisCartStore.InitializeAsync() in /app/cartstore/RedisCartStore.cs:line 60
   at cartservice.Program.<>c__DisplayClass4_0.<<StartServer>b__0>d.MoveNext() in /app/Program.cs:line 54

kubectl logs loadgenerator-5db67d555-fq42k

./loadgen.sh: 21: ./loadgen.sh: [[: not found
+ curl --silent --output /dev/stderr --write-out %{http_code} http://frontend:80
+ STATUSCODE=000

Machine

Debian 10, behind a proxy, locally hosted

ahmetb commented 4 years ago

It's normal that cartservice may crash a few times before it's ready (since it has a loose dependency on redis).

However this error is concerning:

./loadgen.sh: 21: ./loadgen.sh: [[: not found

it seems like the underlying image somehow changed under us. https://github.com/GoogleCloudPlatform/microservices-demo/blob/master/src/loadgenerator/Dockerfile#L1 Does python:3-slim no longer have [[ executable (but it somehow has bash to execute the loadgen.sh)?

@DanSanche would be great if you have time to take a look.

daniel-sanche commented 4 years ago

Hmm I wasn't able to reproduce this. I saw some crashes on those two services early on, but they stabilized after ~2mins.

@ahmetb Looking through the logs, it looks like ./loadgen.sh: 21: ./loadgen.sh: [[: not found is a red herring. That error always shows up, even when the service is working properly. It looks like loadgen.sh is actually run with #!/bin/sh -eu on line 1 of the file, not the #!/bin/bash that comes on line 17. I can try to fix that soon.

@Nusserdt I'm not sure why you're having trouble, cartservice should definitely start working within the 16 mins you waited for. Is there anything related to your network that could cause communication issues between those services? What have you tried to debug the issue?

ahmetb commented 4 years ago

#!/bin/sh -eu on line 1 of the file, not the #!/bin/bash that comes on line 17. I can try to fix that soon

yes this sounds like the culprit.
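To illustrate the shebang point: `[[ ... ]]` is a bash extension, and on Debian-based images `/bin/sh` is typically dash, which reports it as `[[: not found`. A minimal sketch of a POSIX-compatible form of the `FRONTEND_ADDR` check follows. The function name `require_frontend_addr` is hypothetical and this is not necessarily how the repository fixed it.

```shell
#!/bin/sh
# Hypothetical helper (not from loadgen.sh): POSIX-sh version of the check.
# `[ ... ]` works in every POSIX shell; `[[ ... ]]` only works in bash-like shells.
require_frontend_addr() {
    # ${FRONTEND_ADDR:-} expands to an empty string when the variable is unset,
    # so the test stays safe even under `set -u`.
    if [ -z "${FRONTEND_ADDR:-}" ]; then
        echo >&2 "FRONTEND_ADDR not specified"
        return 1
    fi
    return 0
}
```

Calling `require_frontend_addr || exit 1` at the top of a script would keep the original behavior while running under either `sh` or `bash`.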

Nusserdt commented 4 years ago

@DanSanche our environment is behind a proxy, and we are trying to add the relevant proxy information. But I can confirm that cartservice definitely fails to start:

NAME                                     READY   STATUS             RESTARTS   AGE
adservice-55f9757757-6js8m               1/1     Running            0          151m
cartservice-684bb46b44-g4vwt             0/1     CrashLoopBackOff   47         151m
checkoutservice-6fcc84467f-24jjh         1/1     Running            0          151m
currencyservice-6c7c479d45-zn5bk         1/1     Running            0          151m
emailservice-8dd9b76cc-jwcmx             1/1     Running            0          151m
frontend-7d8cfc75b5-sw7th                1/1     Running            0          151m
loadgenerator-5db67d555-l29l6            0/1     CrashLoopBackOff   32         151m
paymentservice-84ffc75c55-8jlqj          1/1     Running            0          151m
productcatalogservice-d564bdf4c-zz8rh    1/1     Running            0          151m
recommendationservice-76598d5889-gb4kt   1/1     Running            0          151m
redis-cart-5f59546cdd-b5m6p              1/1     Running            0          151m
shippingservice-b6db65f7f-54968          1/1     Running            0          151m

I also have the problem that the frontend service shows <pending> for the external IP. Is this related to the failing loadgenerator service?

NAME                TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
frontend-external   LoadBalancer   10.104.218.246   <pending>     80:31248/TCP   152m

ahmetb commented 4 years ago

Is this related to the failing loadgenerator service?

most likely this is about the cloud provider you're using (what are you using?). run kubectl describe to see if there are any failure events there. on minikube etc, this is not supposed to work.
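One possible way to follow that suggestion, assuming the Service name `frontend-external` from the output above. The function name is hypothetical, and the commands are wrapped in a function so nothing runs until it is called. On a self-hosted cluster without a load-balancer implementation, `EXTERNAL-IP` typically stays `<pending>`, while the node port shown under `PORT(S)` (31248 here) should still be reachable on the nodes.

```shell
# Hypothetical helper: inspect why frontend-external has no external IP.
inspect_frontend_external() {
    # The Events section at the end of the output lists provisioning failures, if any.
    kubectl describe service frontend-external
    # Recent events in the current namespace, oldest first.
    kubectl get events --sort-by=.metadata.creationTimestamp
}
```

Run it with `inspect_frontend_external` against the current kubectl context.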

Nusserdt commented 4 years ago

@ahmetb We host the cluster locally on 3 Debian machines: 1 master, 2 nodes. Could you specify which resource I should run kubectl describe on?

NAME                              SHORTNAMES   APIGROUP                       NAMESPACED   KIND
bindings                                                                      true         Binding
componentstatuses                 cs                                          false        ComponentStatus
configmaps                        cm                                          true         ConfigMap
endpoints                         ep                                          true         Endpoints
events                            ev                                          true         Event
limitranges                       limits                                      true         LimitRange
namespaces                        ns                                          false        Namespace
nodes                             no                                          false        Node
persistentvolumeclaims            pvc                                         true         PersistentVolumeClaim
persistentvolumes                 pv                                          false        PersistentVolume
pods                              po                                          true         Pod
podtemplates                                                                  true         PodTemplate
replicationcontrollers            rc                                          true         ReplicationController
resourcequotas                    quota                                       true         ResourceQuota
secrets                                                                       true         Secret
serviceaccounts                   sa                                          true         ServiceAccount
services                          svc                                         true         Service
mutatingwebhookconfigurations                  admissionregistration.k8s.io   false        MutatingWebhookConfiguration
validatingwebhookconfigurations                admissionregistration.k8s.io   false        ValidatingWebhookConfiguration
customresourcedefinitions         crd,crds     apiextensions.k8s.io           false        CustomResourceDefinition
apiservices                                    apiregistration.k8s.io         false        APIService
controllerrevisions                            apps                           true         ControllerRevision
daemonsets                        ds           apps                           true         DaemonSet
deployments                       deploy       apps                           true         Deployment
replicasets                       rs           apps                           true         ReplicaSet
statefulsets                      sts          apps                           true         StatefulSet
tokenreviews                                   authentication.k8s.io          false        TokenReview
localsubjectaccessreviews                      authorization.k8s.io           true         LocalSubjectAccessReview
selfsubjectaccessreviews                       authorization.k8s.io           false        SelfSubjectAccessReview
selfsubjectrulesreviews                        authorization.k8s.io           false        SelfSubjectRulesReview
subjectaccessreviews                           authorization.k8s.io           false        SubjectAccessReview
horizontalpodautoscalers          hpa          autoscaling                    true         HorizontalPodAutoscaler
cronjobs                          cj           batch                          true         CronJob
jobs                                           batch                          true         Job
certificatesigningrequests        csr          certificates.k8s.io            false        CertificateSigningRequest
leases                                         coordination.k8s.io            true         Lease
endpointslices                                 discovery.k8s.io               true         EndpointSlice
events                            ev           events.k8s.io                  true         Event
ingresses                         ing          extensions                     true         Ingress
ingresses                         ing          networking.k8s.io              true         Ingress
networkpolicies                   netpol       networking.k8s.io              true         NetworkPolicy
runtimeclasses                                 node.k8s.io                    false        RuntimeClass
poddisruptionbudgets              pdb          policy                         true         PodDisruptionBudget
podsecuritypolicies               psp          policy                         false        PodSecurityPolicy
clusterrolebindings                            rbac.authorization.k8s.io      false        ClusterRoleBinding
clusterroles                                   rbac.authorization.k8s.io      false        ClusterRole
rolebindings                                   rbac.authorization.k8s.io      true         RoleBinding
roles                                          rbac.authorization.k8s.io      true         Role
priorityclasses                   pc           scheduling.k8s.io              false        PriorityClass
csidrivers                                     storage.k8s.io                 false        CSIDriver
csinodes                                       storage.k8s.io                 false        CSINode
storageclasses                    sc           storage.k8s.io                 false        StorageClass
volumeattachments                              storage.k8s.io                 false        VolumeAttachment

We are trying to fix loadgen.sh ourselves. What we don't understand is how to apply changes to the pods. Do we have to execute skaffold run to "rebuild" the deployment? Unfortunately, skaffold run also throws errors like:

gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Igooglecloudprofiler/src -I/usr/local/include/python3.7m -c googlecloudprofiler/src/log.cc -o build/temp.linux-x86_64-3.7/googlecloudprofiler/src/log.o -std=c++11
  unable to execute 'gcc': No such file or directory
  error: command 'gcc' failed with exit status 1

in [emailservice] Step: 5/14

itlinux commented 4 years ago

well looks like I get the same error on cartservice crash...

cartservice-684bb46b44-2l9bd   0/1   Completed          0   11s
cartservice-684bb46b44-2l9bd   0/1   Running            1   12s
cartservice-684bb46b44-2l9bd   0/1   Completed          1   18s
cartservice-684bb46b44-2l9bd   0/1   CrashLoopBackOff   1   24s
cartservice-684bb46b44-2l9bd   0/1   Running            2   36s
cartservice-684bb46b44-2l9bd   0/1   Completed          2   45s
cartservice-684bb46b44-2l9bd   0/1   CrashLoopBackOff   2   54s
cartservice-684bb46b44-2l9bd   0/1   Running            3   69s
cartservice-684bb46b44-2l9bd   0/1   Completed          3   76s
cartservice-684bb46b44-2l9bd   0/1   CrashLoopBackOff   3   84s
cartservice-684bb46b44-2l9bd   0/1   Running            4   2m4s
cartservice-684bb46b44-2l9bd   0/1   Completed          4   2m11s
cartservice-684bb46b44-2l9bd   0/1   CrashLoopBackOff   4   2m14s
^C
[root@node1 ~]# kubectl logs cartservice-684bb46b44-2l9bd
Started as process with id 1
Reading host address from LISTEN_ADDR environment variable
Reading cart service port from PORT environment variable
Reading redis cache address from environment variable REDIS_ADDR
Connecting to Redis: redis-cart:6379,ssl=false,allowAdmin=true,connectRetry=5
StackExchange.Redis.RedisConnectionException: It was not possible to connect to the redis server(s). UnableToConnect on redis-cart:6379/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 1s ago, last-write: 1s ago, keep-alive: 180s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 3s ago, v: 2.0.601.3402
   at StackExchange.Redis.ConnectionMultiplexer.ConnectImpl(Object configuration, TextWriter log) in C:\projects\stackexchange-redis\src\StackExchange.Redis\ConnectionMultiplexer.cs:line 955
   at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /app/cartstore/RedisCartStore.cs:line 80
   at cartservice.cartstore.RedisCartStore.InitializeAsync() in /app/cartstore/RedisCartStore.cs:line 60
   at cartservice.Program.<>c__DisplayClass4_0.<<StartServer>b__0>d.MoveNext() in /app/Program.cs:line 54
[root@node1 ~]#

Looks like cartservice endpoint is empty.

kubectl get endpoints

NAME              ENDPOINTS                             AGE
adservice         10.233.92.12:9555                     24m
apache2           10.233.90.35:80,10.233.96.41:80       19h
blue              10.233.90.41:5000,10.233.96.35:5000   5h41m
cartservice                                             24m
checkoutservice   10.233.96.36:5050

daniel-sanche commented 4 years ago

@Nusserdt Yes, skaffold run should rebuild and run the containers. Are you doing the building on debian as well? What version? Do you use Docker often? Does it usually give you issues like this? It seems strange you're having issues building the container. Docker is supposed to fix exactly these "it works on my machine" issues

@itlinux the logs look like you were waiting for only 2 minutes. Did you try letting it run a little longer? It's currently expected that the cartservice will crash a couple times until the redis service is completely ready. I may look into fixing this at some point soon

daniel-sanche commented 4 years ago

@Nusserdt Also, FWIW I don't think there are any issues with the load generator. These errors are consistent with redis not being ready. Can you post the logs from the redis pod?
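A sketch of how the requested Redis-side information could be collected. It assumes the Deployment and the Service are both named `redis-cart`, inferred from the pod name `redis-cart-...` and the `redis-cart:6379` address in the cartservice log. The function name is hypothetical.

```shell
# Hypothetical helper: gather the Redis-side evidence asked for above.
# Assumption: Deployment and Service are both called redis-cart.
collect_redis_info() {
    # Logs from a pod belonging to the redis-cart Deployment.
    kubectl logs deployment/redis-cart
    # An empty ENDPOINTS column means no Ready pod currently backs the Service.
    kubectl get endpoints redis-cart
}
```

Calling `collect_redis_info` prints both outputs in sequence, which makes them easy to paste into a comment.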

itlinux commented 4 years ago

I had it running overnight and still no go. So I removed it.

On 22 Jan 2020, at 10:16, Daniel Sanche notifications@github.com wrote:

 @Nusserdt what platform are you using? Windows? Do you use Docker often? Does it usually give you issues like this? It seems strange you're having issues building the container. Docker is supposed to fix exactly these "it works on my machine" issues

@itlinux the logs look like you were waiting for only 2 minutes. Did you try letting it run a little longer? It's currently expected that the cartservice will crash a couple times until the redis service is completely ready. I may look into fixing this at some point soon


ahmetb commented 4 years ago

I've just deployed to GKE with kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/kubernetes-manifests.yaml and I can't repro this crashloop of cartservice.

And since we changed nothing, I'm suspecting something is wrong on your side @Nusserdt. Could there be issues in your cluster networking? Since you said:

We host the cluster locally on 3 debian machines.

I'm suspecting this is a setup issue. If the project is working on GKE and locally on Docker-for-Desktop's Kubernetes (or Minikube) there's likely nothing we can do here.

That said we should fix the loadgen startup error ./loadgen.sh: 21: ./loadgen.sh: [[: not found

daniel-sanche commented 4 years ago

That said we should fix the loadgen startup error ./loadgen.sh: 21: ./loadgen.sh: [[: not found

This issue should be fixed in https://github.com/GoogleCloudPlatform/microservices-demo/pull/284

Nusserdt commented 4 years ago

@DanSanche we use skaffold v1.1.0, and it likely doesn't work because I am not able to pass our proxy information. With docker I can configure ~/.docker/config.json, and after that docker build works like a charm. We also run the build on the Debian 10 (master) machine.

Now I have commented out the lines that were responsible for the error:

#if [[ -z "${FRONTEND_ADDR}" ]]; then
#    echo >&2 "FRONTEND_ADDR not specified"
#    exit 1
#fi

and updated the Docker image, pushed it to our registry, and replaced the entry inside kubernetes-manifests.yaml. But the loadgenerator pod is still failing:

NAME                                     READY   STATUS             RESTARTS   AGE
adservice-55f9757757-j9mhs               1/1     Running            1          19h
cartservice-684bb46b44-f8s6b             0/1     CrashLoopBackOff   329        19h
checkoutservice-6fcc84467f-x8cp6         1/1     Running            1          19h
currencyservice-6c7c479d45-pklv5         1/1     Running            1          19h
emailservice-8dd9b76cc-8lx7j             1/1     Running            1          19h
frontend-7d8cfc75b5-tzp9h                1/1     Running            1          19h
loadgenerator-76875cfd5f-kn5m5           0/1     CrashLoopBackOff   6          31m
paymentservice-84ffc75c55-vlb6j          1/1     Running            1          19h
productcatalogservice-d564bdf4c-kqb27    1/1     Running            1          19h
recommendationservice-76598d5889-cdsxs   1/1     Running            1          19h
redis-cart-5f59546cdd-fzxdv              1/1     Running            2          19h
shippingservice-b6db65f7f-l9blv          1/1     Running            1          19h

Now, the log only returns:

++ curl --silent --output /dev/stderr --write-out '%{http_code}' http://frontend:80
+ STATUSCODE=000

How can I investigate what goes wrong here?


@ahmetb our cluster configuration looks like:

apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: https://192.168.76.101:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

Could flannel be the problem here (https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml)? What else could go wrong? :/

daniel-sanche commented 4 years ago

The reason loadgenerator and cartservice are crashing is because they're having trouble communicating with the redis-cart service. My guess is that it's due to your proxy - there may be some network setting in your environment that is preventing those services from communicating.

My advice would be to try to do some debugging with kubectl port-forward and kubectl exec to try to get to the bottom of it, but I'm not able to reproduce your issue, so I likely won't be much help here.
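A rough sketch of what such a debugging session might look like. The helper names and the pod name `net-debug` are hypothetical. The first helper assumes the `busybox` image can be pulled in this environment and that it provides `nslookup` and `nc`. The second assumes a Service named `redis-cart`, as suggested by the cartservice log. It uses a throwaway pod rather than `kubectl exec`, since the tools available inside the service images are not known here.

```shell
# Hypothetical helpers for the debugging suggested above.

# Resolve redis-cart and open a TCP connection to it from inside the cluster
# network, using a throwaway pod that is deleted afterwards (--rm).
probe_redis_from_cluster() {
    kubectl run net-debug --rm -i --restart=Never --image=busybox -- \
        sh -c 'nslookup redis-cart && nc -w 2 redis-cart 6379 </dev/null && echo reachable'
}

# Forward the Service to the local machine; blocks until interrupted.
# While it runs, redis-cart:6379 is available on localhost:6379.
forward_redis_locally() {
    kubectl port-forward service/redis-cart 6379:6379
}
```

If `probe_redis_from_cluster` fails at the `nslookup` step, cluster DNS is a likely suspect; if it fails at `nc`, pod-to-pod networking (for example the overlay network) deserves a closer look.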

ahmetb commented 4 years ago

Could flannel be the problem here (coreos/flannel:Documentation/kube-flannel.yml@master (raw))?

Yes, that's why we don't have bandwidth to support a custom setup. :) If it works on Minikube and GKE, it's likely a setup issue on your side, and I recommend you seek help in other channels.

ahmetb commented 4 years ago

Closing as we can't do much here.