Open mukshe01 opened 7 months ago
Hello @mukshe01,
We don't have support for exporting/importing custom attributes.
Currently we are using a Lambda trigger in AWS to populate these custom attributes. This is causing an issue with the ServiceExportReconciler: it keeps re-registering instances in AWS Cloud Map because the endpoint/instance definition in Cloud Map differs from what the controller expects. Is there a way to stop this re-registration from happening?
The code compares the endpoints' IP/port with the Cloud Map instances. I don't believe changing the attributes will trigger re-registration.
Hi Akash,
Many thanks for your response,
We are trying to understand what change is triggering the re-registrations. In Kubernetes, the app exposed by the service hasn't restarted for the last 16 hours, which means the IP shouldn't have changed; the port 8080 also stays constant.
$ date
Thu Dec 7 09:55:51 UTC 2023
$ kubectl get po rest-api-develop-fc4d7778c-pqss5 -n participant1-develop -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rest-api-develop-fc4d7778c-pqss5 1/1 Running 0 16h 172.24.126.219 ip-172-24-126-165.ec2.internal
This is the Kubernetes service config.
$ kubectl describe svc rest-api-develop -n participant1-develop
Name: rest-api-develop
Namespace: participant1-develop
Labels: app.kubernetes.io/instance=rest-api-develop
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=rest-api
app.kubernetes.io/version=1.16.0
helm.sh/chart=rest-api-0.1.0
Annotations: meta.helm.sh/release-name: rest-api-develop
meta.helm.sh/release-namespace: participant1-develop
Selector: app.kubernetes.io/instance=rest-api-develop,app.kubernetes.io/name=rest-api
Type: NodePort
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.100.234.225
IPs: 10.100.234.225
Port: ra-main 8080/TCP
TargetPort: 8080/TCP
NodePort: ra-main 30679/TCP
Endpoints: 172.24.126.219:8080
Session Affinity: None
External Traffic Policy: Cluster
Events:
Below are logs of two occurrences of endpoint re-registration for namespace participant1-develop, service rest-api-develop. As you can see, it registers instances/endpoints to Cloud Map with the same IP (the IP did not change); ideally we expect it shouldn't update Cloud Map when there is no change. I have attached the full controller logs for your reference.
{"level":"info","ts":1701937389.3627253,"logger":"controllers.ServiceExportReconciler","msg":"updating Cloud Map service","namespace":"participant1-develop","name":"rest-api-develop"}
{"level":"info","ts":1701937389.3627617,"logger":"cloudmap.client","msg":"fetching a service","namespace":"participant1-develop","name":"rest-api-develop"}
{"level":"info","ts":1701937389.3628213,"logger":"cloudmap.client","msg":"registering endpoints","namespaceName":"participant1-develop","serviceName":"rest-api-develop","endpoints":[{"Id":"tcp-172_24_126_219-8080","IP":"172.24.126.219","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-develop","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1694697956000,"Ready":true,"Hostname":"","Nodename":"ip-172-24-126-165.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}}]}
Would you help us understand what change is triggering the update?
cloud-map-mcs-controller.log
Is there a way to increase the logging level and to change the timestamp format to human-readable? We are running the controller using the command kubectl apply -k "mcs-controller/controller_install_release" after downloading everything from https://github.com/aws/aws-cloud-map-mcs-controller-for-k8s/tree/main/config locally. We can change the file config/default/manager_auth_proxy_patch.yaml to add an additional arg if you tell us what it should be to increase tracing and change the time format.
Regards Shekhar
You can use this reference to customize the logging: https://sdk.operatorframework.io/docs/building-operators/golang/references/logging/
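For example, a patch along these lines adds the standard controller-runtime zap flags to the manager container. This is a sketch assuming the kustomize layout of config/default/manager_auth_proxy_patch.yaml; verify flag support against the controller-runtime version the controller is built with:

```yaml
# Hypothetical patch snippet for config/default/manager_auth_proxy_patch.yaml.
# --zap-log-level and --zap-time-encoding are standard controller-runtime flags.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  template:
    spec:
      containers:
      - name: manager
        args:
        - --zap-log-level=debug        # raise verbosity
        - --zap-time-encoding=rfc3339  # human-readable timestamps
```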
Hi,
Thank you for that. We changed the log level to debug (--zap-log-level debug) and captured the logs when re-registration of the endpoints happened. Looking at the logs it is not clear why this happened, as the endpoint app did not restart and the endpoint IPs did not change.
Here is the full log.
Log snippet when re-registering endpoints:
2023-12-08T16:35:56.058213089Z stderr F {"level":"debug","ts":1702053356.0581286,"logger":"controllers.ServiceExportReconciler","msg":"reconciling ServiceExport","Namespace":"participant1-develop","Name":{"namespace":"participant1-develop","name":"rest-api-develop"}}
2023-12-08T16:35:56.058251185Z stderr F {"level":"info","ts":1702053356.0581875,"logger":"controllers.ServiceExportReconciler","msg":"updating Cloud Map service","namespace":"participant1-develop","name":"rest-api-develop"}
2023-12-08T16:35:56.05827025Z stderr F {"level":"info","ts":1702053356.058198,"logger":"cloudmap.client","msg":"fetching a service","namespace":"participant1-develop","name":"rest-api-develop"}
2023-12-08T16:35:56.058335948Z stderr F {"level":"info","ts":1702053356.0582664,"logger":"cloudmap.client","msg":"registering endpoints","namespaceName":"participant1-develop","serviceName":"rest-api-develop","endpoints":[{"Id":"tcp-172_24_126_204-8080","IP":"172.24.126.204","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-develop","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1694697956000,"Ready":false,"Hostname":"","Nodename":"ip-172-24-126-165.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}},{"Id":"tcp-172_24_126_219-8080","IP":"172.24.126.219","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-develop","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1694697956000,"Ready":true,"Hostname":"","Nodename":"ip-172-24-126-165.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}}]}
2023-12-08T16:35:57.981590837Z stderr F {"level":"debug","ts":1702053357.9814687,"logger":"controllers.CloudmapReconciler","msg":"clusterProperties found","ClusterId":"1source-develop","ClusterSetId":"1source"}
2023-12-08T16:35:57.981632268Z stderr F {"level":"debug","ts":1702053357.9815843,"logger":"controllers.CloudmapReconciler","msg":"syncing namespace","namespace":"participant1-develop"}
2023-12-08T16:35:57.988809874Z stderr F {"level":"debug","ts":1702053357.98873,"logger":"controllers.CloudmapReconciler","msg":"syncing service","namespace":"participant1-develop","service":"rest-api-develop"}
2023-12-08T16:35:57.98900082Z stderr F {"level":"debug","ts":1702053357.9889314,"logger":"controllers.CloudmapReconciler","msg":"updating EndpointSlice","namespace":"participant1-develop","name":"imported-vonhveonrv-m6dsh"}
2023-12-08T16:35:57.997397891Z stderr F {"level":"debug","ts":1702053357.9972944,"logger":"controllers.ServiceExportReconciler","msg":"clusterProperties found","ClusterId":"1source-develop","ClusterSetId":"1source"}
2023-12-08T16:35:57.998135175Z stderr F {"level":"debug","ts":1702053357.998055,"logger":"controllers.CloudmapReconciler","msg":"syncing namespace","namespace":"participant2-develop"}
2023-12-08T16:35:58.002388689Z stderr F {"level":"debug","ts":1702053358.0023232,"logger":"controllers.CloudmapReconciler","msg":"syncing service","namespace":"participant2-develop","service":"rest-api-develop"}
We have 3 ServiceExport objects in Kubernetes; when re-registration happens, it happens for all 3 services.
We are seeing this re-registration behavior every few hours and we are unclear what is triggering it. Each occurrence results in a few seconds of service outage.
Would you help us prevent this from happening?
Thank you Regards
Shekhar
On reviewing the ServiceExport controller code, it seems we export the endpoints based on the k8s events on the Service and EndpointSlice. Updating metadata triggers those k8s events, and thus the registration gets overridden.
Hi Akash,
We compared the Service and EndpointSlice k8s object definitions for the affected endpoints before and after the occurrence of the problem and they are exactly identical; we do not see any metadata difference (command run: kubectl describe endpointslices rest-api-staging-xkqrz -n participant1-staging).
Here is an example EndpointSlice.
We also captured Kubernetes events for the last 24 hours for the namespace the service is in; we do not see any k8s event at the exact time the problem occurs. Are we missing anything?
On a side note, we created ServiceExport objects for 2 new services (just to test) within the same cluster yesterday. We did not see the problem occur for the new services; however, for the original service it occurred 8 to 10 times in the last 24 hours. One of the test services runs on the same k8s node (EC2) and the other runs on a different node. Interestingly, the only difference between the test services and the live service is the custom attributes we populate in Cloud Map for the instances using a Lambda.
Would you help us get to the bottom of this issue? Please let us know if you need any information from our side.
Regards Shekhar
The controller watches the Kubernetes Service and EndpointSlice resources. The code for that is abstracted out by the Operator SDK. Perhaps you can subscribe to the events and debug which attribute change is triggering reconciliation.
Hi Akash,
We captured all k8s events in the cluster for the last few hours and attached the events dump file, along with the MCS controller log.
mcs_controller_log_20231212.txt events_log_20231212.txt
As we see in the MCS controller log, the controller is re-registering instance tcp-172_24_125_249-8080 at 2023-12-12T11:45:55.
..."Name":{"namespace":"participant1-qar","name":"rest-api-qar"}}
2023-12-12T11:45:55.611488108Z stderr F {"level":"info","ts":1702381555.6114109,"logger":"controllers.ServiceExportReconciler","msg":"updating Cloud Map service","namespace":"participant1-qar","name":"rest-api-qar"}
2023-12-12T11:45:55.611498026Z stderr F {"level":"info","ts":1702381555.6114464,"logger":"cloudmap.client","msg":"fetching a service","namespace":"participant1-qar","name":"rest-api-qar"}
2023-12-12T11:45:55.617349345Z stderr F {"level":"info","ts":1702381555.6172726,"logger":"cloudmap.client","msg":"registering endpoints","namespaceName":"participant1-qar","serviceName":"rest-api-qar","endpoints":[{"Id":"tcp-172_24_125_249-8080","IP":"172.24.125.249","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-qar","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1700579083000,"Ready":true,"Hostname":"","Nodename":"ip-172-24-125-8.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}}]}
We do not see any k8s events at this time. We do see some events at 2023-12-12T11:40:42 (5 minutes before the issue) of type SuccessfullyReconciled for TargetGroupBinding:
...46","resourceVersion":"32067978","creationTimestamp":"2023-12-12T11:40:42Z","managedFields":[{"manager":"controller","operation":"Update","apiVersion":"v1","time":"2023-12-12T11:40:42Z"}]},"involvedObject":{"kind":"TargetGroupBinding","namespace":"participant3-qar","name":"rest-api-qar","uid":"ac1dd5d8-ae3c-4023-a087-8b7ce140f3ad","apiVersion":"elbv2.k8s.aws/v1beta1","resourceVersion":"24208797"},"reason":"SuccessfullyReconciled","message":"Successfully reconciled","source":{"component":"targetGroupBinding"},"firstTimestamp":"2023-12-11T06:47:40Z","lastTimestamp":"2023-12-12T11:40:42Z","count":28,"type":"Normal","eventTime":null,"reportingComponent":"","reportingInstance":""}}
We can confirm these events don't update the Service/EndpointSlice metadata, as we captured and compared the Service/EndpointSlice k8s objects before and after the issue and they are identical. Also, these k8s events are generated exactly once every hour, and we are not seeing the issue happen every hour.
Would you be able to help us identify which event is causing the controller to re-register the endpoints?
Hi,
Just to update on this,
We are still not sure what is triggering this change; however, we are sure that the issue only occurs for instances to which we add custom attributes in AWS Cloud Map.
We have 4 similar applications running on the same k8s node. The issue happens to 3 of the instances (around the same time); for 1 it doesn't. That 1 instance is the one for which we are not altering the AWS Cloud Map attributes. When we added a custom attribute to this instance in Cloud Map, the issue started happening for it too.
Regards Shekhar.
Hello,
There is no code logic within the controller that triggers the re-registration. The controller subscribes to Kubernetes update events for the kinds ServiceExport, Service, EndpointSlice and ClusterProperty. Unfortunately we don't have enough observability to say what kind of update is triggering an event.
You can choose to fork the controller and implement the attributes reconciliation. The way to approach this would be to add the instance attributes as ServiceExport object annotations:
kind: ServiceExport
apiVersion: multicluster.x-k8s.io/v1alpha1
metadata:
  namespace: example
  name: my-service
  annotations:
    attr1: val1
    attr2: val2
And then update the serviceexport_controller.go to export the annotations as attributes.
You can push the change as a PR, and we can review the code.
And a disclaimer: the MCS controller is in alpha stage and should be used in production systems with caution.
Hi,
We are running v0.3.1 of the aws-cloud-map-mcs-controller to register services in AWS Cloud Map. We have another application that looks up Cloud Map for service discovery. We have a requirement to populate some custom attributes for the Cloud Map instances.
Is there a way to define custom attributes in the ServiceExport definition of the Kubernetes object? Currently it is defined like this:
kind: ServiceExport
apiVersion: multicluster.x-k8s.io/v1alpha1
metadata:
  namespace: namespace1
  name: service-name
Currently we are using a Lambda trigger in AWS to populate these custom attributes. This is causing an issue with the ServiceExportReconciler: it keeps re-registering instances in AWS Cloud Map because the endpoint/instance definition in Cloud Map differs from what the controller expects. Is there a way to stop this re-registration from happening?
Thank you.