Closed: lifeofmoo closed this issue 7 months ago
This is very telling!
If I tail the logs of the collector receiver pod which is deployed in the opentelemetry NS I see this.
2024-04-02T09:22:05.348Z warn batchprocessor@v0.94.1/batch_processor.go:258 Sender failed {"kind": "processor", "name": "batch/traces", "pipeline": "traces", "error": "Permanent error: AccessDeniedException: User: arn:aws:sts::12345678910:assumed-role/KarpenterNodeRole-eksdev/i-0741c2f458623f36d is not authorized to perform: xray:PutTraceSegments because no identity-based policy allows the xray:PutTraceSegments action\n\tstatus code: 403, request id: 5fea62ce-68d8-4fc9-af83-32ac7bd93366"}
This used to work prior to the migration BECAUSE the adot-operator IAM role and k8s ServiceAccount were created and then referenced in the collector deployment. However, the latest docs say that the IAM role (adot-col-otlp-ingest) is supposed to be ROLE ONLY (i.e. do not create the associated k8s ServiceAccount), which I've already highlighted above.
Are you deploying your own OpenTelemetryCollector custom resource or using the preconfigured adot-otlp-ingest collector? Can you share your advanced configuration?
If you are deploying your own OpenTelemetryCollector and not using the preconfigured one then you need to create the service account also. The migration guide is only for users who were using the preconfigured collector deployments available through the advanced configuration.
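To illustrate (a rough sketch with placeholder names, not a definitive manifest): the ServiceAccount carries the IRSA role annotation, and the collector custom resource references it by name.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: adot-collector                      # placeholder name
  namespace: opentelemetry
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/my-collector-role   # placeholder ARN
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: collector-xray
  namespace: opentelemetry
spec:
  serviceAccount: adot-collector            # must match the ServiceAccount name above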
Hi @bryan-aguilar, good to hear from you again!
I get what you're saying about having to create the SA for my use case, as that was my hunch all along.
I originally approached this migration as a fresh install.
However, when that didn't work I also applied my own OpenTelemetryCollector in the opentelemetry namespace. You can see the OpenTelemetryCollector in the original post. All I've done is comment out the ServiceAccount, as this doesn't exist at the moment (following the fresh-install approach).
What would it take to get this working as a fresh install (without my custom OpenTelemetryCollector)? I believe I am passing the IAM role correctly in the JSON file during the ADOT installation.
Could you share your v0.88.0 ADOT EKS Add-on advanced configuration? That would give me more insight into what is required for the migration.
yep, it's all in the original post.
I completely uninstalled the old ADOT add-on and installed the latest version using this command and config JSON file in order to annotate the install.
aws eks create-addon \
--cluster-name eksdev \
--addon-name adot \
--configuration-values file://configuration-values.json \
--resolve-conflicts=OVERWRITE
Below are the contents of the configuration-values.json file.
{
  "collector": {
    "otlpIngest": {
      "serviceAccount": {
        "annotations": {
          "eks.amazonaws.com/role-arn": "arn:aws:iam::123456789101:role/adot-col-otlp-ingest"
        }
      }
    }
  }
}
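To double-check that those values actually landed, I believe they can be read back from the add-on with something like:
aws eks describe-addon \
  --cluster-name eksdev \
  --addon-name adot \
  --query 'addon.configurationValues' \
  --output text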
Ahh, I apologize. Can I see the advanced configuration you used before trying to migrate to v0.88.0?
These were the full steps I used to get this working before v0.88.0.
The original IAM role was created via eksctl for the opentelemetry NS:
eksctl create iamserviceaccount \
--name adot-collector \
--namespace opentelemetry \
--cluster eksdev \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
--attach-policy-arn arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--approve \
--override-existing-serviceaccounts
The Add-on was installed using this command - I didn't have to apply any advanced config for the install.
aws eks create-addon \
--addon-name adot \
--cluster-name eksstg \
--addon-version v0.82.0-eksbuild.1 \
--service-account-role-arn arn:aws:iam::12345678910:role/eksctl-eksdev-addon-iamserviceaccount-opente-Role1-xxxxxxxx \
--resolve-conflicts Overwrite
kubectl get all -n opentelemetry-operator-system
NAME READY STATUS RESTARTS AGE
pod/opentelemetry-operator-5988dc7cd5-26p5x 2/2 Running 0 5m49s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/opentelemetry-operator ClusterIP 10.100.247.55 <none> 8443/TCP,8080/TCP 5m51s
service/opentelemetry-operator-webhook ClusterIP 10.100.216.94 <none> 443/TCP 5m51s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/opentelemetry-operator 1/1 1 1 5m51s
The original adot-collector.yaml has serviceAccount: adot-collector uncommented.
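For reference, the relevant part of that file (trimmed; same shape as the full custom resource I share later in this thread) looks roughly like this:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: collector-xray
spec:
  mode: deployment
  serviceAccount: adot-collector   # the eksctl-created SA, uncommented here
  config: |
    # receivers/processors/exporters as shown in the full manifest below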
kubectl apply -f adot-collector.yaml -n opentelemetry
kubectl get all -n opentelemetry
NAME READY STATUS RESTARTS AGE
pod/collector-xray-collector-6555869687-r4qbz 1/1 Running 0 58s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/collector-xray-collector ClusterIP 10.100.239.143 <none> 4317/TCP,4318/TCP 58s
service/collector-xray-collector-headless ClusterIP None <none> 4317/TCP,4318/TCP 58s
service/collector-xray-collector-monitoring ClusterIP 10.100.35.133 <none> 8888/TCP 58s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/collector-xray-collector 1/1 1 1 58s
NAME DESIRED CURRENT READY AGE
replicaset.apps/collector-xray-collector-6555869687 1 1 1 58s
To recap: the ADOT add-on is installed in the opentelemetry-operator-system NS and the collector was applied in the opentelemetry NS. This worked perfectly.
Since you were not using the advanced configuration the migration guide does not apply to you. You should be able to follow the same steps listed above and receive the same results for versions >= v0.88.0.
The roles referenced in the migration guide are for users who are using the ADOT EKS add-ons preconfigured collector deployments such as OTLP Ingest ADOT Collector. In your case you are deploying and managing your own OpenTelemetry Collector custom resource so you will have to manage the service account also.
Ok, I'll give that a go tomorrow.
I'm still struggling to get this to work when using the auto-instrumentation docs, which also used to work before the migration.
I've re-created the IAM role with a ServiceAccount and deployed my custom collector, which now references this ServiceAccount, but apps which have been auto-instrumented are not appearing in X-Ray.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: learning-auth-api-instrumentation
spec:
  exporter:
    endpoint: http://collector-xray-collector.opentelemetry:4317
  java:
    image: public.ecr.aws/aws-observability/adot-autoinstrumentation-java:v1.32.1
apiVersion: apps/v1
kind: Deployment
metadata:
  # namespace: nextjs
  name: learning-auth-api
spec:
  template:
    spec:
      containers:
        - name: learning-auth-api
          env:
            - name: AWS_REGION
              value: eu-west-1
            - name: CLUSTER_NAME
              value: eksdev
            - name: LISTEN_ADDRESS
              value: 0.0.0.0:8080
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://collector-xray-collector.opentelemetry:4317
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: service.namespace=learning,service.name=learning-auth-api
            - name: OTEL_SERVICE_NAME
              value: learning-auth-api
            - name: OTEL_TRACES_EXPORTER
              value: otlp
            - name: OTEL_METRICS_EXPORTER
              value: none
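For completeness: auto-instrumentation also needs the operator's inject annotation, either on the pod template or at the namespace level (not shown in the manifest above); roughly:
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"   # tells the operator to inject the Java agent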
This is what I ran:
eksctl create iamserviceaccount \
--name adot-col-otlp-ingest \
--namespace opentelemetry \
--role-name adot-col-otlp-ingest \
--cluster eksdev \
--attach-policy-arn arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess \
--tags CostCentre=operations \
--approve
k get sa -n opentelemetry
NAME SECRETS AGE
adot-col-otlp-ingest 0 16m
k describe sa -n opentelemetry adot-col-otlp-ingest
Name: adot-col-otlp-ingest
Namespace: opentelemetry
Labels: app.kubernetes.io/managed-by=eksctl
Annotations: eks.amazonaws.com/role-arn: arn:aws:iam::12345678910:role/adot-col-otlp-ingest
ADOT is still installed in the opentelemetry-operator-system NS
My custom receiver (below) is deployed in the opentelemetry NS.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: collector-xray
spec:
  mode: deployment
  resources:
    requests:
      cpu: "1"
    limits:
      cpu: "1"
  serviceAccount: adot-col-otlp-ingest
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch/traces:
        timeout: 1s
        send_batch_size: 50
      resourcedetection/eks:
        detectors: [env, eks]
        timeout: 2s
        override: false
    exporters:
      awsxray:
        region: eu-west-1
        index_all_attributes: true
      logging:
        loglevel: debug
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [resourcedetection/eks, batch/traces]
          exporters: [awsxray]
      telemetry:
        logs:
          level: debug
Going right back to the beginning of this ticket: I was more than happy to do a complete fresh install and go via the recommended route. I keep reviewing the docs and it feels like my previous setup is deprecated and not worth hanging onto.
However, even a complete fresh install didn't work, so I'm at a bit of a loss now.
Can you double check to make sure everything is being installed into the namespace you intend them to? For example, I see in the above examples that you have installed the service account and OpenTelemetryCollector into different namespaces.
This may be a copy/paste omission but I just need to make sure.
If everything is indeed installed in the correct namespaces then I would start by looking at the instrumented application logs and the collector logs. If neither of them has errors, then I would suggest adding the logging exporter to your trace pipeline to verify that the collector is receiving trace data.
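A minimal sketch of that pipeline change, reusing the logging exporter you already define in your config:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection/eks, batch/traces]
      exporters: [awsxray, logging]   # logging prints received spans to the collector's stdout
If spans show up in the collector's stdout but not in X-Ray, the problem is on the export/IAM side; if they don't show up at all, the problem is upstream of the collector.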
The different namespaces were a deliberate thing. I've re-created the ADOT collector and the IAM role/SA in the opentelemetry-operator-system NS.
I changed my Java app's exporter endpoint from the old one to this:
http://adot-col-otlp-ingest-collector:4317
The old endpoint, as per my custom collector, was: http://collector-xray-collector.opentelemetry:4317
Note the original custom endpoint has an additional opentelemetry namespace segment in the URL.
This is how my Java app is being auto-instrumented.
These are the logs I see in the pod.
{"level":"info","ts":"2024-04-09T08:23:00Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"learning","name":""}
{"level":"info","ts":"2024-04-09T08:23:02Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"learning","name":""}
I see this error when I redeploy the same app and traffic generator
[otel.javaagent 2024-04-09 08:54:57:063 +0000] [OkHttp http://adot-col-otlp-ingest-collector:4317/...] ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export metrics. The request could not be executed. Full error message: adot-col-otlp-ingest-collector
Should I be worried that the "last activity" hasn't registered yet?
I've deployed the add-on to make sure the X-Ray pipelines are enabled.
kubectl get all -n opentelemetry-operator-system
NAME READY STATUS RESTARTS AGE
pod/adot-col-otlp-ingest-collector-7d76b567ff-msc2h 1/1 Running 0 119m
pod/opentelemetry-operator-85d8596db5-hwbdq 2/2 Running 0 122m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/adot-col-otlp-ingest-collector ClusterIP 10.100.60.53 <none> 4317/TCP,4318/TCP 119m
service/adot-col-otlp-ingest-collector-headless ClusterIP None <none> 4317/TCP,4318/TCP 119m
service/adot-col-otlp-ingest-collector-monitoring ClusterIP 10.100.204.39 <none> 8888/TCP 119m
service/opentelemetry-operator ClusterIP 10.100.228.26 <none> 8443/TCP,8080/TCP 122m
service/opentelemetry-operator-webhook ClusterIP 10.100.7.175 <none> 443/TCP 122m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/adot-col-otlp-ingest-collector 1/1 1 1 119m
deployment.apps/opentelemetry-operator 1/1 1 1 122m
NAME DESIRED CURRENT READY AGE
replicaset.apps/adot-col-otlp-ingest-collector-7d76b567ff 1 1 1 119m
replicaset.apps/opentelemetry-operator-85d8596db5 1 1 1 122m
k logs pod/adot-col-otlp-ingest-collector-7d76b567ff-msc2h
2024/04/09 09:11:07 ADOT Collector version: v0.38.1
2024/04/09 09:11:07 found no extra config, skip it, err: open /opt/aws/aws-otel-collector/etc/extracfg.txt: no such file or directory
2024-04-09T09:11:07.288Z info service@v0.94.1/telemetry.go:59 Setting up own telemetry...
2024-04-09T09:11:07.288Z info service@v0.94.1/telemetry.go:104 Serving metrics {"address": ":8888", "level": "Basic"}
2024-04-09T09:11:07.290Z info service@v0.94.1/service.go:140 Starting aws-otel-collector... {"Version": "v0.38.1", "NumCPU": 8}
2024-04-09T09:11:07.290Z info extensions/extensions.go:34 Starting extensions...
2024-04-09T09:11:07.290Z warn internal@v0.94.1/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning. {"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-04-09T09:11:07.291Z info otlpreceiver@v0.94.1/otlp.go:102 Starting GRPC server {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4317"}
2024-04-09T09:11:07.291Z warn internal@v0.94.1/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning. {"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-04-09T09:11:07.291Z info otlpreceiver@v0.94.1/otlp.go:152 Starting HTTP server {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}
2024-04-09T09:11:07.291Z info service@v0.94.1/service.go:166 Everything is ready. Begin running and processing data.
2024-04-09T09:11:07.291Z warn localhostgate/featuregate.go:63 The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
Feels so close, as I'm now seeing both pods in a single NS.
As mentioned earlier you shouldn't need to use the migration guide at all if you were not using any advanced configuration parameters pre v0.88.0. I see you have now enabled the otlp ingest collector in the advanced configuration though.
I think http://adot-col-otlp-ingest-collector:4317 is a mistake in the migration guide. It's missing the namespace and should instead be http://adot-col-otlp-ingest-collector.opentelemetry-operator-system:4317
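For context, this is just standard Kubernetes Service DNS rather than anything ADOT-specific: the bare service name only resolves from pods in the same namespace, so from another namespace you need the service.namespace form (or the fully qualified name). Roughly:
# works only from pods in opentelemetry-operator-system:
http://adot-col-otlp-ingest-collector:4317
# works from any namespace:
http://adot-col-otlp-ingest-collector.opentelemetry-operator-system:4317
# fully qualified:
http://adot-col-otlp-ingest-collector.opentelemetry-operator-system.svc.cluster.local:4317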
I've changed the endpoint to:
http://adot-col-otlp-ingest-collector.opentelemetry-operator-system:4317
Still see this message when I delete/recreate the apps:
{"level":"info","ts":"2024-04-09T16:32:28Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"learning","name":"learning-auth-api-56466c5b6c-7lzsv"}
{"level":"info","ts":"2024-04-09T16:33:24Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"learning","name":""}
{"level":"info","ts":"2024-04-09T16:40:02Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"learning","name":""}
{"level":"info","ts":"2024-04-09T16:42:00Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"etocs","name":""}
{"level":"info","ts":"2024-04-09T16:42:01Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"etocs","name":""}
{"level":"info","ts":"2024-04-09T16:42:01Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"etocs","name":""}
k get instrumentations.opentelemetry.io -A
NAMESPACE NAME AGE ENDPOINT SAMPLER SAMPLER ARG
etocs etocs-generator-frontend-instrumentation 22s http://adot-col-otlp-ingest-collector.opentelemetry-operator-system:4317
learning learning-auth-api-instrumentation 2m20s http://adot-col-otlp-ingest-collector.opentelemetry-operator-system:4317
Are you receiving export errors anymore within the application after changing the endpoint? Such as this?
[otel.javaagent 2024-04-09 08:54:57:063 +0000] [OkHttp http://adot-col-otlp-ingest-collector:4317/...] ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export metrics. The request could not be executed. Full error message: adot-col-otlp-ingest-collector
These logs
{"level":"info","ts":"2024-04-09T16:32:28Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"learning","name":"learning-auth-api-56466c5b6c-7lzsv"}
{"level":"info","ts":"2024-04-09T16:33:24Z","msg":"Skipping pod instrumentation - already instrumented","namespace":"learning","name":""}
are not errors but just informing you that it is not trying to re-instrument the pod because it has detected it has already been instrumented.
Very interesting!
This works when I deploy these files to a new NS called otel.
I think I am able to get auto-instrumentation working for Java apps on two different clusters, using the default (preconfigured) collector in the dev cluster and my original custom one in the stg cluster.
I may revert entirely to the OpenTelemetryCollector custom resource as I like the ability to add additional config. Now I have to replicate this in our 3rd cluster to be sure I have a set of reproducible patterns.
Question for you please:
If I want to be able to use configs like:
exporters:
  awsxray:
    region: eu-west-1
    index_all_attributes: true   < THIS
  logging:
    loglevel: debug
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection/eks, batch/traces]   < THIS
      exporters: [awsxray]
do I have to use my own OpenTelemetry Collector custom resource?
do I have to use my own OpenTelemetry Collector custom resource?
yes
Does having the amazon-cloudwatch-observability add-on conflict with the ADOT add-on?
I've got amazon-cloudwatch-observability installed on the dev cluster, YET the same application, which is also deployed in the stg cluster, reports into X-Ray fine from stg. However, it doesn't for the same app in dev.
This is even though the deployments are the same, and I've confirmed that the sample app (which I've posted about before) works fine on all 3 clusters (dev, stg, live).
[otel.javaagent 2024-04-11 13:23:07:802 +0000] [OkHttp http://cloudwatch-agent.amazon-cloudwatch:4315/...] ERROR io.opentelemetry.exporter.internal.grpc.GrpcExporter - Failed to export spans. Server responded with UNIMPLEMENTED. This usually means that your collector is not configured with an otlp receiver in the "pipelines" section of the configuration. If export is not desired and you are using OpenTelemetry autoconfiguration or the javaagent, disable export by setting OTEL_TRACES_EXPORTER=none. Full error message: unknown service opentelemetry.proto.collector.trace.v1.TraceService
It looks like a bunch of CW values "pollute" my manifests for the app in Dev.
Yes, there is a conflict between the two add-ons when trying to use auto-instrumentation. Some of that is mentioned in the compatibility docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Application-Signals-compatibility.html
What has happened is that you have installed the CW observability add-on and then, by enabling auto-instrumentation injection into your workload, you have opted into Application Signals.
Thanks, that's useful to be aware of. How on earth did this all work before the migration then? I've had amazon-cloudwatch-observability installed for a few months now.
I also can't work out what I need to do within my k8s manifests to get auto instrumentation to work again.
What needs commenting out / adding?
I think the incompatibility was introduced in newer versions of the observability add-on and the ADOT Java agent. For the time being there is no way to stop the observability add-on from mutating the workload environment when you have the auto-instrumentation annotation enabled.
This means your workload's environment will continue to be populated with the environment variables that are breaking your use case. I believe the only way to stop this would be to uninstall the observability add-on.
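If you go that route, removing it would look roughly like this (a sketch; assuming it was installed as an EKS add-on on a cluster named eksdev):
aws eks delete-addon \
  --cluster-name eksdev \
  --addon-name amazon-cloudwatch-observability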
Uninstalling the observability add-on has done the trick. Having OTel is our priority at the moment as we're looking to move away from AppDynamics, so this unblocks our migration.
Do you know if there is a plan/roadmap to have these add-ons work together with the way I've done auto-instrumentation?
I appreciate it's a very fast-moving subject and breaking changes/functionality are expected.
I have brought up the issue with the observability add-on team but don't have any additional information to share yet.
Thanks for all your help on this @bryan-aguilar. Have a good weekend!
Hello,
Trying to resolve X-Ray traces from EKS after the ADOT migration.
I've had OTel running well for nearly a year in our EKS clusters; all was well until I tried the migration!
The old version docs said to do the following (high level):
The migration docs say the IAM role should now be split into 2 roles, which is fine. However, the command to create the new roles, in my case the OTel ingest one (adot-col-otlp-ingest), passes the --role-only flag, i.e. it does not create the k8s ServiceAccount.
By default the service account will be created or updated to include the role annotation, this can be disabled using the flag --role-only.
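For reference, this is the shape of the role-only command I mean (a sketch with my values; the namespace is an assumption):
eksctl create iamserviceaccount \
  --name adot-col-otlp-ingest \
  --namespace opentelemetry-operator-system \
  --role-name adot-col-otlp-ingest \
  --cluster eksdev \
  --attach-policy-arn arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess \
  --role-only \
  --approve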
The docs also go on to say:
This IAM role generated by the above command needs to be inserted into the annotations field of the advanced configuration as seen below:
I completely uninstalled the old ADOT add-on and installed the latest version using this command and config JSON file in order to annotate the install.
I'm assuming that I've done the above correctly, so I proceeded to re-apply the same X-Ray collector receiver. All I've done now is comment out the ServiceAccount line and apply this in a namespace called opentelemetry (as before).
However, I fail to see existing services which were happily reporting before.
I have seen this but I can't make sense of it.
Even if I deploy the sample app and traffic generator pointing to adot-col-otlp-ingest-collector, I don't see anything. sample-app.txt traffic-generator.txt