lorelei-rupp-imprivata closed this issue 1 year ago
I was not able to replicate this on my local system using this config.
```yaml
exporters:
  awsxray:
    region: us-west-2
    local_mode: true
    no_verify_ssl: true
  awsemf:
    region: us-west-2
    dimension_rollup_option: NoDimensionRollup
    log_group_name: /aws/containerinsights/testtest/performance
    log_retention: 7
    log_stream_name: 'testNodeStreamName'
  logging:
    loglevel: debug
```
I'm wondering if the fact that it's deployed through a DaemonSet has something to do with it. @lorelei-rupp-imprivata, could you try reproducing this with a minimal config and a DaemonSet? If you can reproduce it again, can you share the DaemonSet manifest? I also wonder if something is getting messed up with the `{ClusterName}` substitution. I'm not familiar with this code and will loop in @humivo. Could you also try a fixed log group and log stream name for testing, @lorelei-rupp-imprivata?
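For context, a minimal sketch of the dynamic-name form versus the fixed-name form (the placeholder syntax below is an assumption based on the exporter's substitution feature; testing with the fixed form removes substitution as a variable):

```yaml
exporters:
  awsemf:
    # dynamic form: the exporter substitutes {ClusterName} at runtime
    # log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    # fixed form: no substitution involved, useful to isolate the bug
    log_group_name: /aws/containerinsights/testtest/performance
    log_stream_name: 'testNodeStreamName'
    log_retention: 7
```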
Let me try a fixed log_stream_name and a fixed log_group_name.
I don't think our config is that advanced either. I'm happy to provide our configmap if you want it.
I tested with fixed values and it still did not work:
```yaml
exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: /aws/containerinsights/testtest/performance
    log_retention: 7
    log_stream_name: 'testNodeStreamName'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      .....
```
Is it possible that another application is writing to/creating this log group before the collector does? I believe the default behavior is that if the log group already exists it won't overwrite the retention settings.
Can you share the configmap for the collector?
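For reference, one way to confirm whether a retention policy was ever applied is to inspect the log group directly; a quick check, assuming the AWS CLI is configured for the same account and region as the collector:

```sh
# If a retention policy is set, the output includes retentionInDays;
# if that field is absent, the console shows "Never Expire".
aws logs describe-log-groups \
  --log-group-name-prefix /aws/containerinsights/testtest/performance
```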
@bryan-aguilar we don't create the log group ourselves. I am testing by manually deleting it, and then it is automatically recreated. I even tested changing the name in the configmap to prove it was the configmap recreating it for us.
Here is our configmap:
```yaml
apiVersion: v1
data:
  collector.yaml: |
    exporters:
      awsemf:
        dimension_rollup_option: NoDimensionRollup
        log_group_name: /aws/containerinsights/testtest/performance
        log_retention: 7
        log_stream_name: testNodeStreamName
        metric_declarations:
        - dimensions:
          - - NodeName
            - InstanceId
            - ClusterName
          metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
        - dimensions:
          - - ClusterName
          metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
        - dimensions:
          - - FullPodName
            - Namespace
            - ClusterName
          - - Service
            - FullPodName
            - Namespace
            - ClusterName
          - - Service
            - Namespace
            - ClusterName
          - - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
        - dimensions:
          - - Service
            - FullPodName
            - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
        - dimensions:
          - - Service
            - FullPodName
            - Namespace
            - ClusterName
          metric_name_selectors:
          - pod_number_of_container_restarts
        - dimensions:
          - - ClusterName
          metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count
        - dimensions:
          - - Service
            - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - service_number_of_running_pods
        - dimensions:
          - - NodeName
            - InstanceId
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - node_filesystem_utilization
        - dimensions:
          - - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - namespace_number_of_running_pods
        namespace: ContainerInsights
        parse_json_encoded_attr_values:
        - Sources
        - kubernetes
        region: us-east-2
        resource_to_telemetry_conversion:
          enabled: true
      logging:
        loglevel: debug
      prometheusremotewrite:
        auth:
          authenticator: sigv4auth
        endpoint: xxxxx
        external_labels:
          astra_environment: xxxxx
          cluster_name: xxxxxeks
    extensions:
      health_check: {}
      memory_ballast:
        size_mib: 450
      pprof:
        endpoint: ${MY_POD_IP}:1777
      sigv4auth:
        assume_role:
          sts_region: us-east-2
      zpages:
        endpoint: ${MY_POD_IP}:55679
    processors:
      memory_limiter:
        check_interval: 2s
        limit_mib: 900
        spike_limit_mib: 100
      resource:
        attributes:
        - action: insert
          from_attribute: job
          key: TaskId
        - action: insert
          key: receiver
          value: prometheus
      resourcedetection/ec2:
        detectors:
        - env
        override: false
        timeout: 2s
    receivers:
      awscontainerinsightreceiver:
        add_full_pod_name_metric_label: true
        prefer_full_pod_name: true
      otlp:
        protocols:
          grpc:
            endpoint: ${MY_NODE_IP}:4317
      prometheus:
        config:
          global:
            scrape_interval: 1m
            scrape_timeout: 10s
          scrape_configs:
          - job_name: kube-state-metrics
            static_configs:
            - targets:
              - kube-state-metrics.kube-system.svc.cluster.local:8080
          - job_name: cluster-autoscaler
            static_configs:
            - targets:
              - cluster-autoscaler-aws-cluster-autoscaler.kube-system.svc.cluster.local:8085
          - job_name: eso
            kubernetes_sd_configs:
            - role: pod
            metrics_path: /metrics
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: keep
              regex: external-secrets
              source_labels:
              - __meta_kubernetes_pod_container_name
    service:
      extensions:
      - health_check
      - zpages
      - pprof
      - memory_ballast
      - sigv4auth
      pipelines:
        metrics:
          exporters:
          - awsemf
          processors:
          - memory_limiter
          - resourcedetection/ec2
          receivers:
          - awscontainerinsightreceiver
        metrics/2:
          exporters:
          - prometheusremotewrite
          processors:
          - memory_limiter
          - resourcedetection/ec2
          - resource
          receivers:
          - prometheus
          - otlp
kind: ConfigMap
```
Thank you for the information. I am trying to reproduce the issue on my side right now and will let you know the results.
Thanks! Happy to test out anything too on our end, just let me know
Does the IAM role that is associated with the collector, either through IRSA or the node, have the requisite permission `logs:PutRetentionPolicy`?
> Does the IAM role that is associated with the collector, either through IRSA or the node, have the requisite permission `logs:PutRetentionPolicy`?
Oh, good thinking - let me go check on that! We definitely operate least-privilege here, so it's highly likely it doesn't have it.
Yep, this must be it. We set:

```hcl
statement {
  sid = "AllowADOTPolicy"
  actions = [
    "aps:RemoteWrite",
    "aps:GetSeries",
    "aps:GetLabels",
    "aps:GetMetricMetadata",
    "ec2:DescribeTags",
    "ec2:DescribeVolumes",
    "logs:PutLogEvents",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:DescribeLogStreams",
    "logs:DescribeLogGroups"
  ]
  resources = ["*"]
}
```
That looks highly probable! Could you update the policy, retest, and let us know?
Yep, that was it, @bryan-aguilar! Maybe we should update the docs for this? https://aws-otel.github.io/docs/setup/permissions
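For anyone hitting the same issue, a minimal sketch of the fix to the Terraform statement above (only the added action matters; the rest of the policy is unchanged):

```hcl
statement {
  sid = "AllowADOTPolicy"
  actions = [
    # ...all of the existing actions listed above...
    "logs:PutRetentionPolicy" # lets the awsemf exporter apply log_retention
  ]
  resources = ["*"]
}
```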
I can update the docs to add this policy in the permission setup guide.
I have a PR in flight already.
Doc update PR has been made and I'm going to close this as resolved.
**Describe the bug**
This was fixed by https://github.com/aws-observability/aws-otel-collector/issues/991, but I am not seeing it actually work when trying to implement it.

**Steps to reproduce**
Rolled out 0.25.0 of the collector and updated my configmap to have something like the configuration shown above. Rolled this out, then manually deleted the log group in CloudWatch. It was recreated, however no log retention was set; it still says Never Expire.

**What did you expect to see?**
Log retention would be set on the new log group that was created.

Can provide more things if necessary. Maybe I am not implementing this properly.
@bryan-aguilar @humivo

We are using the DaemonSet to roll this out with the ADOT operator. I bumped up to image: 'amazon/aws-otel-collector:v0.25.0'.
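For completeness, this is roughly what the operator rollout looks like; a minimal sketch assuming the upstream OpenTelemetryCollector CRD, with hypothetical metadata values:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector   # hypothetical name
  namespace: monitoring  # hypothetical namespace
spec:
  mode: daemonset        # one collector pod per node, as described above
  image: amazon/aws-otel-collector:v0.25.0
  config: |
    # the collector.yaml contents from the ConfigMap above go here
```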