beegmon closed this issue 1 year ago
It looks like this issue may have been fixed in contrib. I am curious when these changes will be pulled into the AWS OTel agent.
If these changes were recently fixed upstream, you can expect them to be pulled into the ADOT Collector release v0.18.0.
Closing as addressed earlier in the year.
I'm getting this issue now. Is this really fixed? Here's my otel-config with \n replaced by newlines for readability. Note that the \" below are there because this was originally a quoted string in the template; I left them in to keep the changes minimal.
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
        - job_name: \"appmesh-envoy\"
          sample_limit: 10000
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ['0.0.0.0:9901']
  awsecscontainermetrics:
    collection_interval: 15s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.memory.usage
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.cpu.usage.vcpu
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsxray:
    region: us-east-1
  prometheusremotewrite:
    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write
    auth:
      authenticator: sigv4auth
  awsemf:
    namespace: ECS/AWSOtel/Application
    log_group_name: '/ecs/application/metrics/{ClusterName}'
    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_(total|xx)$\"
          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"
          - \"^envoy_cluster_membership_(healthy|total)$\"
          - \"^envoy_server_memory_(allocated|heap_size)$\"
          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"
          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"
          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"
          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"
          - \"^envoy_cluster_upstream_rq_retry$\"
          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"
          - \"^envoy_server_(version|uptime|live)$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_xx$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/envoy:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter, batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    traces:
      receivers: [otlp, awsxray]
      processors: [batch/traces]
      exporters: [awsxray]
Here's the way it is in my template:
Value: !Sub "receivers: \n prometheus:\n config:\n global:\n scrape_interval: 1m\n scrape_timeout: 10s\n scrape_configs:\n - job_name: \"appmesh-envoy\"\n sample_limit: 10000\n metrics_path: /stats/prometheus\n static_configs:\n - targets: ['0.0.0.0:9901']\n awsecscontainermetrics:\n collection_interval: 15s\n otlp:\n protocols:\n grpc:\n endpoint: 0.0.0.0:4317\n http:\n endpoint: 0.0.0.0:55681\n awsxray:\n endpoint: 0.0.0.0:2000\n transport: udp\n statsd:\n endpoint: 0.0.0.0:8125\n aggregation_interval: 60s\nprocessors:\n batch/traces:\n timeout: 1s\n send_batch_size: 50\n batch/metrics:\n timeout: 60s\n filter:\n metrics:\n include:\n match_type: strict\n metric_names:\n - ecs.task.memory.utilized\n - ecs.task.memory.reserved\n - ecs.task.memory.usage\n - ecs.task.cpu.utilized\n - ecs.task.cpu.reserved\n - ecs.task.cpu.usage.vcpu\n - ecs.task.network.rate.rx\n - ecs.task.network.rate.tx\n - ecs.task.storage.read_bytes\n - ecs.task.storage.write_bytes\nexporters:\n awsxray:\n region: us-east-1\n prometheusremotewrite:\n endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write\n auth:\n authenticator: sigv4auth\n awsemf:\n namespace: ECS/AWSOtel/Application\n log_group_name: '/ecs/application/metrics/{ClusterName}'\n log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'\n resource_to_telemetry_conversion:\n enabled: true\n dimension_rollup_option: NoDimensionRollup\n metric_declarations:\n - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]\n metric_name_selectors:\n - \"^envoy_http_downstream_rq_(total|xx)$\"\n - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"\n - \"^envoy_cluster_membership_(healthy|total)$\"\n - \"^envoy_server_memory_(allocated|heap_size)$\"\n - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"\n - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"\n - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"\n - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"\n - \"^envoy_cluster_upstream_rq_retry$\"\n - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"\n - \"^envoy_server_(version|uptime|live)$\"\n label_matchers:\n - label_names:\n - container_name\n regex: ^envoy$\n - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]\n metric_name_selectors:\n - \"^envoy_http_downstream_rq_xx$\"\n label_matchers:\n - label_names:\n - container_name\n regex: ^envoy$\n logging:\n loglevel: debug\nextensions:\n health_check:\n pprof:\n endpoint: :1888\n zpages:\n endpoint: :55679\nservice:\n extensions: [pprof, zpages, health_check]\n pipelines:\n metrics:\n receivers: [otlp, statsd]\n processors: [batch/metrics]\n exporters: [logging, prometheusremotewrite, awsemf]\n metrics/envoy:\n receivers: [prometheus]\n processors: [batch/metrics]\n exporters: [logging, prometheusremotewrite, awsemf]\n metrics/ecs:\n receivers: [awsecscontainermetrics]\n processors: [filter, batch/metrics]\n exporters: [logging, prometheusremotewrite, awsemf]\n traces:\n receivers: [otlp, awsxray]\n processors: [batch/traces]\n exporters: [awsxray]\n"
Here's the error with the code and memory address:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25529d6]
Any advice would be greatly appreciated.
Thanks @davetbo-amzn for reaching out with the above details, much appreciated. Reopening so we can review and update.
@davetbo-amzn can you please include the full stack trace that followed the panic?
Here you go. This was the entirety of the fargate/otel/otel-collector* log for this run from CloudWatch. Thanks for taking a look!
2023/02/16 14:46:24 ADOT Collector version: v0.26.1
2023/02/16 14:46:24 found no extra config, skip it, err: open /opt/aws/aws-otel-collector/etc/extracfg.txt: no such file or directory
2023/02/16 14:46:24 Reading AOT config from environment: receivers:
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "appmesh-envoy"
          sample_limit: 10000
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ['0.0.0.0:9901']
  awsecscontainermetrics:
    collection_interval: 15s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.memory.usage
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.cpu.usage.vcpu
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsxray:
    region: us-east-1
  prometheusremotewrite:
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-f87b6940-bfcc-4ec4-b9b1-325189711ad5/api/v1/remote_write
    auth:
      authenticator: sigv4auth
  awsemf:
    namespace: ECS/AWSOtel/Application
    log_group_name: '/ecs/application/metrics/{ClusterName}'
    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
        metric_name_selectors:
          - "^envoy_http_downstream_rq_(total|xx)$"
          - "^envoy_cluster_upstream_cx_(r|t)x_bytes_total$"
          - "^envoy_cluster_membership_(healthy|total)$"
          - "^envoy_server_memory_(allocated|heap_size)$"
          - "^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$"
          - "^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$"
          - "^envoy_http_downstream_cx_destroy_remote_active_rq$"
          - "^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$"
          - "^envoy_cluster_upstream_rq_retry$"
          - "^envoy_cluster_upstream_rq_retry_(success|overflow)$"
          - "^envoy_server_(version|uptime|live)$"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
        metric_name_selectors:
          - "^envoy_http_downstream_rq_xx$"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/envoy:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter, batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    traces:
      receivers: [otlp, awsxray]
      processors: [batch/traces]
      exporters: [awsxray]
2023-02-16T14:46:24.621Z info service/telemetry.go:90 Setting up own telemetry...
2023-02-16T14:46:24.621Z info service/telemetry.go:116 Serving Prometheus metrics { "address": ":8888", "level": "Basic" }
2023-02-16T14:46:24.630Z info exporter/exporter.go:290 Development component. May change in the future. { "kind": "exporter", "data_type": "metrics", "name": "logging" }
2023-02-16T14:46:24.630Z warn loggingexporter@v0.70.0/factory.go:109 'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior. { "kind": "exporter", "data_type": "metrics", "name": "logging", "loglevel": "debug", "equivalent verbosity level": "Detailed" }
2023-02-16T14:46:24.637Z info filterprocessor@v0.70.0/metrics.go:97 Metric filter configured { "kind": "processor", "name": "filter", "pipeline": "metrics/ecs", "include match_type": "strict", "include expressions": [], "include metric names": [ "ecs.task.memory.utilized", "ecs.task.memory.reserved", "ecs.task.memory.usage", "ecs.task.cpu.utilized", "ecs.task.cpu.reserved", "ecs.task.cpu.usage.vcpu", "ecs.task.network.rate.rx", "ecs.task.network.rate.tx", "ecs.task.storage.read_bytes", "ecs.task.storage.write_bytes" ], "include metrics with resource attributes": null, "exclude match_type": "", "exclude expressions": [], "exclude metric names": [], "exclude metrics with resource attributes": null }
2023-02-16T14:46:24.638Z info awsxrayreceiver@v0.70.0/receiver.go:58 Going to listen on endpoint for X-Ray segments { "kind": "receiver", "name": "awsxray", "pipeline": "traces", "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.638Z info udppoller/poller.go:106 Listening on endpoint for X-Ray segments { "kind": "receiver", "name": "awsxray", "pipeline": "traces", "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.638Z info awsxrayreceiver@v0.70.0/receiver.go:69 Listening on endpoint for X-Ray segments { "kind": "receiver", "name": "awsxray", "pipeline": "traces", "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.640Z info service/service.go:128 Starting aws-otel-collector... { "Version": "v0.26.1", "NumCPU": 2 }
2023-02-16T14:46:24.640Z info extensions/extensions.go:41 Starting extensions...
2023-02-16T14:46:24.640Z info extensions/extensions.go:44 Extension is starting... { "kind": "extension", "name": "zpages" }
2023-02-16T14:46:24.640Z info zpagesextension@v0.70.0/zpagesextension.go:64 Registered zPages span processor on tracer provider { "kind": "extension", "name": "zpages" }
2023-02-16T14:46:24.640Z info zpagesextension@v0.70.0/zpagesextension.go:74 Registered Host's zPages { "kind": "extension", "name": "zpages" }
2023-02-16T14:46:24.640Z info zpagesextension@v0.70.0/zpagesextension.go:86 Starting zPages extension { "kind": "extension", "name": "zpages", "config": { "TCPAddr": { "Endpoint": ":55679" } } }
2023-02-16T14:46:24.640Z info extensions/extensions.go:48 Extension started. { "kind": "extension", "name": "zpages" }
2023-02-16T14:46:24.640Z info extensions/extensions.go:44 Extension is starting... { "kind": "extension", "name": "health_check" }
2023-02-16T14:46:24.640Z info healthcheckextension@v0.70.0/healthcheckextension.go:45 Starting health_check extension { "kind": "extension", "name": "health_check", "config": { "Endpoint": "0.0.0.0:13133", "TLSSetting": null, "CORS": null, "Auth": null, "MaxRequestBodySize": 0, "IncludeMetadata": false, "Path": "/", "CheckCollectorPipeline": { "Enabled": false, "Interval": "5m", "ExporterFailureThreshold": 5 } } }
2023-02-16T14:46:24.641Z warn internal/warning.go:51 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks { "kind": "extension", "name": "health_check", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks" }
2023-02-16T14:46:24.641Z info extensions/extensions.go:48 Extension started. { "kind": "extension", "name": "health_check" }
2023-02-16T14:46:24.641Z info extensions/extensions.go:44 Extension is starting... { "kind": "extension", "name": "pprof" }
2023-02-16T14:46:24.641Z info pprofextension@v0.70.0/pprofextension.go:71 Starting net/http/pprof server { "kind": "extension", "name": "pprof", "config": { "TCPAddr": { "Endpoint": ":1888" }, "BlockProfileFraction": 0, "MutexProfileFraction": 0, "SaveToFile": "" } }
2023-02-16T14:46:24.641Z info extensions/extensions.go:48 Extension started. { "kind": "extension", "name": "pprof" }
2023-02-16T14:46:24.641Z info service/pipelines.go:86 Starting exporters...
2023-02-16T14:46:24.641Z info service/pipelines.go:90 Exporter is starting... { "kind": "exporter", "data_type": "traces", "name": "awsxray" }
2023-02-16T14:46:24.641Z info service/pipelines.go:94 Exporter started. { "kind": "exporter", "data_type": "traces", "name": "awsxray" }
2023-02-16T14:46:24.641Z info service/pipelines.go:90 Exporter is starting... { "kind": "exporter", "data_type": "metrics", "name": "logging" }
2023-02-16T14:46:24.641Z info service/pipelines.go:94 Exporter started. { "kind": "exporter", "data_type": "metrics", "name": "logging" }
2023-02-16T14:46:24.641Z info service/pipelines.go:90 Exporter is starting... { "kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite" }
2023-02-16T14:46:24.641Z info service/service.go:154 Starting shutdown...
2023-02-16T14:46:24.641Z info healthcheck/handler.go:129 Health Check state change { "kind": "extension", "name": "health_check", "status": "unavailable" }
2023-02-16T14:46:24.641Z info service/pipelines.go:130 Stopping receivers...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25529d6]
goroutine 1 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awsecscontainermetricsreceiver.(*awsEcsContainerMetricsReceiver).Shutdown(0x0?, {0x3de05ee?, 0x75e419?})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awsecscontainermetricsreceiver@v0.70.0/receiver.go:82 +0x16
go.opentelemetry.io/collector/service.(*builtPipelines).ShutdownAll(0xc00082cc80, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/service/pipelines.go:133 +0x499
go.opentelemetry.io/collector/service.(*Service).Shutdown(0xc00052a000, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/service/service.go:160 +0xd9
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents(0xc000b9f980, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/otelcol/collector.go:181 +0x5a8
go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc000b9f980, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/otelcol/collector.go:205 +0x65
main.newCommand.func1(0xc00020e900, {0x3db707e?, 0x1?, 0x1?})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:122 +0x267
github.com/spf13/cobra.(*Command).execute(0xc00020e900, {0xc000122010, 0x1, 0x1})
github.com/spf13/cobra@v1.6.1/command.go:916 +0x862
github.com/spf13/cobra.(*Command).ExecuteC(0xc00020e900)
github.com/spf13/cobra@v1.6.1/command.go:1044 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.6.1/command.go:968
main.runInteractive({{0xc00091e000, 0xc00091e1e0, 0xc00091e210, 0xc00074ffb0, 0x0}, {{0x3dd5ef8, 0x12}, {0x3dd480c, 0x12}, {0x44b6870, ...}}, ...})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:84 +0x5e
main.run({{0xc00091e000, 0xc00091e1e0, 0xc00091e210, 0xc00074ffb0, 0x0}, {{0x3dd5ef8, 0x12}, {0x3dd480c, 0x12}, {0x44b6870, ...}}, ...})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main_others.go:42 +0xf8
main.main()
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:77 +0x2be
@davetbo-amzn thanks for the report! I have filed a PR upstream to fix this. I'll leave this open until I can confidently say what version of the ADOT Collector the fix will be a part of.
Thanks for the quick response, @bryan-aguilar! Is this something different from the SIGSEGV that was originally in this thread? Might it be that my stack somehow pulled an old version of the collector? This was part of a Proton workshop, so I'm not completely familiar with how they set it up.
If it's possible I have an old version, how would I check my version?
This config works:
Value: !Sub "receivers: \n prometheus:\n config:\n global:\n scrape_interval: 1m\n scrape_timeout: 10s\n scrape_configs:\n - job_name: \"appmesh-envoy\"\n sample_limit: 10000\n metrics_path: /stats/prometheus\n static_configs:\n - targets: ['0.0.0.0:9901']\n awsecscontainermetrics:\n collection_interval: 15s\n otlp:\n protocols:\n grpc:\n endpoint: 0.0.0.0:4317\n http:\n endpoint: 0.0.0.0:55681\n awsxray:\n endpoint: 0.0.0.0:2000\n transport: udp\n statsd:\n endpoint: 0.0.0.0:8125\n aggregation_interval: 60s\nprocessors:\n batch/traces:\n timeout: 1s\n send_batch_size: 50\n batch/metrics:\n timeout: 60s\n filter:\n metrics:\n include:\n match_type: strict\n metric_names:\n - ecs.task.memory.utilized\n - ecs.task.memory.reserved\n - ecs.task.memory.usage\n - ecs.task.cpu.utilized\n - ecs.task.cpu.reserved\n - ecs.task.cpu.usage.vcpu\n - ecs.task.network.rate.rx\n - ecs.task.network.rate.tx\n - ecs.task.storage.read_bytes\n - ecs.task.storage.write_bytes\nexporters:\n awsxray:\n prometheusremotewrite:\n endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write\n resource_to_telemetry_conversion:\n enabled: true\n awsemf:\n namespace: ECS/AWSOtel/Application\n log_group_name: '/ecs/application/metrics/{ClusterName}'\n log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'\n resource_to_telemetry_conversion:\n enabled: true\n dimension_rollup_option: NoDimensionRollup\n metric_declarations:\n - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]\n metric_name_selectors:\n - \"^envoy_http_downstream_rq_(total|xx)$\"\n - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"\n - \"^envoy_cluster_membership_(healthy|total)$\"\n - \"^envoy_server_memory_(allocated|heap_size)$\"\n - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"\n - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"\n - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"\n - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"\n - \"^envoy_cluster_upstream_rq_retry$\"\n - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"\n - \"^envoy_server_(version|uptime|live)$\"\n label_matchers:\n - label_names:\n - container_name\n regex: ^envoy$\n - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]\n metric_name_selectors:\n - \"^envoy_http_downstream_rq_xx$\"\n label_matchers:\n - label_names:\n - container_name\n regex: ^envoy$\n logging:\n loglevel: debug\nextensions:\n health_check:\n pprof:\n endpoint: :1888\n zpages:\n endpoint: :55679\nservice:\n extensions: [pprof, zpages, health_check]\n pipelines:\n metrics:\n receivers: [otlp, statsd]\n processors: [batch/metrics]\n exporters: [logging, prometheusremotewrite, awsemf]\n metrics/envoy:\n receivers: [prometheus]\n processors: [batch/metrics]\n exporters: [logging, prometheusremotewrite, awsemf]\n metrics/ecs:\n receivers: [awsecscontainermetrics]\n processors: [filter, batch/metrics]\n exporters: [logging, prometheusremotewrite, awsemf]\n traces:\n receivers: [otlp, awsxray]\n processors: [batch/traces]\n exporters: [awsxray]\n"
Or presented with the \n turned into newlines:
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
        - job_name: \"appmesh-envoy\"
          sample_limit: 10000
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ['0.0.0.0:9901']
  awsecscontainermetrics:
    collection_interval: 15s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.memory.usage
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.cpu.usage.vcpu
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsxray:
  prometheusremotewrite:
    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write
    resource_to_telemetry_conversion:
      enabled: true
  awsemf:
    namespace: ECS/AWSOtel/Application
    log_group_name: '/ecs/application/metrics/{ClusterName}'
    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_(total|xx)$\"
          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"
          - \"^envoy_cluster_membership_(healthy|total)$\"
          - \"^envoy_server_memory_(allocated|heap_size)$\"
          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"
          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"
          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"
          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"
          - \"^envoy_cluster_upstream_rq_retry$\"
          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"
          - \"^envoy_server_(version|uptime|live)$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_xx$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/envoy:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter, batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    traces:
      receivers: [otlp, awsxray]
      processors: [batch/traces]
      exporters: [awsxray]
Here's the diff:
diff old.yml new.yml
50d49
< region: us-east-1
53,54c52,53
< auth:
< authenticator: sigv4auth
---
> resource_to_telemetry_conversion:
> enabled: true
113a113
> "
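For what it's worth, if I had wanted to keep SigV4 auth on the prometheusremotewrite exporter instead of dropping the auth block, my understanding is that the sigv4auth authenticator would also need to be declared under extensions and listed in service.extensions so that the auth.authenticator reference can resolve. A sketch of that wiring (untested against my template, region value illustrative):

extensions:
  sigv4auth:
    region: us-east-1
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

exporters:
  prometheusremotewrite:
    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write
    auth:
      authenticator: sigv4auth

service:
  extensions: [pprof, zpages, health_check, sigv4auth]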
The SIGSEGV you reported was due to an unchecked nil value in the shutdown process of the awsecscontainermetrics receiver. The original report was an error in the prometheus receiver. They do not appear related other than both being segmentation faults.
Thanks for the quick response, @bryan-aguilar! Is this something different from the SIGSEGV that was originally in this thread? Might it be that my stack somehow pulled an old version of the collector? This was part of a Proton workshop, so I'm not completely familiar with how they set it up.
Yes, this was a different issue. Or, rather, a different instance of the same class of issue. The original report related to metric adjustment in the prometheus receiver failing to check whether a pointer was nil before using it. Your issue related to shutdown of the awsecscontainermetrics receiver failing to check whether a function pointer was nil before using it.
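To illustrate the class of fix: the panic happens when Shutdown dereferences something that is only assigned during Start, and in your log shutdown began before the receivers ever started, so the guard is simply a nil check before the cleanup function is invoked. A rough Go sketch of the pattern (illustrative names only, not the literal upstream receiver code):

package main

import (
	"context"
	"fmt"
)

// receiver is an illustrative stand-in for a collector receiver whose
// cleanup function is only assigned once Start has actually run.
type receiver struct {
	cancel context.CancelFunc
}

func (r *receiver) Start(ctx context.Context) error {
	// A real receiver would kick off its collection loop here.
	_, cancel := context.WithCancel(ctx)
	r.cancel = cancel
	return nil
}

// Shutdown guards against cancel being nil, which is the case when the
// collector tears the pipeline down before this receiver ever started
// (the scenario in the log above).
func (r *receiver) Shutdown(ctx context.Context) error {
	if r.cancel != nil {
		r.cancel()
	}
	return nil
}

func main() {
	r := &receiver{}
	// Shutdown without a prior Start: with the nil check this is a no-op
	// instead of a nil pointer dereference.
	fmt.Println(r.Shutdown(context.Background()))
}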
If it's possible I have an old version, how would I check my version?
You can see your version in the logs:
2023-02-16T14:46:24.640Z info service/service.go:128 Starting aws-otel-collector... { "Version": "v0.26.1", "NumCPU": 2 }
Thanks for the quick responses, all!
@davetbo-amzn we released ADOT Collector v0.27.0 (https://aws-otel.github.io/docs/ReleaseBlogs/aws-distro-for-opentelemetry-collector-v0.27.0) earlier this week, which addresses this issue.
That seems to have resolved the error. Thanks!
Describe the bug: A panic (SIGSEGV) is produced during normal operation.
Steps to reproduce: During normal operation, a SEGFAULT and panic are produced, causing the OTel agent to crash. The collector is deployed as a sidecar in an ECS EC2 task, running ECS-optimized Amazon Linux 2 on ARM64 hardware.
CONFIG (VIA ENV VAR FROM PARAMETER STORE):

receivers:
  prometheus:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 5s
      scrape_configs:
processors:
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
exporters:
  awsprometheusremotewrite:
    endpoint: "
    aws_auth:
      region: "us-west-2"
      service: "aps"
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, awsprometheusremotewrite]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter]
      exporters: [logging, awsprometheusremotewrite]
What did you expect to see? I expected the process not to SEGFAULT or panic during normal operation.
What did you see instead? LogOutput:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2456ba0]
goroutine 143 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricSummary(0x400032ba10, 0x4000412fd0)
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:455 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricPoints(0x400032ba10, 0x4000412fd0)
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:283 +0x304
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetric(0x400032ba10, 0x4000412fd0)
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:269 +0x134
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).AdjustMetricSlice(0x400032ba10, 0x4001138600)
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:235 +0x80
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*transactionPdata).Commit(0x400074e1c0)
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_transaction.go:150 +0x208
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1(0x400032bd08, 0x400032bd18, 0x400073b040)
  github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:1250 +0x40
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0x400073b040, {0xc07ce8b413ee0de3, 0x15fbf849f5, 0x54ed800}, {0x13f51c5f, 0xed9a5225a, 0x54ed800}, 0x0)
  github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:1321 +0xe0c
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0x400073b040, 0x0)
  github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:1203 +0x2d0
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
  github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:584 +0x8f8
Environment: The collector is running in AWS as a sidecar within a task, on ECS-optimized Amazon Linux 2 on an ARM64 host.
Additional context This doesn't happen immediately, only after 10 min or so of run time.