aws-observability / aws-otel-collector

AWS Distro for OpenTelemetry Collector (see ADOT Roadmap at https://github.com/orgs/aws-observability/projects/4)
https://aws-otel.github.io/

Panic and SIGSEGV #982

Closed: beegmon closed this issue 1 year ago

beegmon commented 2 years ago

Describe the bug: A panic (SIGSEGV) is produced during normal operation.

Steps to reproduce: During normal operation, a SEGFAULT and panic is produced that causes the OTEL collector to crash. The collector is deployed as a sidecar in an ECS EC2 task, running ECS-optimized Amazon Linux 2 on ARM64 hardware.

CONFIG (VIA ENV VAR FROM PARAMETER STORE):

receivers:
  prometheus:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 5s
      scrape_configs:

processors:
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:

exporters:
  awsprometheusremotewrite:
    endpoint: "
    aws_auth:
      region: "us-west-2"
      service: "aps"
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: debug

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging, awsprometheusremotewrite]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter]
      exporters: [logging, awsprometheusremotewrite]
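The configuration above is supplied to the collector as an environment variable sourced from Parameter Store. For reference, a task-definition fragment for that arrangement might look like the sketch below; the image tag and parameter ARN are illustrative, and the ADOT collector reads its full YAML configuration from the AOT_CONFIG_CONTENT environment variable.

{
  "containerDefinitions": [
    {
      "name": "aws-otel-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
      "essential": true,
      "secrets": [
        {
          "name": "AOT_CONFIG_CONTENT",
          "valueFrom": "arn:aws:ssm:us-west-2:123456789012:parameter/otel-collector-config"
        }
      ]
    }
  ]
}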

What did you expect to see? I expect the process not to SEGFAULT or panic during normal operation.

What did you see instead? Log output:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2456ba0]
goroutine 143 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricSummary(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:455 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetricPoints(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:283 +0x304
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).adjustMetric(0x400032ba10, 0x4000412fd0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:269 +0x134
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*MetricsAdjusterPdata).AdjustMetricSlice(0x400032ba10, 0x4001138600)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_metrics_adjuster.go:235 +0x80
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal.(*transactionPdata).Commit(0x400074e1c0)
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.43.0/internal/otlp_transaction.go:150 +0x208
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1(0x400032bd08, 0x400032bd18, 0x400073b040)
github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:1250 +0x40
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0x400073b040, {0xc07ce8b413ee0de3, 0x15fbf849f5, 0x54ed800}, {0x13f51c5f, 0xed9a5225a, 0x54ed800}, 0x0)
github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:1321 +0xe0c
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0x400073b040, 0x0)
github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:1203 +0x2d0
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
github.com/prometheus/prometheus@v1.8.2-0.20220111145625-076109fa1910/scrape/scrape.go:584 +0x8f8

Environment: The collector is running in AWS as a sidecar within an ECS task, on ECS-optimized Amazon Linux 2 on an ARM64 host.

Additional context: This doesn't happen immediately, only after 10 minutes or so of run time.
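Looking at the trace above, the panic comes from the receiver's metrics adjuster dereferencing a cached timeseries entry that turned out to be nil. As a rough illustration of that failure class, and only as an illustration (plain Go, not the collector's actual code; all names are made up), the pattern and the nil guard that prevents it look like this:

package main

import "fmt"

// cachedPoint stands in for the previously observed datapoint that a
// metrics adjuster looks up per timeseries; a missing entry yields nil.
type cachedPoint struct {
	startTimestamp int64
}

type adjuster struct {
	previous map[string]*cachedPoint
}

// adjustSummary mimics the pattern that panicked: using the cached value
// without first checking whether the lookup actually returned one.
func (a *adjuster) adjustSummary(series string, current int64) int64 {
	prev := a.previous[series]
	if prev == nil {
		// Guard: first observation of this series, nothing to adjust against.
		a.previous[series] = &cachedPoint{startTimestamp: current}
		return current
	}
	// Without the guard above, this dereference would be the nil-pointer panic.
	return prev.startTimestamp
}

func main() {
	a := &adjuster{previous: map[string]*cachedPoint{}}
	fmt.Println(a.adjustSummary("http_requests_summary", 1700000000))
	fmt.Println(a.adjustSummary("http_requests_summary", 1700000010))
}

The real defect and its fix live upstream in the prometheusreceiver; this sketch only shows the general shape of an unguarded dereference and its guard.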

beegmon commented 2 years ago

It looks like this issue may have been fixed in contrib. I am curious when those changes will be pulled into the AWS OTel collector?

bryan-aguilar commented 2 years ago

If these changes were recently fixed upstream, you can expect them to be pulled into the ADOT Collector release v0.18.0.

vsakaram commented 1 year ago

Closing as addressed earlier in the year.

davetbo-amzn commented 1 year ago

I'm getting this issue now. Is this really fixed? Here's my otel-config with \n replaced by newlines for readability. Note that the \" below are there because this was originally a quoted string in the template; I left them in to keep the changes minimal.

receivers:  
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
      - job_name: \"appmesh-envoy\"
        sample_limit: 10000
        metrics_path: /stats/prometheus
        static_configs:
          - targets: ['0.0.0.0:9901']
  awsecscontainermetrics:
    collection_interval: 15s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.memory.usage
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.cpu.usage.vcpu
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsxray:
    region: us-east-1
  prometheusremotewrite:
    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write
    auth:
      authenticator: sigv4auth
  awsemf:
    namespace: ECS/AWSOtel/Application
    log_group_name: '/ecs/application/metrics/{ClusterName}'
    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_(total|xx)$\"
          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"
          - \"^envoy_cluster_membership_(healthy|total)$\"
          - \"^envoy_server_memory_(allocated|heap_size)$\"
          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"
          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"
          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"
          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"
          - \"^envoy_cluster_upstream_rq_retry$\"
          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"
          - \"^envoy_server_(version|uptime|live)$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_xx$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/envoy:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter, batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    traces:
      receivers: [otlp, awsxray]
      processors: [batch/traces]
      exporters: [awsxray]

Here's the way it is in my template:

      Value: !Sub "receivers:  \n  prometheus:\n    config:\n      global:\n        scrape_interval: 1m\n        scrape_timeout: 10s\n      scrape_configs:\n      - job_name: \"appmesh-envoy\"\n        sample_limit: 10000\n        metrics_path: /stats/prometheus\n        static_configs:\n          - targets: ['0.0.0.0:9901']\n  awsecscontainermetrics:\n    collection_interval: 15s\n  otlp:\n    protocols:\n      grpc:\n        endpoint: 0.0.0.0:4317\n      http:\n        endpoint: 0.0.0.0:55681\n  awsxray:\n    endpoint: 0.0.0.0:2000\n    transport: udp\n  statsd:\n    endpoint: 0.0.0.0:8125\n    aggregation_interval: 60s\nprocessors:\n  batch/traces:\n    timeout: 1s\n    send_batch_size: 50\n  batch/metrics:\n    timeout: 60s\n  filter:\n    metrics:\n      include:\n        match_type: strict\n        metric_names:\n          - ecs.task.memory.utilized\n          - ecs.task.memory.reserved\n          - ecs.task.memory.usage\n          - ecs.task.cpu.utilized\n          - ecs.task.cpu.reserved\n          - ecs.task.cpu.usage.vcpu\n          - ecs.task.network.rate.rx\n          - ecs.task.network.rate.tx\n          - ecs.task.storage.read_bytes\n          - ecs.task.storage.write_bytes\nexporters:\n  awsxray:\n    region: us-east-1\n  prometheusremotewrite:\n    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write\n    auth:\n      authenticator: sigv4auth\n  awsemf:\n    namespace: ECS/AWSOtel/Application\n    log_group_name: '/ecs/application/metrics/{ClusterName}'\n    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'\n    resource_to_telemetry_conversion:\n      enabled: true\n    dimension_rollup_option: NoDimensionRollup\n    metric_declarations:\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_(total|xx)$\"\n          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"\n          - \"^envoy_cluster_membership_(healthy|total)$\"\n          - \"^envoy_server_memory_(allocated|heap_size)$\"\n          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"\n          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"\n          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"\n          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"\n          - \"^envoy_cluster_upstream_rq_retry$\"\n          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"\n          - \"^envoy_server_(version|uptime|live)$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_xx$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n  logging:\n    loglevel: debug\nextensions:\n  health_check:\n  pprof:\n    endpoint: :1888\n  zpages:\n    endpoint: :55679\nservice:\n  extensions: [pprof, zpages, health_check]\n  pipelines:\n    metrics:\n      receivers: [otlp, statsd]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    metrics/envoy:\n      receivers: [prometheus]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    
metrics/ecs:\n      receivers: [awsecscontainermetrics]\n      processors: [filter, batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    traces:\n      receivers: [otlp, awsxray]\n      processors: [batch/traces]\n      exporters: [awsxray]\n"
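As an aside, the escaped template string above can be expanded back into readable YAML with a small helper. A minimal Go sketch (reading the escaped text from stdin; purely a convenience, not part of the collector) would be:

package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	raw, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Expand the CloudFormation-escaped string: a literal \n becomes a real
	// newline and \" becomes a plain double quote.
	fmt.Print(strings.NewReplacer(`\n`, "\n", `\"`, `"`).Replace(string(raw)))
}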

Here's the error with the code and memory address:

panic: runtime error: invalid memory address or nil pointer dereference 
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25529d6]

Any advice would be greatly appreciated.

vsakaram commented 1 year ago

Thanks @davetbo-amzn for reaching out with the above details; much appreciated. Reopening so we can review and update.

Aneurysm9 commented 1 year ago

@davetbo-amzn can you please include the full stack trace that followed the panic?

davetbo-amzn commented 1 year ago

Here you go. This was the entirety of the fargate/otel/otel-collector* log for this run from CloudWatch. Thanks for taking a look!


2023/02/16 14:46:24 ADOT Collector version: v0.26.1
2023/02/16 14:46:24 found no extra config, skip it, err: open /opt/aws/aws-otel-collector/etc/extracfg.txt: no such file or directory
2023/02/16 14:46:24 Reading AOT config from environment: receivers:
prometheus:
config:
global:
scrape_interval: 1m
scrape_timeout: 10s
scrape_configs:
- job_name: "appmesh-envoy"
sample_limit: 10000
metrics_path: /stats/prometheus
static_configs:
- targets: ['0.0.0.0:9901']
awsecscontainermetrics:
collection_interval: 15s
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:55681
awsxray:
endpoint: 0.0.0.0:2000
transport: udp
statsd:
endpoint: 0.0.0.0:8125
aggregation_interval: 60s
processors:
batch/traces:
timeout: 1s
send_batch_size: 50
batch/metrics:
timeout: 60s
filter:
metrics:
include:
match_type: strict
metric_names:
- ecs.task.memory.utilized
- ecs.task.memory.reserved
- ecs.task.memory.usage
- ecs.task.cpu.utilized
- ecs.task.cpu.reserved
- ecs.task.cpu.usage.vcpu
- ecs.task.network.rate.rx
- ecs.task.network.rate.tx
- ecs.task.storage.read_bytes
- ecs.task.storage.write_bytes
exporters:
awsxray:
region: us-east-1
prometheusremotewrite:
endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-f87b6940-bfcc-4ec4-b9b1-325189711ad5/api/v1/remote_write
auth:
authenticator: sigv4auth
awsemf:
namespace: ECS/AWSOtel/Application
log_group_name: '/ecs/application/metrics/{ClusterName}'
log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
resource_to_telemetry_conversion:
enabled: true
dimension_rollup_option: NoDimensionRollup
metric_declarations:
- dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
metric_name_selectors:
- "^envoy_http_downstream_rq_(total\|xx)$"
- "^envoy_cluster_upstream_cx_(r\|t)x_bytes_total$"
- "^envoy_cluster_membership_(healthy\|total)$"
- "^envoy_server_memory_(allocated\|heap_size)$"
- "^envoy_cluster_upstream_cx_(connect_timeout\|destroy_local_with_active_rq)$"
- "^envoy_cluster_upstream_rq_(pending_failure_eject\|pending_overflow\|timeout\|per_try_timeout\|rx_reset\|maintenance_mode)$"
- "^envoy_http_downstream_cx_destroy_remote_active_rq$"
- "^envoy_cluster_upstream_flow_control_(paused_reading_total\|resumed_reading_total\|backed_up_total\|drained_total)$"
- "^envoy_cluster_upstream_rq_retry$"
- "^envoy_cluster_upstream_rq_retry_(success\|overflow)$"
- "^envoy_server_(version\|uptime\|live)$"
label_matchers:
- label_names:
- container_name
regex: ^envoy$
- dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
metric_name_selectors:
- "^envoy_http_downstream_rq_xx$"
label_matchers:
- label_names:
- container_name
regex: ^envoy$
logging:
loglevel: debug
extensions:
health_check:
pprof:
endpoint: :1888
zpages:
endpoint: :55679
service:
extensions: [pprof, zpages, health_check]
pipelines:
metrics:
receivers: [otlp, statsd]
processors: [batch/metrics]
exporters: [logging, prometheusremotewrite, awsemf]
metrics/envoy:
receivers: [prometheus]
processors: [batch/metrics]
exporters: [logging, prometheusremotewrite, awsemf]
metrics/ecs:
receivers: [awsecscontainermetrics]
processors: [filter, batch/metrics]
exporters: [logging, prometheusremotewrite, awsemf]
traces:
receivers: [otlp, awsxray]
processors: [batch/traces]
exporters: [awsxray]
2023-02-16T14:46:24.621Z    info    service/telemetry.go:90 Setting up own telemetry...
2023-02-16T14:46:24.621Z    info    service/telemetry.go:116    Serving Prometheus metrics  {     "address": ":8888",     "level": "Basic" }
2023-02-16T14:46:24.630Z    info    exporter/exporter.go:290    Development component. May change in the future.    {     "kind": "exporter",     "data_type": "metrics",     "name": "logging" }
2023-02-16T14:46:24.630Z    warn    loggingexporter@v0.70.0/factory.go:109  'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior.  {     "kind": "exporter",     "data_type": "metrics",     "name": "logging",     "loglevel": "debug",     "equivalent verbosity level": "Detailed" }
2023-02-16T14:46:24.637Z    info    filterprocessor@v0.70.0/metrics.go:97   Metric filter configured    {     "kind": "processor",     "name": "filter",     "pipeline": "metrics/ecs",     "include match_type": "strict",     "include expressions": [],     "include metric names": [         "ecs.task.memory.utilized",         "ecs.task.memory.reserved",         "ecs.task.memory.usage",         "ecs.task.cpu.utilized",         "ecs.task.cpu.reserved",         "ecs.task.cpu.usage.vcpu",         "ecs.task.network.rate.rx",         "ecs.task.network.rate.tx",         "ecs.task.storage.read_bytes",         "ecs.task.storage.write_bytes"     ],     "include metrics with resource attributes": null,     "exclude match_type": "",     "exclude expressions": [],     "exclude metric names": [],     "exclude metrics with resource attributes": null }
2023-02-16T14:46:24.638Z    info    awsxrayreceiver@v0.70.0/receiver.go:58  Going to listen on endpoint for X-Ray segments  {     "kind": "receiver",     "name": "awsxray",     "pipeline": "traces",     "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.638Z    info    udppoller/poller.go:106 Listening on endpoint for X-Ray segments    {     "kind": "receiver",     "name": "awsxray",     "pipeline": "traces",     "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.638Z    info    awsxrayreceiver@v0.70.0/receiver.go:69  Listening on endpoint for X-Ray segments    {     "kind": "receiver",     "name": "awsxray",     "pipeline": "traces",     "udp": "0.0.0.0:2000" }
2023-02-16T14:46:24.640Z    info    service/service.go:128  Starting aws-otel-collector...  {     "Version": "v0.26.1",     "NumCPU": 2 }
2023-02-16T14:46:24.640Z    info    extensions/extensions.go:41 Starting extensions...
2023-02-16T14:46:24.640Z    info    extensions/extensions.go:44 Extension is starting...    {     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z    info    zpagesextension@v0.70.0/zpagesextension.go:64   Registered zPages span processor on tracer provider {     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z    info    zpagesextension@v0.70.0/zpagesextension.go:74   Registered Host's zPages    {     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z    info    zpagesextension@v0.70.0/zpagesextension.go:86   Starting zPages extension   {     "kind": "extension",     "name": "zpages",     "config": {         "TCPAddr": {             "Endpoint": ":55679"         }     } }
2023-02-16T14:46:24.640Z    info    extensions/extensions.go:48 Extension started.  {     "kind": "extension",     "name": "zpages" }
2023-02-16T14:46:24.640Z    info    extensions/extensions.go:44 Extension is starting...    {     "kind": "extension",     "name": "health_check" }
2023-02-16T14:46:24.640Z    info    healthcheckextension@v0.70.0/healthcheckextension.go:45 Starting health_check extension {     "kind": "extension",     "name": "health_check",     "config": {         "Endpoint": "0.0.0.0:13133",         "TLSSetting": null,         "CORS": null,         "Auth": null,         "MaxRequestBodySize": 0,         "IncludeMetadata": false,         "Path": "/",         "CheckCollectorPipeline": {             "Enabled": false,             "Interval": "5m",             "ExporterFailureThreshold": 5         }     } }
2023-02-16T14:46:24.641Z    warn    internal/warning.go:51  Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks    {     "kind": "extension",     "name": "health_check",     "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks" }
2023-02-16T14:46:24.641Z    info    extensions/extensions.go:48 Extension started.  {     "kind": "extension",     "name": "health_check" }
2023-02-16T14:46:24.641Z    info    extensions/extensions.go:44 Extension is starting...    {     "kind": "extension",     "name": "pprof" }
2023-02-16T14:46:24.641Z    info    pprofextension@v0.70.0/pprofextension.go:71 Starting net/http/pprof server  {     "kind": "extension",     "name": "pprof",     "config": {         "TCPAddr": {             "Endpoint": ":1888"         },         "BlockProfileFraction": 0,         "MutexProfileFraction": 0,         "SaveToFile": ""     } }
2023-02-16T14:46:24.641Z    info    extensions/extensions.go:48 Extension started.  {     "kind": "extension",     "name": "pprof" }
2023-02-16T14:46:24.641Z    info    service/pipelines.go:86 Starting exporters...
2023-02-16T14:46:24.641Z    info    service/pipelines.go:90 Exporter is starting... {     "kind": "exporter",     "data_type": "traces",     "name": "awsxray" }
2023-02-16T14:46:24.641Z    info    service/pipelines.go:94 Exporter started.   {     "kind": "exporter",     "data_type": "traces",     "name": "awsxray" }
2023-02-16T14:46:24.641Z    info    service/pipelines.go:90 Exporter is starting... {     "kind": "exporter",     "data_type": "metrics",     "name": "logging" }
2023-02-16T14:46:24.641Z    info    service/pipelines.go:94 Exporter started.   {     "kind": "exporter",     "data_type": "metrics",     "name": "logging" }
2023-02-16T14:46:24.641Z    info    service/pipelines.go:90 Exporter is starting... {     "kind": "exporter",     "data_type": "metrics",     "name": "prometheusremotewrite" }
2023-02-16T14:46:24.641Z    info    service/service.go:154  Starting shutdown...
2023-02-16T14:46:24.641Z    info    healthcheck/handler.go:129  Health Check state change   {     "kind": "extension",     "name": "health_check",     "status": "unavailable" }
2023-02-16T14:46:24.641Z    info    service/pipelines.go:130    Stopping receivers...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x25529d6]
goroutine 1 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awsecscontainermetricsreceiver.(*awsEcsContainerMetricsReceiver).Shutdown(0x0?, {0x3de05ee?, 0x75e419?})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awsecscontainermetricsreceiver@v0.70.0/receiver.go:82 +0x16
go.opentelemetry.io/collector/service.(*builtPipelines).ShutdownAll(0xc00082cc80, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/service/pipelines.go:133 +0x499
go.opentelemetry.io/collector/service.(*Service).Shutdown(0xc00052a000, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/service/service.go:160 +0xd9
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents(0xc000b9f980, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/otelcol/collector.go:181 +0x5a8
go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc000b9f980, {0x44fa0b0, 0xc000126000})
go.opentelemetry.io/collector@v0.70.0/otelcol/collector.go:205 +0x65
main.newCommand.func1(0xc00020e900, {0x3db707e?, 0x1?, 0x1?})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:122 +0x267
github.com/spf13/cobra.(*Command).execute(0xc00020e900, {0xc000122010, 0x1, 0x1})
github.com/spf13/cobra@v1.6.1/command.go:916 +0x862
github.com/spf13/cobra.(*Command).ExecuteC(0xc00020e900)
github.com/spf13/cobra@v1.6.1/command.go:1044 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.6.1/command.go:968
main.runInteractive({{0xc00091e000, 0xc00091e1e0, 0xc00091e210, 0xc00074ffb0, 0x0}, {{0x3dd5ef8, 0x12}, {0x3dd480c, 0x12}, {0x44b6870, ...}}, ...})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:84 +0x5e
main.run({{0xc00091e000, 0xc00091e1e0, 0xc00091e210, 0xc00074ffb0, 0x0}, {{0x3dd5ef8, 0x12}, {0x3dd480c, 0x12}, {0x44b6870, ...}}, ...})
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main_others.go:42 +0xf8
main.main()
github.com/aws-observability/aws-otel-collector/cmd/awscollector/main.go:77 +0x2be
bryan-aguilar commented 1 year ago

@davetbo-amzn thanks for the report! I have filed a PR upstream to fix this. I'll leave this open until I can confidently say what version of the ADOT Collector the fix will be a part of.
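For context, the panic happens while the collector is shutting its pipelines down after a failed startup, and something the receiver's Shutdown method uses is still nil at that point. The general defensive pattern such fixes take is sketched below; this is illustrative Go only, not the actual upstream patch, and the field names are hypothetical:

package main

import (
	"context"
	"fmt"
)

// metricsReceiver stands in for a receiver whose Start may never have run
// before Shutdown is called (e.g. when the collector aborts startup early).
type metricsReceiver struct {
	cancel context.CancelFunc // nil until Start assigns it
}

func (r *metricsReceiver) Shutdown(ctx context.Context) error {
	// Guard both the receiver itself and the cancel func before use;
	// either can be nil if startup never completed.
	if r == nil || r.cancel == nil {
		return nil
	}
	r.cancel()
	return nil
}

func main() {
	var r *metricsReceiver
	// Safe: the guard returns nil instead of panicking on the nil receiver.
	fmt.Println(r.Shutdown(context.Background()))
}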

davetbo-amzn commented 1 year ago

Thanks for the quick response, @bryan-aguilar! Is this something different from the SIGSEGV that was originally reported in this thread? Might it be that my stack somehow pulled in an old version of the collector? This was part of a Proton workshop, so I'm not completely familiar with how it was set up.

If it's possible I have an old version, how would I check my version?

davetbo-amzn commented 1 year ago

This config works:

      Value: !Sub "receivers:  \n  prometheus:\n    config:\n      global:\n        scrape_interval: 1m\n        scrape_timeout: 10s\n      scrape_configs:\n      - job_name: \"appmesh-envoy\"\n        sample_limit: 10000\n        metrics_path: /stats/prometheus\n        static_configs:\n          - targets: ['0.0.0.0:9901']\n  awsecscontainermetrics:\n    collection_interval: 15s\n  otlp:\n    protocols:\n      grpc:\n        endpoint: 0.0.0.0:4317\n      http:\n        endpoint: 0.0.0.0:55681\n  awsxray:\n    endpoint: 0.0.0.0:2000\n    transport: udp\n  statsd:\n    endpoint: 0.0.0.0:8125\n    aggregation_interval: 60s\nprocessors:\n  batch/traces:\n    timeout: 1s\n    send_batch_size: 50\n  batch/metrics:\n    timeout: 60s\n  filter:\n    metrics:\n      include:\n        match_type: strict\n        metric_names:\n          - ecs.task.memory.utilized\n          - ecs.task.memory.reserved\n          - ecs.task.memory.usage\n          - ecs.task.cpu.utilized\n          - ecs.task.cpu.reserved\n          - ecs.task.cpu.usage.vcpu\n          - ecs.task.network.rate.rx\n          - ecs.task.network.rate.tx\n          - ecs.task.storage.read_bytes\n          - ecs.task.storage.write_bytes\nexporters:\n  awsxray:\n  prometheusremotewrite:\n    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write\n    resource_to_telemetry_conversion:\n      enabled: true\n  awsemf:\n    namespace: ECS/AWSOtel/Application\n    log_group_name: '/ecs/application/metrics/{ClusterName}'\n    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'\n    resource_to_telemetry_conversion:\n      enabled: true\n    dimension_rollup_option: NoDimensionRollup\n    metric_declarations:\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_(total|xx)$\"\n          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"\n          - \"^envoy_cluster_membership_(healthy|total)$\"\n          - \"^envoy_server_memory_(allocated|heap_size)$\"\n          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"\n          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"\n          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"\n          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"\n          - \"^envoy_cluster_upstream_rq_retry$\"\n          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"\n          - \"^envoy_server_(version|uptime|live)$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]\n        metric_name_selectors:\n          - \"^envoy_http_downstream_rq_xx$\"\n        label_matchers:\n          - label_names:\n              - container_name\n            regex: ^envoy$\n  logging:\n    loglevel: debug\nextensions:\n  health_check:\n  pprof:\n    endpoint: :1888\n  zpages:\n    endpoint: :55679\nservice:\n  extensions: [pprof, zpages, health_check]\n  pipelines:\n    metrics:\n      receivers: [otlp, statsd]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    metrics/envoy:\n      receivers: [prometheus]\n      processors: [batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    
metrics/ecs:\n      receivers: [awsecscontainermetrics]\n      processors: [filter, batch/metrics]\n      exporters: [logging, prometheusremotewrite, awsemf]\n    traces:\n      receivers: [otlp, awsxray]\n      processors: [batch/traces]\n      exporters: [awsxray]\n"

Or presented with the \n turned into newlines:

receivers:  
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
      - job_name: \"appmesh-envoy\"
        sample_limit: 10000
        metrics_path: /stats/prometheus
        static_configs:
          - targets: ['0.0.0.0:9901']
  awsecscontainermetrics:
    collection_interval: 15s
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 60s
processors:
  batch/traces:
    timeout: 1s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - ecs.task.memory.utilized
          - ecs.task.memory.reserved
          - ecs.task.memory.usage
          - ecs.task.cpu.utilized
          - ecs.task.cpu.reserved
          - ecs.task.cpu.usage.vcpu
          - ecs.task.network.rate.rx
          - ecs.task.network.rate.tx
          - ecs.task.storage.read_bytes
          - ecs.task.storage.write_bytes
exporters:
  awsxray:
  prometheusremotewrite:
    endpoint: ${PrometheusWorkspace.PrometheusEndpoint}api/v1/remote_write
    resource_to_telemetry_conversion:
      enabled: true
  awsemf:
    namespace: ECS/AWSOtel/Application
    log_group_name: '/ecs/application/metrics/{ClusterName}'
    log_stream_name: '/{TaskDefinitionFamily}/{TaskId}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_(total|xx)$\"
          - \"^envoy_cluster_upstream_cx_(r|t)x_bytes_total$\"
          - \"^envoy_cluster_membership_(healthy|total)$\"
          - \"^envoy_server_memory_(allocated|heap_size)$\"
          - \"^envoy_cluster_upstream_cx_(connect_timeout|destroy_local_with_active_rq)$\"
          - \"^envoy_cluster_upstream_rq_(pending_failure_eject|pending_overflow|timeout|per_try_timeout|rx_reset|maintenance_mode)$\"
          - \"^envoy_http_downstream_cx_destroy_remote_active_rq$\"
          - \"^envoy_cluster_upstream_flow_control_(paused_reading_total|resumed_reading_total|backed_up_total|drained_total)$\"
          - \"^envoy_cluster_upstream_rq_retry$\"
          - \"^envoy_cluster_upstream_rq_retry_(success|overflow)$\"
          - \"^envoy_server_(version|uptime|live)$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, envoy_http_conn_manager_prefix, envoy_response_code_class ] ]
        metric_name_selectors:
          - \"^envoy_http_downstream_rq_xx$\"
        label_matchers:
          - label_names:
              - container_name
            regex: ^envoy$
  logging:
    loglevel: debug
extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/envoy:
      receivers: [prometheus]
      processors: [batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    metrics/ecs:
      receivers: [awsecscontainermetrics]
      processors: [filter, batch/metrics]
      exporters: [logging, prometheusremotewrite, awsemf]
    traces:
      receivers: [otlp, awsxray]
      processors: [batch/traces]
      exporters: [awsxray]

Here's the diff:

diff old.yml new.yml
50d49
<     region: us-east-1
53,54c52,53
<     auth:
<       authenticator: sigv4auth
---
>     resource_to_telemetry_conversion:
>       enabled: true
113a113
> "
bryan-aguilar commented 1 year ago

The SIGSEGV you reported was due to an unchecked nil value in the shutdown process of the awsecscontainermetrics receiver. The original report was an error in the prometheus receiver. They do not appear related, other than both being segmentation faults.

Aneurysm9 commented 1 year ago

> Thanks for the quick response, @bryan-aguilar! Is this something different from the SIGSEGV that was originally reported in this thread? Might it be that my stack somehow pulled in an old version of the collector? This was part of a Proton workshop, so I'm not completely familiar with how it was set up.

Yes, this was a different issue, or rather, a different instance of the same class of issue. The original report related to metric adjustment in the Prometheus receiver failing to check whether a pointer was nil prior to using it. Your issue related to shutdown of the awsecscontainermetrics receiver failing to check whether a function pointer was nil prior to using it.

> If it's possible I have an old version, how would I check my version?

You can see your version in the logs:

2023-02-16T14:46:24.640Z    info    service/service.go:128  Starting aws-otel-collector...  {     "Version": "v0.26.1",     "NumCPU": 2 }
davetbo-amzn commented 1 year ago

Thanks for the quick responses, all!

vsakaram commented 1 year ago

Update: The fix was merged as part of upstream collector release v0.72 and will be available in the next ADOT Collector release in about a week.

vsakaram commented 1 year ago

@davetbo-amzn we released ADOT Collector v0.27.0 (https://aws-otel.github.io/docs/ReleaseBlogs/aws-distro-for-opentelemetry-collector-v0.27.0) earlier this week, which addresses this issue.

davetbo-amzn commented 1 year ago

That seems to have resolved the error. Thanks!