SigNoz / signoz

SigNoz is an open-source observability platform native to OpenTelemetry, with logs, traces and metrics in a single application. An open-source alternative to DataDog, NewRelic, etc.
https://signoz.io

Failed to process entry #1857

Closed: markuman closed this issue 1 year ago

markuman commented 1 year ago

Bug description

We are using fluentd for logging in the docker-compose file:

    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224

That works fine in 0.11.4, but is failing now in 0.12.0.

clickhouse-setup-otel-collector-1          | 2022-12-12T11:17:27.497Z    error    helper/transformer.go:110    Failed to process entry    {"kind": "receiver", "name": "filelog/dockercontainers", "pipeline": "logs", "operator_id": "parser-docker", "operator_type": "json_parser", "error": "ReadMapCB: expect { or n, but found \u0000, error found in #1 byte of ...|\u0000\u0000\u0001\u0010|..., bigger context ...|\u0000\u0000\u0001\u0010|...", "action": "send", "entry": {"observed_timestamp":"2022-12-12T11:17:27.4977265Z","timestamp":"0001-01-01T00:00:00Z","body":"\u0000\u0000\u0001\u0010","attributes":{"log.file.path":"/var/lib/docker/containers/c399a493738c55dedfef60509b05998363f96f5c8fe332e5246907bb243c91d3/container-cached.log"},"severity":0,"scope_name":""}}
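The rejected body is binary: the log.file.path in the error ends in container-cached.log, the local cache Docker keeps in a binary format when a remote logging driver such as fluentd is configured, so the json_parser has no JSON to read. One way to keep the filelog receiver away from those cache files, as a sketch assuming the receiver's exclude option:

receivers:
  filelog/dockercontainers:
    include: [ "/var/lib/docker/containers/*/*.log" ]
    # assumption: skip Docker's binary dual-logging cache files,
    # which are not JSON and trip the json_parser operator
    exclude: [ "/var/lib/docker/containers/*/container-cached.log" ]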


nityanandagohain commented 1 year ago

From the error, it seems it's for the logs coming from filelog/dockercontainers, as there is no mention of fluentd.

Can you share your otel collector config so that we can debug this further?

markuman commented 1 year ago

@nityanandagohain

docker-compose.yml

version: "2.4"

x-clickhouse-defaults: &clickhouse-defaults
  restart: on-failure
  image: clickhouse/clickhouse-server:22.8.8-alpine
  tty: true
  depends_on:
    - zookeeper-1
#  logging:
#    options:
#      max-size: 50m
#      max-file: "3"
  healthcheck:
    # "clickhouse", "client", "-u ${CLICKHOUSE_USER}", "--password ${CLICKHOUSE_PASSWORD}", "-q 'SELECT 1'"
    test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
    interval: 30s
    timeout: 5s
    retries: 3
  ulimits:
    nproc: 65535
    nofile:
      soft: 262144
      hard: 262144

x-clickhouse-depend: &clickhouse-depend
  depends_on:
    clickhouse:
      condition: service_healthy

services:
  zookeeper-1:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
    image: bitnami/zookeeper:3.7.0
    container_name: zookeeper-1
    hostname: zookeeper-1
    user: root
    ports:
      - "2181:2181"
      - "2888:2888"
      - "3888:3888"
    volumes:
      - ./data/zookeeper-1:/bitnami/zookeeper
    environment:
      - ZOO_SERVER_ID=1
      # - ZOO_SERVERS=0.0.0.0:2888:3888,zookeeper-2:2888:3888,zookeeper-3:2888:3888
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOO_AUTOPURGE_INTERVAL=1

  clickhouse:
    <<: *clickhouse-defaults
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
    container_name: clickhouse
    hostname: clickhouse
    ports:
      - "9000:9000"
      - "8123:8123"
      - "9181:9181"
    volumes:
      - ./clickhouse-config.xml:/etc/clickhouse-server/config.xml
      - ./clickhouse-users.xml:/etc/clickhouse-server/users.xml
      - ./clickhouse-cluster.xml:/etc/clickhouse-server/config.d/cluster.xml
      # - ./clickhouse-storage.xml:/etc/clickhouse-server/config.d/storage.xml
      - ./data/clickhouse/:/var/lib/clickhouse/

  alertmanager:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
    image: signoz/alertmanager:0.23.0-0.2
    volumes:
      - ./data/alertmanager:/data
    depends_on:
      query-service:
        condition: service_healthy
    restart: on-failure
    command:
      - --queryService.url=http://query-service:8085
      - --storage.path=/data

# Notes for Maintainers/Contributors who will change Line Numbers of Frontend & Query-Section. Please Update Line Numbers in `./scripts/commentLinesForSetup.sh` & `./CONTRIBUTING.md`

  query-service:
    <<: *clickhouse-depend
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
    image: signoz/query-service:0.12.0
    container_name: query-service
    command: ["-config=/root/config/prometheus.yml"]
    # ports:
    #   - "6060:6060"     # pprof port
    #   - "8080:8080"     # query-service port
    volumes:
      - ./prometheus.yml:/root/config/prometheus.yml
      - ../dashboards:/root/config/dashboards
      - ./data/signoz/:/var/lib/signoz/
    environment:
      - ClickHouseUrl=tcp://clickhouse:9000/?database=signoz_traces
      - ALERTMANAGER_API_PREFIX=http://alertmanager:9093/api/
      - SIGNOZ_LOCAL_DB_PATH=/var/lib/signoz/signoz.db
      - DASHBOARDS_PATH=/root/config/dashboards
      - STORAGE=clickhouse
      - GODEBUG=netdns=go
      - TELEMETRY_ENABLED=true
      - DEPLOYMENT_TYPE=docker-standalone-amd
    restart: on-failure
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8080/api/v1/version"]
      interval: 30s
      timeout: 5s
      retries: 3

  frontend:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
    image: signoz/frontend:0.12.0
    container_name: frontend
    restart: on-failure
    depends_on:
      - alertmanager
      - query-service
    ports:
      - "3301:3301"
    volumes:
      - ../common/nginx-config.conf:/etc/nginx/conf.d/default.conf

  otel-collector:
    <<: *clickhouse-depend
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
    image: signoz/signoz-otel-collector:0.66.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    user: root # required for reading docker container logs
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    environment:
      - OTEL_RESOURCE_ATTRIBUTES=host.name=signoz-host,os.type=linux
      - DOCKER_MULTI_NODE_CLUSTER=false
    ports:
      # - "1777:1777"     # pprof extension
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
      # - "8888:8888"     # OtelCollector internal metrics
      # - "8889:8889"     # signoz spanmetrics exposed by the agent
      # - "9411:9411"     # Zipkin port
      # - "13133:13133"   # health check extension
      # - "14250:14250"   # Jaeger gRPC
      # - "14268:14268"   # Jaeger thrift HTTP
      # - "55678:55678"   # OpenCensus receiver
      # - "55679:55679"   # zPages extension
    restart: on-failure

  otel-collector-metrics:
    <<: *clickhouse-depend
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
    image: signoz/signoz-otel-collector:0.66.0
    command: ["--config=/etc/otel-collector-metrics-config.yaml"]
    volumes:
      - ./otel-collector-metrics-config.yaml:/etc/otel-collector-metrics-config.yaml
    # ports:
    #   - "1777:1777"     # pprof extension
    #   - "8888:8888"     # OtelCollector internal metrics
    #   - "13133:13133"   # Health check extension
    #   - "55679:55679"   # zPages extension
    restart: on-failure

fluent-bit config for debugging/testing

[SERVICE]
    Flush     1
    Daemon    off
    Log_Level info

[INPUT]
    Name        forward
    Listen      localhost
    Port        24224
    Tag_Prefix  signoz

[OUTPUT]
    Name stdout
    Match **
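Note that this fluent-bit instance only echoes forwarded records to stdout. For the collector to ingest logs sent via the fluentd driver, it needs a fluentforward receiver in its logs pipeline, as in the maintainer's config further down; a minimal sketch:

receivers:
  fluentforward:
    # accepts fluentd forward-protocol traffic from the docker logging driver
    endpoint: 0.0.0.0:24224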

nityanandagohain commented 1 year ago

Can you share otel-collector-config.yaml as well?

markuman commented 1 year ago

@nityanandagohain

receivers:
  filelog/dockercontainers:
    include: [  "/var/lib/docker/containers/*/*.log" ]
    start_at: end
    include_file_path: true
    include_file_name: false
    operators:
    - type: json_parser
      id: parser-docker
      output: extract_metadata_from_filepath
      timestamp:
        parse_from: attributes.time
        layout: '%Y-%m-%dT%H:%M:%S.%LZ'
    - type: regex_parser
      id: extract_metadata_from_filepath
      regex: '^.*containers/(?P<container_id>[^_]+)/.*log$'
      parse_from: attributes["log.file.path"]
      output: parse_body
    - type: move
      id: parse_body
      from: attributes.log
      to: body
      output: time
    - type: remove
      id: time
      field: attributes.time
  opencensus:
    endpoint: 0.0.0.0:55678
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: localhost:12345
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
      # thrift_compact:
      #   endpoint: 0.0.0.0:6831
      # thrift_binary:
      #   endpoint: 0.0.0.0:6832
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      load: {}
      memory: {}
      disk: {}
      filesystem: {}
      network: {}
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        # otel-collector internal metrics
        - job_name: otel-collector
          static_configs:
          - targets:
            - localhost:8888

processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  signozspanmetrics/prometheus:
    metrics_exporter: prometheus
    latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s ]
    dimensions_cache_size: 10000
    dimensions:
      - name: service.namespace
        default: default
      - name: deployment.environment
        default: default
  # memory_limiter:
  #   # 80% of maximum memory up to 2G
  #   limit_mib: 1500
  #   # 25% of limit up to 2G
  #   spike_limit_mib: 512
  #   check_interval: 5s
  #
  #   # 50% of the maximum memory
  #   limit_percentage: 50
  #   # 20% of max memory usage spike expected
  #   spike_limit_percentage: 20
  # queued_retry:
  #   num_workers: 4
  #   queue_size: 100
  #   retry_on_failure: true
  resourcedetection:
    # Using OTEL_RESOURCE_ATTRIBUTES envvar, env detector adds custom labels.
    detectors: [env, system] # include ec2 for AWS, gce for GCP and azure for Azure.
    timeout: 2s

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777

exporters:
  clickhousetraces:
    datasource: tcp://clickhouse:9000/?database=signoz_traces
    docker_multi_node_cluster: ${DOCKER_MULTI_NODE_CLUSTER}

  clickhousemetricswrite:
    endpoint: tcp://clickhouse:9000/?database=signoz_metrics
    resource_to_telemetry_conversion:
      enabled: true
  prometheus:
    endpoint: 0.0.0.0:8889
  # logging: {}

  clickhouselogsexporter:
    dsn: tcp://clickhouse:9000/
    docker_multi_node_cluster: ${DOCKER_MULTI_NODE_CLUSTER}
    timeout: 5s
    sending_queue:
      queue_size: 100
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
  extensions:
    - health_check
    - zpages
    - pprof
  pipelines:
    traces:
      receivers: [jaeger, otlp]
      processors: [signozspanmetrics/prometheus, batch]
      exporters: [clickhousetraces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhousemetricswrite]
    metrics/generic:
      receivers: [hostmetrics, prometheus]
      processors: [resourcedetection, batch]
      exporters: [clickhousemetricswrite]
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
    logs:
      receivers: [otlp, filelog/dockercontainers]
      processors: [batch]
      exporters: [clickhouselogsexporter]

nityanandagohain commented 1 year ago

@markuman your otel collector config doesn't seem to contain the fluentforward receiver.

I have tried to reproduce this, but it's working fine on my machine.

The error that you posted is for docker container logs read from file; you can disable the filelog receiver if you are using the logging driver directly.

> That works fine in 0.11.4, but is failing now in 0.12.0.
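A sketch of what disabling the file-based receiver could look like, assuming the fluentforward receiver from the config below is in place and the rest of the pipeline stays as shipped:

service:
  pipelines:
    logs:
      # filelog/dockercontainers dropped; ingest only via OTLP and fluentd forward
      receivers: [otlp, fluentforward]
      processors: [batch]
      exporters: [clickhouselogsexporter]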

Here is my otel collector config

receivers:
  fluentforward:
    endpoint: 0.0.0.0:24224
  filelog/dockercontainers:
    include: [  "/var/lib/docker/containers/*/*.log" ]
    start_at: end
    include_file_path: true
    include_file_name: false
    operators:
    - type: json_parser
      id: parser-docker
      output: extract_metadata_from_filepath
      timestamp:
        parse_from: attributes.time
        layout: '%Y-%m-%dT%H:%M:%S.%LZ'
    - type: regex_parser
      id: extract_metadata_from_filepath
      regex: '^.*containers/(?P<container_id>[^_]+)/.*log$'
      parse_from: attributes["log.file.path"]
      output: parse_body
    - type: move
      id: parse_body
      from: attributes.log
      to: body
      output: time
    - type: remove
      id: time
      field: attributes.time
  opencensus:
    endpoint: 0.0.0.0:55678
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: localhost:12345
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
      # thrift_compact:
      #   endpoint: 0.0.0.0:6831
      # thrift_binary:
      #   endpoint: 0.0.0.0:6832
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      load: {}
      memory: {}
      disk: {}
      filesystem: {}
      network: {}
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        # otel-collector internal metrics
        - job_name: otel-collector
          static_configs:
            - targets:
                - localhost:8888
              labels:
                job_name: otel-collector

processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  signozspanmetrics/prometheus:
    metrics_exporter: prometheus
    latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 1400ms, 2000ms, 5s, 10s, 20s, 40s, 60s ]
    dimensions_cache_size: 100000
    dimensions:
      - name: service.namespace
        default: default
      - name: deployment.environment
        default: default
  # memory_limiter:
  #   # 80% of maximum memory up to 2G
  #   limit_mib: 1500
  #   # 25% of limit up to 2G
  #   spike_limit_mib: 512
  #   check_interval: 5s
  #
  #   # 50% of the maximum memory
  #   limit_percentage: 50
  #   # 20% of max memory usage spike expected
  #   spike_limit_percentage: 20
  # queued_retry:
  #   num_workers: 4
  #   queue_size: 100
  #   retry_on_failure: true
  resourcedetection:
    # Using OTEL_RESOURCE_ATTRIBUTES envvar, env detector adds custom labels.
    detectors: [env, system] # include ec2 for AWS, gce for GCP and azure for Azure.
    timeout: 2s

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777

exporters:
  clickhousetraces:
    datasource: tcp://clickhouse:9000/?database=signoz_traces
    docker_multi_node_cluster: ${DOCKER_MULTI_NODE_CLUSTER}
  clickhousemetricswrite:
    endpoint: tcp://clickhouse:9000/?database=signoz_metrics
    resource_to_telemetry_conversion:
      enabled: true
  clickhousemetricswrite/prometheus:
    endpoint: tcp://clickhouse:9000/?database=signoz_metrics
  prometheus:
    endpoint: 0.0.0.0:8889
  # logging: {}
  clickhouselogsexporter:
    dsn: tcp://clickhouse:9000/
    docker_multi_node_cluster: ${DOCKER_MULTI_NODE_CLUSTER}
    timeout: 5s
    sending_queue:
      queue_size: 100
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
  extensions:
    - health_check
    - zpages
    - pprof
  pipelines:
    traces:
      receivers: [jaeger, otlp]
      processors: [signozspanmetrics/prometheus, batch]
      exporters: [clickhousetraces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhousemetricswrite]
    metrics/generic:
      receivers: [hostmetrics]
      processors: [resourcedetection, batch]
      exporters: [clickhousemetricswrite]
    metrics/prometheus:
      receivers: [prometheus]
      processors: [batch]
      exporters: [clickhousemetricswrite/prometheus]
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
    logs:
      receivers: [otlp, fluentforward]
      processors: [batch]
      exporters: [clickhouselogsexporter]

Log generator (docker-compose)

version: "2.3"

services:
  flog-2:
    image: nityag123/flog:latest
    # Output fake log in JSON format with traces
    command: [ "--format=json_with_trace", "-k", "1500", "--bytes=5550000" ]
    stop_signal: SIGKILL
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224

Also expose the forward port on the otel-collector service:

    ports:
      - "24224:24224"

Please try the above configs and let us know.

nityanandagohain commented 1 year ago

@markuman Let me know if the above solution worked.

nityanandagohain commented 1 year ago

Closing this as there has been no update.