elastic / apm-server


apm-server drops a lot of metrics. Unclear where this happens #13541

Closed rpasche closed 3 months ago

rpasche commented 3 months ago

APM Server version (apm-server version):

8.13.3

Description of the problem including expected versus actual behavior:

APM Server seems to drop some metric points, but we cannot determine which ones or why. We noticed this while testing otel-collector-contrib, running it and sending all 3 signals to an APM server via OTLP/gRPC. The number of documents created in Elasticsearch did not match the number of metrics generated and passed through the collector.

When sending traces or logs, everything seems to work fine.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including server configuration, agent(s) used, etc. The easier you make it for us to reproduce it, the more likely that somebody will take the time to look at it.

  1. Run the apm-server binary connected to Elasticsearch with the HTTP server enabled:
     monitoring.enabled: false
     http:
       enabled: true
     ...
  2. Have an otel-collector running, forwarding to APM (currently I do not have a minimal setup, so I try to write an example config here):
     exporters:
       otlp/apm:
         endpoint: apm-server:8200
     extensions:
       health_check: {}
       zpages: {}
     processors:
       batch: {}
       memory_limiter:
         check_interval: 5s
         limit_percentage: 80
         spike_limit_percentage: 19
     receivers:
       otlp:
         protocols:
           grpc:
             endpoint: 0.0.0.0:4317
           http:
             endpoint: 0.0.0.0:4318
     service:
       extensions:
         - health_check
         - zpages
       pipelines:
         logs:
           exporters:
             - otlp/apm
           processors:
             - memory_limiter
             - batch
           receivers:
             - otlp
         metrics:
           exporters:
             - otlp/apm
           processors:
             - memory_limiter
             - batch
           receivers:
             - otlp
         traces:
           exporters:
             - otlp/apm
           processors:
             - memory_limiter
             - batch
           receivers:
             - otlp
       telemetry:
         logs:
           encoding: json
           level: info
         metrics:
           address: 0.0.0.0:8888
           level: detailed
  3. It will help to have Prometheus running and scraping the collector
  4. Use telemetry-gen to generate some metrics (or traces or logs for cross-checks).

Please note my issue at https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33427 where - at least for me - sending a high volume of logs or metrics was not really possible

Provide logs (if relevant):

Make sure that the collector and APM server have been restarted (to clear the counters) and that nothing else is sending to them.

Generated 500000 metrics with

./telemetrygen metrics --otlp-endpoint <my-collector-endpoint>  --metrics=500000 --rate=0 
2024-07-02T08:52:48.008+0200    INFO    metrics/metrics.go:32   starting the metrics generator with configuration
...
2024-07-02T08:52:48.008+0200    INFO    metrics/metrics.go:83   generation of metrics isn't being throttled
2024-07-02T08:52:48.008+0200    INFO    metrics/metrics.go:51   starting gRPC exporter
...
2024-07-02T08:52:48.320+0200    INFO    grpc@v1.64.0/balancer_wrapper.go:103    [core][Channel #1]Channel switches to new LB policy "pick_first"    {"system": "grpc", "grpc_log": true}
2024-07-02T08:52:48.320+0200    INFO    grpc@v1.64.0/balancer_wrapper.go:170    [core][Channel #1 SubChannel #2]Subchannel created  {"system": "grpc", "grpc_log": true}
2024-07-02T08:52:48.320+0200    INFO    grpc@v1.64.0/clientconn.go:535  [core][Channel #1]Channel Connectivity change to CONNECTING {"system": "grpc", "grpc_log": true}
2024-07-02T08:52:48.320+0200    INFO    grpc@v1.64.0/clientconn.go:1210 [core][Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING    {"system": "grpc", "grpc_log": true}
2024-07-02T08:52:48.321+0200    INFO    grpc@v1.64.0/clientconn.go:1326 [core][Channel #1 SubChannel #2]Subchannel picks a new address "XXX" to connect {"system": "grpc", "grpc_log": true}
2024-07-02T08:52:48.472+0200    INFO    grpc@v1.64.0/clientconn.go:1210 [core][Channel #1 SubChannel #2]Subchannel Connectivity change to READY {"system": "grpc", "grpc_log": true}
2024-07-02T08:52:48.472+0200    INFO    grpc@v1.64.0/clientconn.go:535  [core][Channel #1]Channel Connectivity change to READY  {"system": "grpc", "grpc_log": true}
2024-07-02T08:53:14.846+0200    INFO    metrics/worker.go:120   metrics generated   {"worker": 0, "metrics": 500000}
2024-07-02T08:53:14.846+0200    INFO    metrics/worker.go:43    stopping the exporter   {"worker": 0}
2024-07-02T08:53:14.846+0200    INFO    grpc@v1.64.0/clientconn.go:535  [core][Channel #1]Channel Connectivity change to SHUTDOWN   {"system": "grpc", "grpc_log": true}
2024-07-02T08:53:14.846+0200    INFO    grpc@v1.64.0/resolver_wrapper.go:100    [core][Channel #1]Closing the name resolver {"system": "grpc", "grpc_log": true}
2024-07-02T08:53:14.846+0200    INFO    grpc@v1.64.0/balancer_wrapper.go:135    [core][Channel #1]ccBalancerWrapper: closing    {"system": "grpc", "grpc_log": true}
2024-07-02T08:53:14.846+0200    INFO    grpc@v1.64.0/clientconn.go:1210 [core][Channel #1 SubChannel #2]Subchannel Connectivity change to SHUTDOWN  {"system": "grpc", "grpc_log": true}
2024-07-02T08:53:14.846+0200    INFO    grpc@v1.64.0/clientconn.go:1154 [core][Channel #1 SubChannel #2]Subchannel deleted  {"system": "grpc", "grpc_log": true}

Note the metrics generated {"worker": 0, "metrics": 500000}

In Grafana / Prometheus, check the metrics

otelcol_exporter_sent_metric_points
otelcol_processor_batch_batch_send_size_count
otelcol_processor_batch_batch_send_size_sum

Connect to the APM server monitoring port (default: 5066) and get the stats (GET apm:5066/stats)

Compare the numbers from Prometheus with the counters provided by the APM server. Also check the number of documents ingested into the final index within Elasticsearch.
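
For convenience, here is a small Go sketch (not part of the original report) that fetches that /stats endpoint and prints the counters being compared; the apm-server host name and the JSON paths are assumptions based on the output pasted further down, so adjust them to your setup.

// fetchstats.go: pull apm-server's monitoring stats and print the counters
// relevant to this issue. Assumes the monitoring HTTP server is reachable at
// http://apm-server:5066 (see step 1 above).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	resp, err := http.Get("http://apm-server:5066/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}

	// dig walks nested JSON objects by key and returns nil if a key is missing.
	dig := func(m map[string]any, keys ...string) any {
		var cur any = m
		for _, k := range keys {
			obj, ok := cur.(map[string]any)
			if !ok {
				return nil
			}
			cur = obj[k]
		}
		return cur
	}

	fmt.Println("otlp grpc metrics requests:", dig(stats, "apm-server", "otlp", "grpc", "metrics", "request", "count"))
	fmt.Println("metric transformations:   ", dig(stats, "apm-server", "processor", "metric", "transformations"))
	fmt.Println("pipeline events total:    ", dig(stats, "libbeat", "pipeline", "events", "total"))
}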

See my logs and pictures from Grafana and APM stats

[screenshot: Grafana panels showing otelcol_exporter_sent_metric_points and otelcol_processor_batch_batch_send_size]

otelcol_exporter_sent_metric_points and otelcol_processor_batch_batch_send_size_sum show that in total 500000 metric points have been sent, all of them in batches. The number of batches is 132.

See the output of the stats:

{
  "apm-server": {
    "acm": {
      "request": {
        "count": 0
      },
      "response": {
        "count": 0,
        "errors": {
          "closed": 0,
          "count": 0,
          "decode": 0,
          "forbidden": 0,
          "internal": 0,
          "invalidquery": 0,
          "method": 0,
          "notfound": 0,
          "queue": 0,
          "ratelimit": 0,
          "timeout": 0,
          "toolarge": 0,
          "unauthorized": 0,
          "unavailable": 0,
          "validate": 0
        },
        "valid": {
          "accepted": 0,
          "count": 0,
          "notmodified": 0,
          "ok": 0
        }
      },
      "unset": 0
    },
    "agentcfg": {
      "elasticsearch": {
        "cache.entries.count": 0,
        "cache.refresh.failures": 0,
        "cache.refresh.successes": 6,
        "fetch.es": 0,
        "fetch.fallback": 0,
        "fetch.invalid": 0,
        "fetch.unavailable": 0
      }
    },
    "jaeger": {
      "grpc": {
        "collect": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "sampling": {
          "event": {
            "received": {
              "count": 0
            }
          },
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0
            },
            "valid": {
              "count": 0
            }
          }
        }
      }
    },
    "otlp": {
      "grpc": {
        "logs": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "metrics": {
          "consumer": {
            "unsupported_dropped": 0
          },
          "request": {
            "count": 132                                 <<==== see this
          },
          "response": {
            "count": 132,                                <<==== see this
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 132                               <<==== see this
            }
          }
        },
        "traces": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        }
      },
      "http": {
        "logs": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "metrics": {
          "consumer": {
            "unsupported_dropped": 0
          },
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "traces": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        }
      }
    },
    "processor": {
      "metric": {
        "transformations": 332092                               <<==== see this
      },
      "stream": {
        "accepted": 0,
        "errors": {
          "invalid": 0,
          "toolarge": 0
        }
      }
    },
    "root": {
      "request": {
        "count": 24
      },
      "response": {
        "count": 24,
        "errors": {
          "closed": 0,
          "count": 0,
          "decode": 0,
          "forbidden": 0,
          "internal": 0,
          "invalidquery": 0,
          "method": 0,
          "notfound": 0,
          "queue": 0,
          "ratelimit": 0,
          "timeout": 0,
          "toolarge": 0,
          "unauthorized": 0,
          "unavailable": 0,
          "validate": 0
        },
        "valid": {
          "accepted": 0,
          "count": 24,
          "notmodified": 0,
          "ok": 24
        }
      },
      "unset": 0
    },
    "sampling": {
      "transactions_dropped": 0
    },
    "server": {
      "request": {
        "count": 0
      },
      "response": {
        "count": 0,
        "errors": {
          "closed": 0,
          "count": 0,
          "decode": 0,
          "forbidden": 0,
          "internal": 0,
          "invalidquery": 0,
          "method": 0,
          "notfound": 0,
          "queue": 0,
          "ratelimit": 0,
          "timeout": 0,
          "toolarge": 0,
          "unauthorized": 0,
          "unavailable": 0,
          "validate": 0
        },
        "valid": {
          "accepted": 0,
          "count": 0,
          "notmodified": 0,
          "ok": 0
        }
      },
      "unset": 0
    }
  },
  "beat": {
    "cgroup": {
      "cpu": {
        "id": "/",
        "stats": {
          "periods": 330,
          "throttled": {
            "ns": 0,
            "periods": 0
          }
        }
      },
      "memory": {
        "id": "/",
        "mem": {
          "usage": {
            "bytes": 145399808
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 380,
        "time": {
          "ms": 380
        }
      },
      "total": {
        "ticks": 8460,
        "time": {
          "ms": 8460
        },
        "value": 8460
      },
      "user": {
        "ticks": 8080,
        "time": {
          "ms": 8080
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 1048576,
        "soft": 1048576
      },
      "open": 16
    },
    "info": {
      "ephemeral_id": "a8e7858c-94a7-4e97-967a-37b86888275d",
      "name": "apm-server",
      "uptime": {
        "ms": 158022
      },
      "version": "8.13.3"
    },
    "memstats": {
      "gc_next": 135786032,
      "memory_alloc": 125878360,
      "memory_sys": 171388200,
      "memory_total": 1516219080,
      "rss": 179290112
    },
    "runtime": {
      "goroutines": 36
    }
  },
  "libbeat": {
    "output": {
      "events": {
        "acked": 332092,                                         <<==== see this
        "active": 0,
        "batches": 67,
        "failed": 0,
        "toomany": 0,
        "total": 332092                                            <<==== see this
      },
      "type": "elasticsearch",
      "write": {
        "bytes": 2001780
      }
    },
    "pipeline": {
      "events": {
        "total": 332092                                            <<==== see this
      }
    }
  },
  "output": {
    "elasticsearch": {
      "bulk_requests": {
        "available": 13,
        "completed": 67
      },
      "indexers": {
        "active": 1,
        "created": 0,
        "destroyed": 0
      }
    }
  },
  "system": {
    "cpu": {
      "cores": 6
    },
    "load": {
      "1": 0.13,
      "15": 0.12,
      "5": 0.1,
      "norm": {
        "1": 0.0217,
        "15": 0.02,
        "5": 0.0167
      }
    }
  }
}

otlp.grpc.metrics.request.count matches the 132 batches reported by the collector, but libbeat.pipeline.events.total is way lower than the 500000 metrics sent: it is only 332092 (same as processor.metric.transformations: 332092).

See also the screenshot from Elasticsearch (also seems to be missing 2 documents).

[screenshot: Elasticsearch document count for the metrics index]

Please also note: The APM index created in elasticsearch is NOT a TSDS (index.mode: time_series)

If I redo those tests with logs or traces, the number of batches and total number of traces or logs reported by the OTEL collector and APM do match.

Just to summarize: in this test, only 66.1% of the total datapoints sent have been "seen" (processed?) by APM. We also do not have any kind of filter configured within APM.

carsonip commented 3 months ago

I tried to reproduce it with your setup of telemetrygen + collector + apm-server (on main branch) but failed. ES is returning all 500k documents.

2024-07-02T18:57:18.708+0100    INFO    metrics/worker.go:107   metrics generated       {"worker": 0, "metrics": 500000}
2024-07-02T18:57:18.708+0100    INFO    metrics/worker.go:43    stopping the exporter   {"worker": 0}
{"apm-server":{"acm":{"request":{"count":0},"response":{"count":0,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":0,"notmodified":0,"ok":0}},"unset":0},"agentcfg":{"elasticsearch":{"cache.entries.count":0,"cache.refresh.failures":0,"cache.refresh.successes":5,"fetch.es":0,"fetch.fallback":0,"fetch.invalid":0,"fetch.unavailable":0}},"jaeger":{"grpc":{"collect":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"sampling":{"event":{"received":{"count":0}},"request":{"count":0},"response":{"count":0,"errors":{"count":0},"valid":{"count":0}}}}},"otlp":{"grpc":{"logs":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"metrics":{"consumer":{"unsupported_dropped":0},"request":{"count":639},"response":{"count":639,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":639}}},"traces":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}}},"http":{"logs":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"metrics":{"consumer":{"unsupported_dropped":0},"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"traces":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}}}},"processor":{"metric":{"transformations":500003},"stream":{"accepted":0,"errors":{"invalid":0,"toolarge":0}}},"root":{"request":{"count":1},"response":{"count":1,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":1,"notmodified":0,"ok":1}},"unset":0},"sampling":{"transactions_dropped":0},"server":{"request":{"count":0},"response":{"count":0,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":0,"notmodified":0,"ok":0}},"unset":0}},"beat":{"cgroup":{"cpu":{"id":"session-c2.scope","stats":{"periods":0,"throttled":{"ns":0,"periods":0}}},"memory":{"id":"session-c2.scope","mem":{"usage":{"bytes":41311178752}}}},"cpu":{"system":{"ticks":950,"time":{"ms":950}},"total":{"ticks":10970,"time":{"ms":10970},"value":10970},"user":{"ticks":10020,"time":{"ms":10020}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":15},"info":{"ephemeral_id":"79a06409-db09-47ae-8e6d-98c28f66a49b","name":"apm-server","uptime":{"ms":142860},"version":"8.15.0"},"memstats":{"gc_next":199843960,"memory_alloc":173394320,"memory_sys":259242296,"memory_total":2350429840,"rss":264138752},"runtime":{"goroutines":48}},"libbeat":{"output":{"events":{"acked":500003,"active":0,"batches":126,"failed":0,"toomany":0,"total":500003},"type":"elasticsearch","write":{"bytes":2553816}},"pipeline":{"events":{"total":500003}}},"output":{"elasticsearch":{"bulk_requests":{"available":60,"completed":126},"indexers":{"active":1,"created":0,"destroyed":0
}}},"system":{"cpu":{"cores":16},"load":{"1":5.28,"15":4.58,"5":4.51,"norm":{"1":0.33,"15":0.2863,"5":0.2819}}}}

(also seem to be missing 2 documents).

The 2 "missing" documents could be some other internal metrics, and I wouldn't worry too much about them. But I agree that only 66.1% of datapoints being ingested may reveal a problem somewhere.

rpasche commented 3 months ago

Hey @carsonip, thanks for reporting back. I will try to reproduce this without the collector in between. May I ask how long it took telemetrygen to send those 500000 metrics?

rpasche commented 3 months ago

Ok. I retested this directly from telemetrygen to APM (8.13.3). The number of "batches" matches 500 (which totally makes sense, as it was configured to send in batches of 1000). But still, a lot of "metrics" are missing.

  "apm-server": {
    "acm": {
      "request": {
        "count": 0
      },
      "response": {
        "count": 0,
        "errors": {
          "closed": 0,
          "count": 0,
          "decode": 0,
          "forbidden": 0,
          "internal": 0,
          "invalidquery": 0,
          "method": 0,
          "notfound": 0,
          "queue": 0,
          "ratelimit": 0,
          "timeout": 0,
          "toolarge": 0,
          "unauthorized": 0,
          "unavailable": 0,
          "validate": 0
        },
        "valid": {
          "accepted": 0,
          "count": 0,
          "notmodified": 0,
          "ok": 0
        }
      },
      "unset": 0
    },
    "agentcfg": {
      "elasticsearch": {
        "cache.entries.count": 0,
        "cache.refresh.failures": 0,
        "cache.refresh.successes": 27,
        "fetch.es": 0,
        "fetch.fallback": 0,
        "fetch.invalid": 0,
        "fetch.unavailable": 0
      }
    },
    "jaeger": {
      "grpc": {
        "collect": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "sampling": {
          "event": {
            "received": {
              "count": 0
            }
          },
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0
            },
            "valid": {
              "count": 0
            }
          }
        }
      }
    },
    "otlp": {
      "grpc": {
        "logs": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "metrics": {
          "consumer": {
            "unsupported_dropped": 0
          },
          "request": {
            "count": 500
          },
          "response": {
            "count": 500,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 500
            }
          }
        },
        "traces": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        }
      },
      "http": {
        "logs": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "metrics": {
          "consumer": {
            "unsupported_dropped": 0
          },
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        },
        "traces": {
          "request": {
            "count": 0
          },
          "response": {
            "count": 0,
            "errors": {
              "count": 0,
              "ratelimit": 0,
              "timeout": 0,
              "unauthorized": 0
            },
            "valid": {
              "count": 0
            }
          }
        }
      }
    },
    "processor": {
      "metric": {
        "transformations": 309287
      },
      "stream": {
        "accepted": 0,
        "errors": {
          "invalid": 0,
          "toolarge": 0
        }
      }
    },
    "root": {
      "request": {
        "count": 152
      },
      "response": {
        "count": 152,
        "errors": {
          "closed": 0,
          "count": 0,
          "decode": 0,
          "forbidden": 0,
          "internal": 0,
          "invalidquery": 0,
          "method": 0,
          "notfound": 0,
          "queue": 0,
          "ratelimit": 0,
          "timeout": 0,
          "toolarge": 0,
          "unauthorized": 0,
          "unavailable": 0,
          "validate": 0
        },
        "valid": {
          "accepted": 0,
          "count": 152,
          "notmodified": 0,
          "ok": 152
        }
      },
      "unset": 0
    },
    "sampling": {
      "transactions_dropped": 0
    },
    "server": {
      "request": {
        "count": 0
      },
      "response": {
        "count": 0,
        "errors": {
          "closed": 0,
          "count": 0,
          "decode": 0,
          "forbidden": 0,
          "internal": 0,
          "invalidquery": 0,
          "method": 0,
          "notfound": 0,
          "queue": 0,
          "ratelimit": 0,
          "timeout": 0,
          "toolarge": 0,
          "unauthorized": 0,
          "unavailable": 0,
          "validate": 0
        },
        "valid": {
          "accepted": 0,
          "count": 0,
          "notmodified": 0,
          "ok": 0
        }
      },
      "unset": 0
    }
  },
  "beat": {
    "cgroup": {
      "cpu": {
        "id": "/",
        "stats": {
          "periods": 715,
          "throttled": {
            "ns": 0,
            "periods": 0
          }
        }
      },
      "memory": {
        "id": "/",
        "mem": {
          "usage": {
            "bytes": 124637184
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 770,
        "time": {
          "ms": 770
        }
      },
      "total": {
        "ticks": 8510,
        "time": {
          "ms": 8510
        },
        "value": 8510
      },
      "user": {
        "ticks": 7740,
        "time": {
          "ms": 7740
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 1048576,
        "soft": 1048576
      },
      "open": 17
    },
    "info": {
      "ephemeral_id": "82b42d56-27c2-4662-a40f-e7f60679044e",
      "name": "apm-server",
      "uptime": {
        "ms": 783586
      },
      "version": "8.13.3"
    },
    "memstats": {
      "gc_next": 125044976,
      "memory_alloc": 92164936,
      "memory_sys": 170929448,
      "memory_total": 1227695232,
      "rss": 156528640
    },
    "runtime": {
      "goroutines": 35
    }
  },
  "libbeat": {
    "output": {
      "events": {
        "acked": 309287,
        "active": 0,
        "batches": 41,
        "failed": 0,
        "toomany": 0,
        "total": 309287
      },
      "type": "elasticsearch",
      "write": {
        "bytes": 1555288
      }
    },
    "pipeline": {
      "events": {
        "total": 309287
      }
    }
  },
  "output": {
    "elasticsearch": {
      "bulk_requests": {
        "available": 13,
        "completed": 41
      },
      "indexers": {
        "active": 1,
        "created": 0,
        "destroyed": 0
      }
    }
  },
  "system": {
    "cpu": {
      "cores": 6
    },
    "load": {
      "1": 0.15,
      "15": 0.37,
      "5": 0.21,
      "norm": {
        "1": 0.025,
        "15": 0.0617,
        "5": 0.035
      }
    }
  }
}

I know I'm not a Go expert and I adjusted telemetry-gen on my workstation to be able to send batches of metrics and logs, but because the OTel Collector also reports that the exact same number of metrics (and traces and logs) has been sent, I still doubt that the problem lies with telemetry-gen.

Maybe something got fixed in the main branch of APM

inge4pres commented 3 months ago

@carsonip what is the meaning of these metrics in the es output?

  "output": {
    "elasticsearch": {
      "bulk_requests": {
        "available": 13,
        "completed": 41
      },

I am specifically thinking of the "available" number, which suspiciously is 1/3rd of all the completed requests, just like 1/3rd of the metrics is missing.

but because the OTel Collector also reports that the exact same number of metrics (and traces and logs) has been sent, I still doubt that the problem lies with telemetry-gen.

@rpasche are you saying that in your latest test the number of metrics ingested by an OTel collector and APM is the same? And this is also valid for traces and logs? Did I understand correctly that your setup is

telemetry-gen -> OTel collector -> APM server

If yes, is it possible that the OTel collector is aggregating metrics somehow, before sending them into APM?

rpasche commented 3 months ago

@inge4pres no. What I wanted to say is that I only have this issue when sending metrics. I didn't see it when sending traces or logs.

And I had the issue again when using telemetry-gen -> APM directly.

carsonip commented 3 months ago

how it took telemetry-gen to send those 500000 metrics?

If you're asking how long it took, for me it took 2m7s.

carsonip commented 3 months ago

@carsonip what is the meaning of these metrics in the es output? I am specifically thinking of the "available" number, which suspiciously is 1/3rd of all the completed requests, just like 1/3rd of the metrics is missing.

AFAIK that is not what it means: "available" should be the number of bulk requests available and waiting for work. I believe it is merely a coincidence. See the apm-server code and docappender code.

Ok. Did retest it directly from telemetry-gen to APM (8.13.3). The number of "batches" matches 500 (which totally makes sense, as it was configured to run in batches of 1000

@rpasche where is this "1000" config?

rpasche commented 3 months ago

@carsonip what is the meaning of these metrics in the es output? I am specifically thinking of the "available" number, which suspiciously is 1/3rd of all the completed requests, just like 1/3rd of the metrics is missing.

AFAIK it is not at all what it means. "available" should be the number of bulk requests waiting for work. I believe it is merely a coincidence. See apm-server code and docappender code

Ok. Did retest it directly from telemetry-gen to APM (8.13.3). The number of "batches" matches 500 (which totally makes sense, as it was configured to run in batches of 1000

@rpasche where is this "1000" config?

@carsonip I mentioned issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33427 at the beginning, where I noticed that for metrics and logs, telemetry-gen sends every single metric or log in its own request. So I made a quick and dirty adjustment with a fixed batch size of 1000 to be able to generate higher load (as it already works with traces), because if I use the latest version of telemetry-gen, I still have that problem: trying to send 500000 metrics would take hours. See again the example of sending just 500 metrics - it takes 23 seconds:

2024-07-04T08:22:14.804+0200    INFO    grpc@v1.65.0/clientconn.go:539  [core][Channel #1]Channel Connectivity change to READY  {"system": "grpc", "grpc_log": true}
2024-07-04T08:22:37.660+0200    INFO    metrics/worker.go:107   metrics generated   {"worker": 0, "metrics": 500}
2024-07-04T08:22:37.660+0200    INFO    metrics/worker.go:43    stopping the exporter   {"worker": 0}

This is also the reason I wondered how you were able to send those 500000 metrics that fast (in just ~2 minutes).

I was not yet able to check the latest main branch of APM. I want to test this as well.

rpasche commented 3 months ago

Just an update: I was able to build a new Docker image based on the latest main branch. The test still needs to be done.

carsonip commented 3 months ago

If you are only experiencing this issue with metrics, but not logs or traces, here's a possible explanation. In apm-server, within the same OTLP batch, under the same resource and scope metrics, we group metric data points by timestamp and data point attributes, and store each group as 1 Elasticsearch document.

In my experiment to reproduce the issue the timestamps were never the same, but it is possible that some data points share the same timestamp, e.g. due to a system clock with lower precision, which triggers the grouping behavior and results in fewer documents than data points. This is expected. You may find the code here.
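
To make that grouping concrete, here is a deliberately simplified sketch of the idea (my own illustration, not the actual apm-server code): data points that end up with the same timestamp and the same attributes share one grouping key, i.e. one document, and the later point overwrites the earlier one.

package main

import (
	"fmt"
	"time"
)

// dataPoint is a toy stand-in for an OTLP number data point.
type dataPoint struct {
	metric string
	ts     time.Time
	attrs  string // attributes flattened to "k=v,..." for illustration
	value  float64
}

func main() {
	now := time.Now().Truncate(time.Millisecond) // a coarse clock produces identical timestamps
	points := []dataPoint{
		{"gen", now, "label=A", 1},
		{"gen", now, "label=A", 2}, // same timestamp + attributes: collapses into the previous point
		{"gen", now.Add(time.Millisecond), "label=A", 3},
	}

	// Group by (timestamp, attributes); each group becomes one document.
	docs := map[string]map[string]float64{}
	for _, p := range points {
		key := p.ts.Format(time.RFC3339Nano) + "|" + p.attrs
		if docs[key] == nil {
			docs[key] = map[string]float64{}
		}
		docs[key][p.metric] = p.value // the last point with the same key wins
	}

	fmt.Printf("%d data points -> %d documents\n", len(points), len(docs)) // prints: 3 data points -> 2 documents
}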

Feel free to craft an input otlp dataset that has a different timestamp for every data point, and see if documents are still "missing".
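
As a sketch of what such a dataset could look like (my assumption of one way to build it, using the collector's pdata package, not something taken from this thread): give every data point its own timestamp and a unique attribute so that no two points share a grouping key, then send the payload to apm-server:8200 with any OTLP/gRPC client.

package main

import (
	"fmt"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	md := pmetric.NewMetrics()
	sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

	base := time.Now()
	for i := 0; i < 1000; i++ {
		m := sm.Metrics().AppendEmpty()
		m.SetName("gen")
		dp := m.SetEmptyGauge().DataPoints().AppendEmpty()
		// A distinct timestamp and attribute per point, so no two points can be grouped together.
		dp.SetTimestamp(pcommon.NewTimestampFromTime(base.Add(time.Duration(i) * time.Microsecond)))
		dp.Attributes().PutInt("point_id", int64(i))
		dp.SetDoubleValue(float64(i))
	}

	// Export md via an OTLP/gRPC client of your choice; the resulting document
	// count in Elasticsearch should then match this number.
	fmt.Println("data points generated:", md.DataPointCount())
}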

Closing this issue for now, as apm-server is behaving as expected.

rpasche commented 3 months ago

Hi @carsonip,

Ok. I know the basic idea of grouping multiple metrics or datapoints with the same attributes or labels from the Metricbeat prometheus module (see https://github.com/elastic/beats/blob/main/metricbeat/module/prometheus/remote_write/data.go#L60).

And it is - of course - good to group them into a single document where possible. The problem that "we" (and maybe others) have is that it is now unclear whether truly all datapoints have been processed. What would be good to see (from the internal metrics) is that in total those 500000 datapoints / metrics have been successfully processed but in the end live in only 309287 documents. That would already be sufficient. Right now, looking only at those numbers, it looks like only 309287 have "reached" APM, while the OTel collector and APM both report that everything is ok.

To my understanding, telemetry-gen only creates 1 datapoint per metric (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/telemetrygen/internal/metrics/worker.go#L57)

Will try to come up with another test.

carsonip commented 3 months ago

I understand your concern. While it looks as if only 309287 data points reached APM, all of them did arrive; only 309287 were processed meaningfully, because in a synthetic test like telemetrygen there are data points from the same metric sharing the same timestamp and attributes, and those are effectively silently dropped in the process of grouping. The later data point takes precedence in the current implementation. However, I don't think data points with the same metric name, same timestamp, and same attributes are common in real environments.

That said, I'll check with the team for areas to improve, especially in making it easier to observe when that really happens.