Closed: rpasche closed this issue 3 months ago
I tried to reproduce it with your setup of telemetrygen + collector + apm-server (on main branch) but failed. ES is returning all 500k documents.
2024-07-02T18:57:18.708+0100 INFO metrics/worker.go:107 metrics generated {"worker": 0, "metrics": 500000}
2024-07-02T18:57:18.708+0100 INFO metrics/worker.go:43 stopping the exporter {"worker": 0}
{"apm-server":{"acm":{"request":{"count":0},"response":{"count":0,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":0,"notmodified":0,"ok":0}},"unset":0},"agentcfg":{"elasticsearch":{"cache.entries.count":0,"cache.refresh.failures":0,"cache.refresh.successes":5,"fetch.es":0,"fetch.fallback":0,"fetch.invalid":0,"fetch.unavailable":0}},"jaeger":{"grpc":{"collect":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"sampling":{"event":{"received":{"count":0}},"request":{"count":0},"response":{"count":0,"errors":{"count":0},"valid":{"count":0}}}}},"otlp":{"grpc":{"logs":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"metrics":{"consumer":{"unsupported_dropped":0},"request":{"count":639},"response":{"count":639,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":639}}},"traces":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}}},"http":{"logs":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"metrics":{"consumer":{"unsupported_dropped":0},"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}},"traces":{"request":{"count":0},"response":{"count":0,"errors":{"count":0,"ratelimit":0,"timeout":0,"unauthorized":0},"valid":{"count":0}}}}},"processor":{"metric":{"transformations":500003},"stream":{"accepted":0,"errors":{"invalid":0,"toolarge":0}}},"root":{"request":{"count":1},"response":{"count":1,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":1,"notmodified":0,"ok":1}},"unset":0},"sampling":{"transactions_dropped":0},"server":{"request":{"count":0},"response":{"count":0,"errors":{"closed":0,"count":0,"decode":0,"forbidden":0,"internal":0,"invalidquery":0,"method":0,"notfound":0,"queue":0,"ratelimit":0,"timeout":0,"toolarge":0,"unauthorized":0,"unavailable":0,"validate":0},"valid":{"accepted":0,"count":0,"notmodified":0,"ok":0}},"unset":0}},"beat":{"cgroup":{"cpu":{"id":"session-c2.scope","stats":{"periods":0,"throttled":{"ns":0,"periods":0}}},"memory":{"id":"session-c2.scope","mem":{"usage":{"bytes":41311178752}}}},"cpu":{"system":{"ticks":950,"time":{"ms":950}},"total":{"ticks":10970,"time":{"ms":10970},"value":10970},"user":{"ticks":10020,"time":{"ms":10020}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":15},"info":{"ephemeral_id":"79a06409-db09-47ae-8e6d-98c28f66a49b","name":"apm-server","uptime":{"ms":142860},"version":"8.15.0"},"memstats":{"gc_next":199843960,"memory_alloc":173394320,"memory_sys":259242296,"memory_total":2350429840,"rss":264138752},"runtime":{"goroutines":48}},"libbeat":{"output":{"events":{"acked":500003,"active":0,"batches":126,"failed":0,"toomany":0,"total":500003},"type":"elasticsearch","write":{"bytes":2553816}},"pipeline":{"events":{"total":500003}}},"output":{"elasticsearch":{"bulk_requests":{"available":60,"completed":126},"indexers":{"active":1,"created":0,"destroyed":0}}},"system":{"cpu":{"cores":16},"load":{"1":5.28,"15":4.58,"5":4.51,"norm":{"1":0.33,"15":0.2863,"5":0.2819}}}}
(also seem to be missing 2 documents).
The 2 "missing" documents could be some other internal metrics, and I wouldn't worry too much about them. But I agree that ingesting only 66.1% of the data points may reveal a problem somewhere.
Hey @carsonip, thanks for reporting back. I will try to reproduce this without the collector in between. May I ask how long it took telemetry-gen to send those 500000 metrics?
OK. I retested directly from telemetry-gen to APM (8.13.3). The number of "batches" matches 500 (which totally makes sense, as it was configured to run in batches of 1000). But still, a lot of "metrics" are missing.
"apm-server": {
"acm": {
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"closed": 0,
"count": 0,
"decode": 0,
"forbidden": 0,
"internal": 0,
"invalidquery": 0,
"method": 0,
"notfound": 0,
"queue": 0,
"ratelimit": 0,
"timeout": 0,
"toolarge": 0,
"unauthorized": 0,
"unavailable": 0,
"validate": 0
},
"valid": {
"accepted": 0,
"count": 0,
"notmodified": 0,
"ok": 0
}
},
"unset": 0
},
"agentcfg": {
"elasticsearch": {
"cache.entries.count": 0,
"cache.refresh.failures": 0,
"cache.refresh.successes": 27,
"fetch.es": 0,
"fetch.fallback": 0,
"fetch.invalid": 0,
"fetch.unavailable": 0
}
},
"jaeger": {
"grpc": {
"collect": {
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"count": 0,
"ratelimit": 0,
"timeout": 0,
"unauthorized": 0
},
"valid": {
"count": 0
}
}
},
"sampling": {
"event": {
"received": {
"count": 0
}
},
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"count": 0
},
"valid": {
"count": 0
}
}
}
}
},
"otlp": {
"grpc": {
"logs": {
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"count": 0,
"ratelimit": 0,
"timeout": 0,
"unauthorized": 0
},
"valid": {
"count": 0
}
}
},
"metrics": {
"consumer": {
"unsupported_dropped": 0
},
"request": {
"count": 500
},
"response": {
"count": 500,
"errors": {
"count": 0,
"ratelimit": 0,
"timeout": 0,
"unauthorized": 0
},
"valid": {
"count": 500
}
}
},
"traces": {
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"count": 0,
"ratelimit": 0,
"timeout": 0,
"unauthorized": 0
},
"valid": {
"count": 0
}
}
}
},
"http": {
"logs": {
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"count": 0,
"ratelimit": 0,
"timeout": 0,
"unauthorized": 0
},
"valid": {
"count": 0
}
}
},
"metrics": {
"consumer": {
"unsupported_dropped": 0
},
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"count": 0,
"ratelimit": 0,
"timeout": 0,
"unauthorized": 0
},
"valid": {
"count": 0
}
}
},
"traces": {
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"count": 0,
"ratelimit": 0,
"timeout": 0,
"unauthorized": 0
},
"valid": {
"count": 0
}
}
}
}
},
"processor": {
"metric": {
"transformations": 309287
},
"stream": {
"accepted": 0,
"errors": {
"invalid": 0,
"toolarge": 0
}
}
},
"root": {
"request": {
"count": 152
},
"response": {
"count": 152,
"errors": {
"closed": 0,
"count": 0,
"decode": 0,
"forbidden": 0,
"internal": 0,
"invalidquery": 0,
"method": 0,
"notfound": 0,
"queue": 0,
"ratelimit": 0,
"timeout": 0,
"toolarge": 0,
"unauthorized": 0,
"unavailable": 0,
"validate": 0
},
"valid": {
"accepted": 0,
"count": 152,
"notmodified": 0,
"ok": 152
}
},
"unset": 0
},
"sampling": {
"transactions_dropped": 0
},
"server": {
"request": {
"count": 0
},
"response": {
"count": 0,
"errors": {
"closed": 0,
"count": 0,
"decode": 0,
"forbidden": 0,
"internal": 0,
"invalidquery": 0,
"method": 0,
"notfound": 0,
"queue": 0,
"ratelimit": 0,
"timeout": 0,
"toolarge": 0,
"unauthorized": 0,
"unavailable": 0,
"validate": 0
},
"valid": {
"accepted": 0,
"count": 0,
"notmodified": 0,
"ok": 0
}
},
"unset": 0
}
},
"beat": {
"cgroup": {
"cpu": {
"id": "/",
"stats": {
"periods": 715,
"throttled": {
"ns": 0,
"periods": 0
}
}
},
"memory": {
"id": "/",
"mem": {
"usage": {
"bytes": 124637184
}
}
}
},
"cpu": {
"system": {
"ticks": 770,
"time": {
"ms": 770
}
},
"total": {
"ticks": 8510,
"time": {
"ms": 8510
},
"value": 8510
},
"user": {
"ticks": 7740,
"time": {
"ms": 7740
}
}
},
"handles": {
"limit": {
"hard": 1048576,
"soft": 1048576
},
"open": 17
},
"info": {
"ephemeral_id": "82b42d56-27c2-4662-a40f-e7f60679044e",
"name": "apm-server",
"uptime": {
"ms": 783586
},
"version": "8.13.3"
},
"memstats": {
"gc_next": 125044976,
"memory_alloc": 92164936,
"memory_sys": 170929448,
"memory_total": 1227695232,
"rss": 156528640
},
"runtime": {
"goroutines": 35
}
},
"libbeat": {
"output": {
"events": {
"acked": 309287,
"active": 0,
"batches": 41,
"failed": 0,
"toomany": 0,
"total": 309287
},
"type": "elasticsearch",
"write": {
"bytes": 1555288
}
},
"pipeline": {
"events": {
"total": 309287
}
}
},
"output": {
"elasticsearch": {
"bulk_requests": {
"available": 13,
"completed": 41
},
"indexers": {
"active": 1,
"created": 0,
"destroyed": 0
}
}
},
"system": {
"cpu": {
"cores": 6
},
"load": {
"1": 0.15,
"15": 0.37,
"5": 0.21,
"norm": {
"1": 0.025,
"15": 0.0617,
"5": 0.035
}
}
}
}
I know I'm not a Go expert, and I adjusted telemetry-gen on my workstation to be able to send batches of metrics and logs. But because the OTel Collector also reports that the exact same number of metrics (and traces and logs) has been sent, I still doubt that the problem lies in telemetry-gen.
Maybe something got fixed in the main branch of APM.
@carsonip what is the meaning of these metrics in the es output?
"output": {
"elasticsearch": {
"bulk_requests": {
"available": 13,
"completed": 41
},
I am specifically thinking of the "available" number, which is suspiciously 1/3rd of all the completed requests, just like roughly 1/3rd of the metrics are missing.
but because the OTEL Collector is also reporting the exact same number of metrics (and traces and logs) have been sent, I still doubt, that the problem lies on telemetry-gen.
@rpasche are you saying that in your latest test the number of metrics ingested by an OTel collector and APM is the same? And is this also valid for traces and logs? Did I understand correctly that your setup is
telemetry-gen -> OTel collector -> APM server
If yes, is it possible that the OTel collector is aggregating metrics somehow before sending them to APM?
@inge4pres no. What I wanted to say is that I only have this issue when sending metrics. I didn't see the issue when sending traces or logs.
And I hit the issue again when using telemetry-gen -> APM.
how long it took telemetry-gen to send those 500000 metrics?
For me, it took 2m7s.
@carsonip what is the meaning of these metrics in the es output? I am specifically thinking of the "available" number, which suspiciously is 1/3rd of all the completed requests, just like 1/3rd of the metrics is missing.
AFAIK that is not what it means. "available" should be the number of bulk requests waiting for work; I believe the 1/3 ratio is merely a coincidence. See the apm-server code and the docappender code.
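Under that reading, a toy model may help (purely illustrative, in Python; this is not docappender's actual implementation, and the class and method names here are made up):

```python
class BulkRequestPool:
    """Toy model of a pool of bulk requests.

    'available' counts bulk requests that are idle, waiting for work;
    'completed' counts bulk requests that have finished a flush.
    """

    def __init__(self, size):
        self.available = size
        self.completed = 0

    def acquire(self):
        # A bulk request picks up work and is no longer available.
        self.available -= 1

    def release(self):
        # The bulk request finishes its flush and becomes available again.
        self.available += 1
        self.completed += 1

pool = BulkRequestPool(60)
for _ in range(126):  # 126 flushes, as in the stats quoted earlier
    pool.acquire()
    pool.release()
print(pool.available, pool.completed)  # 60 126
```

If this reading is right, "available": 13 in a snapshot just means most of the pool was busy at that moment, which is independent of how many documents were indexed.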
Ok. Did retest it directly from telemetry-gen to APM (8.13.3). The number of "batches" matches 500 (which totally makes sense, as it was configured to run in batches of 1000
@rpasche where is this "1000" config?
@carsonip at the beginning I mentioned issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33427, where I noticed that for metrics and logs, telemetry-gen sends every single metric or log in its own request. So I made a poor adjustment with a fixed batch size of 1000, to be able to generate higher load (as it works with traces). Because if I use the latest version of telemetry-gen, I still have that problem: trying to send 500000 metrics would take hours. See again this example of sending just 500 metrics; it takes 23 seconds:
2024-07-04T08:22:14.804+0200 INFO grpc@v1.65.0/clientconn.go:539 [core][Channel #1]Channel Connectivity change to READY {"system": "grpc", "grpc_log": true}
2024-07-04T08:22:37.660+0200 INFO metrics/worker.go:107 metrics generated {"worker": 0, "metrics": 500}
2024-07-04T08:22:37.660+0200 INFO metrics/worker.go:43 stopping the exporter {"worker": 0}
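The batching adjustment described above boils down to something like this sketch (Python for illustration only; the real change would live in telemetrygen's Go worker, and the batch size of 1000 is the value I hard-coded):

```python
def batched(items, batch_size=1000):
    """Yield fixed-size batches instead of sending one item per request."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# 500000 metrics in batches of 1000 -> 500 export requests,
# matching the otlp.grpc.metrics.request.count of 500 above.
requests = list(batched(range(500000), 1000))
print(len(requests))  # 500
```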
This is also the reason I wondered how you were able to send those 500000 metrics that fast (in just ~2 minutes).
I was not yet able to check the latest main branch of APM; I want to test this as well.
Just an update: I was able to build a new Docker image based on the latest main branch. The test still needs to be done.
If you are only experiencing this issue with metrics, but not logs or traces, here's a possible explanation. In apm-server, in the same otlp batch, under the same resource and scope metrics, we group metric data points by timestamp, data point attributes, and store them as 1 Elasticsearch document.
In my experiment to reproduce the issue, the timestamps are never the same, but it is possible that some data points share the same timestamp, due to reasons like a system clock with lower precision, which triggers the grouping behavior and results in fewer documents than data points. This is expected. You may find the code here.
Feel free to craft an input otlp dataset that has a different timestamp for every data point, and see if documents are still "missing".
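The grouping described above can be illustrated with a short sketch (Python for illustration only, not the actual apm-server implementation; the function name and data layout are made up):

```python
from collections import OrderedDict

def group_data_points(points):
    """Group metric data points into 'documents' keyed by
    (timestamp, attributes). A later data point with the same metric
    name and key overwrites the earlier one, as described above."""
    docs = OrderedDict()
    for p in points:
        key = (p["timestamp"], tuple(sorted(p["attributes"].items())))
        doc = docs.setdefault(key, {})
        doc[p["name"]] = p["value"]  # later value wins on collision
    return list(docs.values())

# 4 data points, but two share the same metric name, timestamp and
# attributes, so only 3 documents come out and one value is lost.
points = [
    {"name": "gen", "timestamp": 1, "attributes": {"host": "a"}, "value": 1.0},
    {"name": "gen", "timestamp": 1, "attributes": {"host": "a"}, "value": 2.0},
    {"name": "gen", "timestamp": 2, "attributes": {"host": "a"}, "value": 3.0},
    {"name": "gen", "timestamp": 1, "attributes": {"host": "b"}, "value": 4.0},
]
docs = group_data_points(points)
print(len(points), "data points ->", len(docs), "documents")  # 4 data points -> 3 documents
```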
Closing this issue for now, as apm-server is behaving as expected.
Hi @carsonip,
OK. I know the basic idea of grouping multiple metrics or data points with the same attributes or labels from the metricbeat prometheus module (see https://github.com/elastic/beats/blob/main/metricbeat/module/prometheus/remote_write/data.go#L60).
And it is, of course, good to group them into a single document where possible. The problem that "we" (and maybe others) have is that it is now unclear whether truly all data points have been processed. What would be good to see from the internal metrics is that, in total, those 500000 data points / metrics have been successfully processed, but that in the end they all live in only 309287 documents. That would already be sufficient. Right now, looking only at those numbers, it looks like only 309287 have "reached" APM, and the otel collector and APM both report that everything is ok.
To my understanding, telemetry-gen only creates 1 data point per metric (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/telemetrygen/internal/metrics/worker.go#L57).
I will try to come up with another test.
I understand your concern. While it appears as if only 309287 data points reached APM, the truth is that only 309287 data points are processed meaningfully, because in a synthetic test like telemetrygen there are data points from the same metric sharing the same timestamp and attributes, and they are effectively silently dropped in the process of grouping. The later data point takes precedence in the current implementation. However, I don't think data points with the same metric name, same timestamp, and same attributes are common in real environments.
That said, I'll check with the team for any area to improve, especially in making it easier to observe if that really happens.
APM Server version (apm-server version): 8.13.3
Description of the problem including expected versus actual behavior:
APM-server seems to drop some metric data points, but we cannot determine which ones or why. We noticed this while testing with otel-collector-contrib running and sending all 3 signals to an APM-server via otlp/gRPC. The number of documents created in elastic did not match the number of metrics generated and passed through the collector. When sending traces or logs, everything seems to work fine.
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including server configuration, agent(s) used, etc. The easier you make it for us to reproduce it, the more likely that somebody will take the time to look at it.
1. APM server running with the monitoring http server enabled
2. otel-collector running and forwarding to APM (currently I do not have a minimal setup, so I try to write an example config here)
3. Send metrics (or traces or logs for cross-checks).
Please note my issue at https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33427 where, at least for me, sending a high volume of logs or metrics was not really possible.
Provide logs (if relevant):
Make sure that the collector and APM server have been restarted (to clear the counters) and that nothing else is sending to them.
1. Generate 500000 metrics.
2. Note the metrics generated {"worker": 0, "metrics": 500000} log line.
3. In Grafana / Prometheus, check the metrics.
4. Connect to the APM server monitoring port (default: 5066) and get the stats (GET apm:5066/stats).
5. Compare the numbers from prometheus with the counters provided by APM server. Also check the number of documents ingested into the final index within elasticsearch.
6. See my logs and pictures from Grafana and APM stats.
otelcol_exporter_sent_metric_points and otelcol_processor_batch_batch_send_size_sum show that in total 500000 metrics have been sent, and all of them were sent in batches. The number of batches is 132.
See the output of the stats: otlp.grpc.metrics.request.count does match 132 (the number of batches), but libbeat.pipeline.events.total is way lower. It is only 332092 (the same as processor.metric.transformations: 332092).
See also the screenshot from elasticsearch (2 documents also seem to be missing).
Please also note: the APM index created in elasticsearch is NOT a TSDS (index.mode: time_series).
If I redo these tests with logs or traces, the number of batches and the total number of traces or logs reported by the OTel collector and APM do match.
Just to summarize: in this test, only 66.1% of the total data points sent have been "seen" (processed?) by APM. We also do not have any kind of filter configured within APM.
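The cross-check in the steps above can be sketched as a small script (illustrative only; the counter paths come from the stats output quoted in this issue, the host/port is the monitoring default, and the function name is made up; note the quoted counters give roughly 66.4%, close to the 66.1% stated in the summary):

```python
def ingest_summary(stats, sent_datapoints):
    """Compare APM server /stats counters with the number of data
    points the generator claims to have sent."""
    apm = stats["apm-server"]
    transformed = apm["processor"]["metric"]["transformations"]
    requests = apm["otlp"]["grpc"]["metrics"]["request"]["count"]
    return {
        "requests": requests,
        "transformed": transformed,
        "ingested_pct": round(100.0 * transformed / sent_datapoints, 1),
    }

# In a live setup the counters would come from the monitoring port:
#   import json, urllib.request
#   stats = json.load(urllib.request.urlopen("http://apm:5066/stats"))
# Here the numbers reported in this issue are plugged in instead.
stats = {"apm-server": {
    "processor": {"metric": {"transformations": 332092}},
    "otlp": {"grpc": {"metrics": {"request": {"count": 132}}}},
}}
print(ingest_summary(stats, 500000))
```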