A longer log can be seen here: s3-plugin.log.gz
Hi @tarvip! Thank you for your feedback. Are you using this with the jaeger all-in-one strategy or the production strategy?
2021-11-12T20:05:07.183Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Unimplemented desc = unknown service plugin.GRPCStdio"
is normal. Just ignore that. I'm interested to know which strategy you used and whether the configuration is right.
Also, can you add a tls_handshake_timeout configuration (0 means don't time out)?
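In case it helps, here is a minimal Go sketch of what such an option could map to on the client side, assuming the plugin's S3/HTTP transport is built on net/http; in the standard library a TLSHandshakeTimeout of zero already means no timeout. The helper newHTTPClient is made up for illustration, not part of jaeger-s3.

```go
package main

import (
	"net/http"
	"time"
)

// newHTTPClient builds an HTTP client with a configurable TLS handshake
// timeout. In net/http, a TLSHandshakeTimeout of 0 means no timeout,
// matching the "0 means don't time out" semantics asked for above.
func newHTTPClient(tlsHandshakeTimeout time.Duration) *http.Client {
	transport := &http.Transport{
		TLSHandshakeTimeout: tlsHandshakeTimeout, // 0 disables the timeout
	}
	return &http.Client{Transport: transport}
}

func main() {
	// Hypothetical configuration value: 0 never times out the handshake;
	// any positive duration enforces a limit.
	client := newHTTPClient(0)
	_ = client
}
```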
Hi @tarvip! My guess is that you're running the all-in-one strategy.
My initial finding is that switching to the production strategy helps reduce memory usage and in turn prevents the jaeger-s3 process from being killed.
Actually, I was running the production strategy; I have separate collector and query pods. Initially I tried with 1 collector pod, as that is fine when using Elasticsearch as storage, but I also tried with 4 collector pods. The pods had no CPU limits (no throttling whatsoever). Looking at the Elasticsearch indexing rate, it is about 40 ops/s.
@tarvip Thanks for the information. So you were running Elasticsearch along with jaeger-s3 side by side?
It's quite odd that the logs show the crash happened while you were writing to the object storage. In my case, the fix I did earlier was on the reading side of the object storage. I'll take a look at this when I can.
So you were running Elasticsearch along with jaeger-s3 side by side?
No, I disabled Elasticsearch in the collector when testing the jaeger-s3 plugin. I also tried 1.1.3 a bit earlier, same issue, then I saw that 1.1.4 was also available.
I see. I am hoping to pinpoint the exact issue: is the high memory usage coming from jaeger-query or the collector (you didn't mention it earlier), and were you querying from the Jaeger UI (or some UI like Grafana) at the same time?
The fixes I did were for object storage reads (queries). If you're talking about the collector pod, it means I missed something on the object storage writes (to S3).
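To make the distinction concrete, here is a rough sketch with made-up interface names (not the actual Jaeger storage API or the jaeger-s3 code) of why the read and write paths are independent, so a fix on the query side would not touch a leak on the collector's write side:

```go
package main

import "context"

// span and trace are simplified stand-ins for the real Jaeger model types.
type span struct{ traceID string }
type trace struct{ spans []span }

// objectStoreWriter is the path the collector exercises: every received
// span ends up as an S3 write. A memory problem here shows up in the
// collector pod.
type objectStoreWriter interface {
	WriteSpan(ctx context.Context, s span) error
}

// objectStoreReader is the path jaeger-query exercises: it only reads
// back what was already written. The earlier fixes applied to this side.
type objectStoreReader interface {
	GetTrace(ctx context.Context, traceID string) (trace, error)
}

// plugin wires both paths to the same bucket, but they are separate code
// paths; fixing reads does not change write behaviour.
type plugin struct {
	writer objectStoreWriter
	reader objectStoreReader
}

func main() {
	_ = plugin{} // illustration only; no real S3 client is constructed here
}
```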
I have not encountered the collector problem (my problem was that jaeger-query was crashing because of Cortex), but I will try to talk to my team about stress testing it on a busy cluster to collect and write tracing data to S3.
I haven't enabled jaeger-s3 in query; I wanted to get trace writing to S3 working in the collector first. Memory usage starts increasing pretty fast, so it crashes quite soon (less than 1 minute) after startup. My suspicion is that it is unable to write traces to S3 fast enough, and because of that memory usage keeps growing.
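If that suspicion is right, the symptom would match an unbounded in-memory buffer sitting between span ingestion and the S3 uploads. Here is a minimal Go sketch, not the plugin's actual code, of how a bounded queue turns a slow uploader into backpressure (or dropped spans) instead of ever-growing memory; the boundedWriter and enqueue names are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// span stands in for whatever the plugin buffers before writing to S3.
type span struct{ id string }

// boundedWriter queues spans for an uploader goroutine. Because the channel
// has a fixed capacity, a slow uploader causes enqueue failures (or
// blocking, if preferred) rather than unbounded memory growth.
type boundedWriter struct {
	queue chan span
}

var errQueueFull = errors.New("span queue full, dropping span")

func newBoundedWriter(capacity int, upload func(span) error) *boundedWriter {
	w := &boundedWriter{queue: make(chan span, capacity)}
	go func() {
		// Drain the queue and hand each span to the upload function.
		for s := range w.queue {
			if err := upload(s); err != nil {
				fmt.Println("upload failed:", err)
			}
		}
	}()
	return w
}

// enqueue refuses new spans once the queue is full instead of buffering
// without limit; that is the backpressure an unbounded buffer lacks.
func (w *boundedWriter) enqueue(s span) error {
	select {
	case w.queue <- s:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	upload := func(s span) error { return nil } // stand-in for an S3 PutObject call
	w := newBoundedWriter(1000, upload)
	_ = w.enqueue(span{id: "abc"})
}
```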
@tarvip Yes, in version 1.2.0 I refactored the writer code a bit so that there is only one type of data for LogQL labels, which cuts down writer requests to S3.
I've also made the code follow the Loki data structure, so you can do complex queries in Loki on the Jaeger data written by jaeger-s3.
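To illustrate the idea, here is a sketch under assumptions, not the actual 1.2.0 writer: keeping a single, fixed label set lets the writer accumulate entries into one Loki-style stream and flush them as one larger object write instead of one request per label combination. The stream and flush names below are made up for illustration.

```go
package main

import "fmt"

// entry is a single encoded span line, Loki-style: one stream identified by
// a fixed label set, with many entries appended to it.
type entry struct {
	timestamp int64
	line      string
}

// stream batches entries under one label set so a flush produces a single
// object write instead of one request per label combination.
type stream struct {
	labels  string // fixed label set, so there is only one stream
	entries []entry
}

func (s *stream) append(ts int64, line string) {
	s.entries = append(s.entries, entry{timestamp: ts, line: line})
}

// flush hands the whole batch to a single writer call (an S3 object write
// in the real plugin) and resets the buffer.
func (s *stream) flush(write func(labels string, batch []entry) error) error {
	if len(s.entries) == 0 {
		return nil
	}
	err := write(s.labels, s.entries)
	s.entries = s.entries[:0]
	return err
}

func main() {
	s := &stream{labels: `{app="jaeger"}`}
	s.append(1, `{"traceID":"abc","operation":"GET /"}`)
	s.append(2, `{"traceID":"def","operation":"POST /items"}`)
	_ = s.flush(func(labels string, batch []entry) error {
		fmt.Printf("writing %d entries for %s in one request\n", len(batch), labels)
		return nil
	})
}
```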
I hope you will try it out and see if it fixes your collector issue. I didn't have a really busy Kubernetes cluster to try it out on because we're only tracing Istio.
But I will look into this again, because I will leverage the code provided by Loki to write to S3 instead (only internally, inside jaeger-s3; you won't need to run Loki).
@muhammadn thank you for your hard work. I haven't tried the new version because I switched over to Grafana Tempo; Tempo also has search support now, and I think it is more future-proof.
@tarvip I am reopening this issue with a fix for the memory problem.
@muhammadn sorry, I don't have the time and resources to test it; we are already using Grafana Tempo and so far it works well.
Hi.
Thank you for creating this plugin. I tried to use it and it somewhat works: I can see data in the S3 bucket, but then memory usage starts increasing until eventually the process gets killed.
Also, there is an interesting error in the log:
Then later:
I tried version 1.1.4 with jaeger 1.28.0.