flitnetics / jaeger-objectstorage

Jaeger plugin for object storage datastore
Apache License 2.0
47 stars · 7 forks

Plugin stops working #49

Closed · tarvip closed this 2 years ago

tarvip commented 2 years ago

Hi.

Thank you for creating this plugin. I tried to use it, and it partly works: I can see data in the S3 bucket, but then memory usage starts increasing until the process eventually gets killed.

There is also an interesting error in the log:

2021-11-12T15:55:43.227Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Unimplemented desc = unknown service plugin.GRPCStdio"

Then later:

2021-11-12T15:57:23.286Z [DEBUG] plugin process exited: path=/plugin/jaeger-s3 pid=17 error="signal: killed"

I tried plugin version 1.1.4 with Jaeger 1.28.0.

tarvip commented 2 years ago

A longer log can be seen here: s3-plugin.log.gz

muhammadn commented 2 years ago

Hi @tarvip! Thank you for your feedback. Are you using this with the Jaeger all-in-one strategy or the production strategy?

The log line

2021-11-12T20:05:07.183Z [DEBUG] stdio: received EOF, stopping recv loop: err="rpc error: code = Unimplemented desc = unknown service plugin.GRPCStdio"

is normal; just ignore it. I'm interested to know which strategy you used and whether the configuration is right.

Also, can you add a tls_handshake_timeout setting to your configuration? (0 means no timeout.)

muhammadn commented 2 years ago

Hi @tarvip! My guess is that you're running the all-in-one strategy.

My initial finding is that switching to the production strategy helps reduce memory usage and in turn prevents the jaeger-s3 process from being killed.

tarvip commented 2 years ago

Actually, I was running the production strategy; I have separate collector and query pods. Initially I tried with 1 collector pod, since that is fine when using Elasticsearch as storage, but I also tried with 4 collector pods. The pods had no CPU limits (no throttling whatsoever). Looking at the Elasticsearch indexing rate, it is about 40 ops/s.

muhammadn commented 2 years ago

@tarvip Thanks for the information. So you were running Elasticsearch along with jaeger-s3 side by side?

It's quite weird that the logs suggest the crash happened while you were writing to object storage. In my case, the fix I did earlier was on the reading side of object storage. I'll take a look at this when I can.

tarvip commented 2 years ago

So you were running Elasticsearch along with jaeger-s3 side by side?

No, I disabled Elasticsearch in the collector when testing the jaeger-s3 plugin. I also tried 1.1.3 a bit earlier (same issue), and then I saw that 1.1.4 was also available.

muhammadn commented 2 years ago

So you were running Elasticsearch along with jaeger-s3 side by side?

No, I disabled Elasticsearch in the collector when testing the jaeger-s3 plugin. I also tried 1.1.3 a bit earlier (same issue), and then I saw that 1.1.4 was also available.

I see. I am hoping to pinpoint the exact issue: is the high memory usage coming from jaeger-query or the collector? (You didn't mention it earlier.) And were you querying from the Jaeger UI (or some UI like Grafana) at the same time?

The fixes I did were for object storage reads (queries). If you're talking about the collector pod, it means I missed something in the object storage writes (to S3).

I have not encountered the collector problem (my problem was jaeger-query crashing because of Cortex), but I will talk to my team about stress testing it on a busy cluster that collects and writes tracing data to S3.

tarvip commented 2 years ago

I haven't enabled jaeger-s3 in the query service; I wanted to get trace writing to S3 working in the collector first. Memory usage starts increasing pretty fast, so it crashes quite soon (less than 1 minute) after startup. My suspicion is that it is unable to write traces to S3 fast enough, and because of that memory usage keeps growing.
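The hypothesis above (spans arriving faster than they can be flushed to S3, so the in-memory queue grows without bound) is a classic unbounded-buffer problem. One common mitigation is a fixed-capacity buffer that drops and counts spans once full instead of queueing indefinitely. A minimal sketch of that idea, with hypothetical types that do not reflect the plugin's actual code:

```go
package main

import "fmt"

// span is a stand-in for a trace span; the real plugin's types differ.
type span struct{ id int }

// boundedWriter sketches one way to cap memory when the S3 writer cannot
// keep up: buffer spans in a fixed-size channel and drop (and count) new
// spans once the buffer is full, instead of growing memory without limit.
type boundedWriter struct {
	buf     chan span
	dropped int
}

func newBoundedWriter(capacity int) *boundedWriter {
	return &boundedWriter{buf: make(chan span, capacity)}
}

// enqueue reports whether the span was buffered. In a real writer, a
// background goroutine would drain buf and batch-upload to S3.
func (w *boundedWriter) enqueue(s span) bool {
	select {
	case w.buf <- s:
		return true
	default: // buffer full: drop rather than queue unboundedly
		w.dropped++
		return false
	}
}

func main() {
	w := newBoundedWriter(2)
	for i := 0; i < 5; i++ {
		w.enqueue(span{id: i})
	}
	fmt.Println(len(w.buf), w.dropped) // prints "2 3"
}
```

Dropping data is lossy, of course; the trade-off is a collector that stays up under load instead of being OOM-killed.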

muhammadn commented 2 years ago

@tarvip Yes, I refactored the writer code a bit in version 1.2.0 so there is only one type of data for LogQL labels, which cuts down writer requests to S3.

I've also made the code follow the Loki data structure, so you can run complex queries in Loki against the Jaeger data written by jaeger-s3.
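The "fewer writer requests" idea can be illustrated as follows: entries that share a label set (loosely following Loki's model of label sets identifying streams) are batched and written as one object per group, instead of one request per entry. This is a sketch with hypothetical names, not the plugin's actual code:

```go
package main

import "fmt"

// entry is a stand-in for one span/log line; labels identify its stream,
// loosely following Loki's label-set model. The label values shown in
// main are illustrative, not the plugin's real labels.
type entry struct {
	labels string
	line   string
}

// groupByLabels batches entries that share a label set so each batch can
// be uploaded as a single S3 object/request instead of one request per
// entry, cutting down the total number of writer requests.
func groupByLabels(entries []entry) map[string][]string {
	batches := make(map[string][]string)
	for _, e := range entries {
		batches[e.labels] = append(batches[e.labels], e.line)
	}
	return batches
}

func main() {
	in := []entry{
		{`{app="jaeger"}`, "span-1"},
		{`{app="jaeger"}`, "span-2"},
		{`{app="istio"}`, "span-3"},
	}
	out := groupByLabels(in)
	fmt.Println(len(out), len(out[`{app="jaeger"}`])) // prints "2 2"
}
```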

I hope you will try it out and see if it fixes your collector issue. I didn't have a really busy Kubernetes cluster to test on because we're only tracing Istio.

But I will look into this again, because I plan to use the code provided by Loki to write to S3 instead (you won't need Loki itself; jaeger-s3 will just leverage Loki internally).

tarvip commented 2 years ago

@muhammadn thank you for your hard work. I haven't tried the new version because I switched over to Grafana Tempo; Tempo also has search support now, and I think it is more future-proof.

muhammadn commented 2 years ago

@tarvip I am reopening this issue with a fix for the memory problem.

tarvip commented 2 years ago

@muhammadn sorry, I don't have the time or resources to test it; we are already using Grafana Tempo and so far it works well.

muhammadn commented 2 years ago

@tarvip No problem. I'm close to shipping a fix. We're not using Tempo, though; we're using Kiali, which uses Jaeger.