ethercrow / opentelemetry-haskell

The OpenTelemetry Haskell Client https://opentelemetry.io

How to egress directly to a Zipkin/Jaeger instance? #39

Open jkaye2012 opened 3 years ago

jkaye2012 commented 3 years ago

Hello,

I've been evaluating the possibility of using this library. Things seem to work when manually sending the log file to a Jaeger instance, but is it possible to configure the library to automatically stream the event log to a remote Jaeger instance? I think this would be required for any kind of serious use, and should probably be documented. I do see the ZipkinExporter in opentelemetry-extra, but it's unclear to me how one would use that.

If this is possible and we are able to get it working, I'd be happy to open a PR to document the necessary steps.

Thanks, Jordan

ethercrow commented 3 years ago

Hi Jordan!

I agree that this mode of operation is very important for production use. I did some experiments and was able to get it working, with some caveats.

So the operational principle of this library is: the instrumented application writes to the eventlog, and another application reads the eventlog, does some processing, and uploads tracing data to wherever you need.

An interesting point here is that the GHC runtime can write the eventlog into a pipe instead of a file. This way the restreaming application can start reading from that pipe immediately, while the instrumented application is still running. So in a typical enterprise setting you would have a docker image where both processes run simultaneously, one producing the eventlog and one consuming it and sending data to Zipkin, for example.

Now to the caveats:

1) The GHC runtime is not very eager to flush the eventlog, so there is no guarantee on the delay between "event occurred" and "Zipkin was notified about the event".

2) The GHC runtime has an API for flushing the eventlog explicitly that I wanted to use as a workaround for caveat 1, but it doesn't work. There is an MR by the great @bgamari to fix it here: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3073

3) When the instrumented application is using several cores (via -N or -N<explicit-number-of-cores-greater-than-1>), the eventlog portions from different cores are flushed independently and possibly at different points in time. As a result, the restreaming application can observe events out of order, e.g. end span foo can come before begin span foo, because the work corresponding to span foo was started on one core and finished on another, but the second one flushed its eventlog first. This caveat can be mitigated by running the instrumented application with +RTS -N1.

So if these inconveniences are acceptable for you, please give it a try! If not, I'll probably wait a bit longer to see if GHC 9.0 fixes the flushing issue. If it does not, I'll pivot and add another mode of operation where the instrumented application sends trace data directly to a collector service. My previous library, lightstep-haskell, already works like that, but is only compatible with one collector service, namely Lightstep.

Please let me know what you think, and whether waiting for GHC 9.0 would be a viable option or whether you are locked into some particular version of GHC for the foreseeable future.

domenkozar commented 3 years ago

@ethercrow will GHC 9 also address caveat (3)?

ethercrow commented 3 years ago

@domenkozar not directly, but if we gain the ability to flush the eventlog regularly then it would be feasible for the restreaming application to merge the eventlog portions on the fly, effectively having a sorted stream of events. Does this make sense?
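A minimal sketch of the merging idea described above, assuming each core's portion of the eventlog is already sorted by timestamp. The `Ev` type and the names `merge2` and `mergeCaps` are hypothetical and only for illustration; they are not part of this library, and real eventlog events carry more structure.

```haskell
import Data.Word (Word64)

-- Hypothetical, simplified event; not the library's actual event type.
data Ev = Ev
  { evTime :: Word64  -- timestamp in nanoseconds
  , evCap  :: Int     -- capability (core) that emitted the event
  , evMsg  :: String  -- payload, e.g. "begin span foo"
  } deriving (Eq, Show)

-- Merge two streams that are each already sorted by timestamp.
-- On equal timestamps the event from the left stream comes first.
merge2 :: [Ev] -> [Ev] -> [Ev]
merge2 [] ys = ys
merge2 xs [] = xs
merge2 (x:xs) (y:ys)
  | evTime x <= evTime y = x : merge2 xs (y:ys)
  | otherwise            = y : merge2 (x:xs) ys

-- Merge any number of per-capability streams into one sorted stream.
mergeCaps :: [[Ev]] -> [Ev]
mergeCaps = foldr merge2 []
```

For a live stream, this is only safe to emit up to the smallest "latest timestamp seen" across all cores, which is why regular flushing matters: a core that has not flushed yet may still hold earlier events.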

jkaye2012 commented 3 years ago

Sorry for the delayed response.

This does make sense. I think I'll probably wait to try things out until you're able to give this a shot with the newer GHC version.

One question I do have is about the docker setup that you mentioned - my understanding is that it's generally not recommended to run multiple processes within a single docker container (see the first paragraph here: https://docs.docker.com/config/containers/multi-service_container/). It feels like it would be better to somehow wire two (or more) containers together rather than run the "log reader" within the application's container. Thoughts on that?

Thanks, Jordan

ethercrow commented 3 years ago

In my experience having some auxiliary processes in a docker container has always been fine. I read this guidance as "avoid putting unrelated services into one container", not as a rule to only use one process.

dustin commented 3 years ago

Is the restreamer able to compensate for out-of-order events? I've had to reorder eventlog events to make sense of them before, so it's at least possible on a whole stream.

At the very least, the "end before start" thing should be somewhat straightforward, though it would still be confusing to see events attached to spans that have already ended.
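A minimal sketch of handling the "end before start" case, assuming every span has a unique identifier (here just a `String`). The `SpanEv` type and `fixOrder` are hypothetical names for illustration, not the restreamer's actual code. An end event that arrives early is parked until its begin event shows up.

```haskell
import qualified Data.Set as S

-- Hypothetical, simplified span events keyed by a unique span id.
data SpanEv = Begin String | End String
  deriving (Eq, Show)

-- Reorder a stream so that every End follows its Begin.
-- Ends whose Begin never arrives are dropped in this sketch.
fixOrder :: [SpanEv] -> [SpanEv]
fixOrder = go S.empty S.empty
  where
    -- 'open' holds spans that have begun but not ended;
    -- 'parked' holds spans whose End arrived before their Begin.
    go _ _ [] = []
    go open parked (Begin s : rest)
      | s `S.member` parked = Begin s : End s : go open (S.delete s parked) rest
      | otherwise           = Begin s : go (S.insert s open) parked rest
    go open parked (End s : rest)
      | s `S.member` open = End s : go (S.delete s open) parked rest
      | otherwise         = go open (S.insert s parked) rest
```

This does not address the second concern above: other events attached to a span can still show up after that span has been emitted as ended.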

dustin commented 3 years ago

An alternative would be writing code to process the events internally without having to coordinate processes. e.g.:

https://github.com/bgamari/ghc-eventlog-socket and https://github.com/mpickering/eventlog-live

A bit of FFI is required to get a handle on the log, but it would otherwise comfortably fit into the library as is by having the user spawn an event processor thread along with an output processor. It may seem slightly weird to process the events in something that's also creating events, but it should work just fine and be considerably easier to manage.

shlevy commented 2 years ago

Are we using flushEventLog here now?
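For reference, a minimal sketch of what calling flushEventLog regularly could look like in an instrumented application, assuming a base version that exports `flushEventLog` from `Debug.Trace` (base 4.15 and later, i.e. GHC 9.0 and later). `startPeriodicFlush` is a hypothetical helper, not something this library provides, and whether the library itself calls flushEventLog is not established in this thread.

```haskell
import Control.Concurrent (ThreadId, forkIO, threadDelay)
import Control.Monad (forever)
import Debug.Trace (flushEventLog)

-- Fork a background thread that flushes the eventlog at a fixed interval.
-- The interval is in microseconds; the returned ThreadId can be used to
-- stop the thread with killThread.
startPeriodicFlush :: Int -> IO ThreadId
startPeriodicFlush intervalMicros = forkIO . forever $ do
  threadDelay intervalMicros
  flushEventLog
```

Calling `startPeriodicFlush 1000000` early in `main` would request a flush roughly once per second. When the program runs without eventlogging enabled, the flush is documented to do nothing.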

shlevy commented 2 years ago

Is https://gitlab.haskell.org/ghc/ghc/-/wikis/event-log/live-monitoring usable?

m1-s commented 1 year ago

What's the status here? Does anyone have a minimal example of how this works with GHC 9?