evanderkoogh / otel-cf-workers

An OpenTelemetry compatible library for instrumenting and exporting traces for Cloudflare Workers
BSD 3-Clause "New" or "Revised" License
238 stars 50 forks source link

Missing spans #58

Closed mattfysh closed 1 year ago

mattfysh commented 1 year ago

Reproduction: https://github.com/mattfysh/worker_do_otel

Looking into an issue with spans not appearing in honeycomb

The first is 100% reproducible, where if you make any number of followup request from Worker A to Worker B, only the spans from the initial call to Worker B appear. In this instance, honeycomb does not display the "Missing spans" warning.

image

The second is intermittent but highly-reproducible, where some spans from Worker B may appear as missing. In this instance, honeycomb displays the "Missing spans" warning.

You should be able to reproduce by deploying the linked repo, but let me know if you run into any issues or are unable to see the issue. I've also listed some smaller tangential issues in the repo readme.

evanderkoogh commented 1 year ago

Hey @mattfysh, I think that should be fixed now with 1.0.0-rc.13, but haven't had a chance to run it through your reproduction repos.. will see if I have some time to do that later tonight.

mattfysh commented 1 year ago

It's looking much better, thanks! The service bindings are auto-instrumented now which is great, however there is still the issue where subsequent calls to the same DO in a single request are not appearing:

image

I've updated my repro repo here: https://github.com/mattfysh/worker_do_otel/commit/5e76ffbddafa73ac8d82037955f3943d4680656b

evanderkoogh commented 1 year ago

Oh crap.. I think I may have found the issue with that.. That is a great catch. It is probably an issue in the BatchTraceSpanProcessor.. going to have a think on how to solve that :)

mattfysh commented 1 year ago

If it helps, adding a 1000ms delay between requests gets the subsequent calls to worker-b and DO to be instrumented (image below).

Using a 500ms delay and only the 1st and 3rd requests were fully instrumented

image

Commit: https://github.com/mattfysh/worker_do_otel/commit/d77e2df6ff1d0da697c45f0994d5749199c2c2b6

evanderkoogh commented 1 year ago

Yeah, that makes sense given the thing I am looking at. If you start a span in the DO, that is not a child of the first span that was created by it, before that sub tree is done exporting, you run into this problem.

evanderkoogh commented 1 year ago

I am always a bit skeptical when a change to the core logic of something so critical like the SpanProcessor works in one go, but look at this beauty!

Screenshot 2023-09-20 at 13 16 20
mattfysh commented 1 year ago

Thanks Erwin! I've never been more happy to see a fully populated trace 💯

I'm now repurposing the repro to implement an upstash-based drop in replacement for durable objects, because DOs are very slow in Australia.

If you need to continue using the repo for any reason, you can switch off this flag:

https://github.com/mattfysh/worker_do_otel/blob/cd1d85680f2a80ba2d799fc6e04d957ab430be6b/worker-b/index.ts#L9