getsentry / team-sdks

A meta repository for tracking work across all SDK teams.
0 stars 0 forks source link

Cramer's Experience #64

Open smeubank opened 7 months ago

smeubank commented 7 months ago

Cramer's Experience

The Setup

Four services, all in TypeScript:

Running @sentry/node-experimental wherever possible, otherwise generally latest version of relevant SDKs.

The Objective

We want perfect instrumentation. That means:

The Numerous Issues

Traces are really fucking hard to instrument

We rely way too much on magical instrumentation, and while I know this is changing, it hasn't changed fast enough.

1) I'm using an experimental SDK because not only is it our future direction, but the auto instrumentation is superior.

2) Its not obvious how to actually use our SDK in many cases, because I can't cleanly always use OTel's documentation. For example, I don't ever create 'tracers', but instead call Sentry's startSpan. Our docs contain completely separate sections on how to actually implement tracing, and none of them are complete.

3) Sentry's implementation sometimes looks similar to OTel but isn't compatible. For example, startSpan has different args than OTel's implementation. Additionally I believe OTel flips the meaning of startSpan. We should change our terminology to avoid confusing developers.

4) startSpan requires a callback, but not all code is callback driven. For example, my CLI uses commander, which uses a standard hook pattern (e.g. beforeCommand and afterCommand). I can't easily use a callback to instrument that. If I try to use startInactiveSpan it actually doesn't associate with the current parent (at least per our docs) so that has even more issues.

// this does not work correctly at all because of the missing parent
program
  .name("peated")
  .description("CLI for assisting with Peated")
  .hook("preSubcommand", async (thisCommand, subcommand) => {
    (thisCommand as MutatedCommand).span = Sentry.startInactiveSpan({
      op: "command",
      name: `commander.${subcommand.name()}`,
    });
  })
  .hook("postAction", async (thisCommand) => {
    await shutdownClient();
    if ((thisCommand as MutatedCommand).span) {
      (thisCommand as MutatedCommand).span.end();
    }
  });```

5) Continuing traces is needlessly complicated. OTel's APIs aren't any better here. We need obvious APIs to get the current traceparent, and to pass in a traceparent. These should be available as high level interfaces and not have complex abstractions.

For example, to grab the current context:

```typescript
// pull out traceparent to forward to faktory job
const activeContext = {};
propagation.inject(context.active(), activeContext);

await client.job(jobName, args, { traceContext: activeContext }).push();

And injecting it is even worse:

const activeContext = propagation.extract(
    context.active(),
    args.length > 1
    ? (
        args[1] as {
            traceContext: {
            traceId: string;
            baggage: any;
            };
        }
        ).traceContext
    : {},
);

return context.with(activeContext, async () => {
    return await Sentry.startSpan(
        //...

6) Using op and name, I believe per OTel's recs, does not play nicely with Sentry. We group on name and ignore op, which means we create transaction summaries that have mixed operations and makes things less actionable.

    {
      op: "publish",
      name: `faktory.${jobName.toLowerCase()}`,
    },

Transaction is deprecated and y'all keep ignoring that we have an error monitoring product

scope.setTransaction binds event.transaction, which exists for very explicit reasons. It makes it much faster to triage an error as a human when the culprit is shown as myCoolRouteName than it does with openai.ts(lib:foo). Additionally that same transaction gives me reasonable aggregates if it happens on different routes, and in an ideal world, maps up to the same name as the span (meaning I can cross reference data).

https://peated.sentry.io/issues/5007350914/?query=is%3Aunresolved&referrer=issue-stream&stream_index=17

There are additional problems here that have continued to exist for fucking ages, such as needing to call two different APIs to update the current transaction name, and now we have deprecated setTransaction without implementing a valid replacement in the experimental SDK.

In general, transaction MUST exist on errors, and almost certainly should be tightly coupled to the span (in which the span should probably always exists, rather than having two discrete code paths).

SourceMaps still suck

I can't even be asked to try to spend more time to set these up because its never fun and the debug timelime often takes hours. Not sure if I have any feedback here other than this is very unfun and it'd be a lot easier if I could just publish .map files and Sentry can pick up the slack.

Trace explorer is useless

  1. Half the time I can't get our instrumentation right, and it requires perfection instrumentation.

  2. Even if I get instrumentation right, I will likely get sampled, and again, the trace explorer requires perfect instrumentation

Its broken by design, and more importantly, it makes me think our product doesn't work. If I were a real customer doing a POC, I would have used a competitor by this point because of how broken the product looks.

I need spans to debug JavaScript

Stack traces are useless more and more frequently, and spans give you the picture of inputs. That is, a valid trace and span tree tells me how an error happened, and is almost required to be able to debug modern javascript errors.

Except the span tree is buried in a trace explorer which is empty half the time or otherwise broken, and then error page itself is filled with upsells instead of just rendering the information that helps me debug the error.

Replays are not the answer, and should be reserved for outlier conditions that are harder to debug (let alone, they are not useful on backend errors in many situations).

My feed is filled with performance issues

I literally only care about errors 99% of the time. I want to seek out performance issues, not be forced to triage them non-stop.

trpc errors suck

https://peated.sentry.io/issues/4630767004/?query=is%3Aunresolved&referrer=issue-stream&stream_index=2

TRPCClientError
[
  {
    "received": "United States",
    "code": "invalid_enum_value",
    "options": [
      "Afghanistan",
      "Albania",
      "Algeria",
      "American Samoa",
      "Andorra",
      "Angola",
      "Anguilla",
      "Antarctica",
      "Ant...

So do Zod errors and a bunch of other shit.

https://peated.sentry.io/issues/5007350836/?query=is%3Aunresolved&referrer=issue-stream&stream_index=10

Breadcrumbs are useless in so many contexts

They end up with a loop of information because they're a buffer and not a tree. They need to have a hard sit down on why they continue to exist when a span tree is often the right solution to diagnostics.

For example:

for item in list:
  do_a_thing_that_gracefully_fails_and_logs()

Each of those will pollute breadcrumbs even tho most crumbs are only relevant to the specific function call instance.

wtf is Object.<anonymous>

https://peated.sentry.io/issues/4546294867/?query=is%3Aunresolved&referrer=issue-stream&stream_index=13

wtf is Object captured as exception with keys: data, error, internal, status

https://peated.sentry.io/issues/5010841858/?query=is%3Aunresolved&referrer=issue-stream&stream_index=14

Transactions are shit all over the place even with OTel's better instrumentation

https://peated.sentry.io/issues/5006626006/?query=is%3Aunresolved&referrer=issue-stream&statsPeriod=14d&stream_index=0

This should somehow have a route name or url or something, and it should have it because of OTel's superior instrumentation, but seemingly I'm missing a trace and I've got an absolute trash error.

### SDK API 🎯 
- [ ] Our docs are incomplete, unclear what is true from OTel and what you CAN do with our API 🎯 _`Should we expose more of OTel's API? And can we adopt a more exhaustive approach to documenting ours?`_
- [ ] we do not maintain the same args as OTel while some of our APIs are same/similar  `startSpan` 🎯 _`Similar to above: are we moving away from OTel too much?`_
- [ ] callbacks vs standard hook patterns problem 🎯 _`Again is this something we need to look at the OTel API and enable? Do we have a similar use case with our tools?`_
- [ ] Using `op` and `name`, I believe per OTel's recs, does not play nicely with Sentry. We group on `name` and ignore `op`, which means we create transaction summaries that have mixed operations and makes things less actionable. 🎯 _`can we get reference to specifics of OTel here?`_ 🎯 _`this leads to txn names and breaking error links and display name in Performance, detail below`_
- [ ] How to update txn name? 🎯 _`do we just need a setOpName or setTransactionName itself, i think we have a category list of APIs and names`_
- [ ] Transactions are shit all over the place even with OTel's better instrumentation** 🎯  _`need further investigation data quality`_
### Issues Platform
- [ ] Stack traces are useless more often 🐛 _`why are they useless? I think this comments get's into if STs are still relevnat or more often we NEED spans/traces`_
- [ ] error page itself is filled with upsells instead of just rendering the information that helps me debug the error.  🐛  _`issue details page declutter project`_ 🙏
- [ ] My feed is filled with performance issues  🐛 _`perf issue noise`_
- [ ] trpc errors suck  🐛 _`otel gives us intsrumention but just performance, meaning SDK data might not be best for errors`_
- [ ] Zod errors and a bunch of other shit.  🐛 ``
- [ ] Breadcrumbs are useless in so many contexts  🐛 _`i also always found breadcrumbs to be inconsistent and repetitive but customers like them supposedly`_
- [ ] wtf is Object.<anonymous>  🐛 _`is there anything we can do with anonymous? Can we inform the user better? Like with figma, detect and inform that is a platform issue`_