getsentry / develop

https://develop.sentry.dev
Other
56 stars 225 forks source link

tracing: PII/correctness problems with including `transaction` in `tracestate` #425

Open lobsterkatie opened 3 years ago

lobsterkatie commented 3 years ago

This and https://github.com/getsentry/develop/issues/424 are a continuation of https://github.com/getsentry/develop/issues/410, to make sure points I brought up there don't get lost now that it's closed.

I and various others have raised concerns about including transaction name in data used for dynamic sampling. The two biggest are about PII and data correctness.

As alluded to in the "Freezing the Context" section of the docs about tracestate headers, transaction name is mutable until a transaction is sent to Sentry, while the contents of the tracestate header is not. If the transaction name changes after the tracestate value has been calculated (which only happens if it's about to be used, either in an outgoing HTTP header for the purposes of propagation, or in the envelope header of a transaction being sent to Sentry), the value in the header won't correctly match the transaction's actual data.

This has two consequences:

1) Filtering based on the final name (which is what users see in the UI) won't work. Users will try to filter on /users/:username/squirrelstats and nothing will happen, because the transactions' tracestate values will instead contain transaction: "/users/maisey/squirrelstats", transaction: "/users/charlie/squirrelstats", and so on and so forth. This seems to defeat the entire purpose of including transaction name to begin with.

2) Because in some frameworks (like Express), the transaction name starts out being the specific URL and ends up being the parameterized version, and because we know the revision might not happen until it's too late, we leave ourselves open to the possibility that the transaction name sent out in the tracestate might contain PII which would otherwise later be correctly scrubbed by event processors before any event data left the SDK. (Of course, if URLs now count as PII, we have bigger problems, seeing as we send the full path of either the current page or the current request as a matter of practice already. The concern was raised, though, so I'm including it here.)

We've thus far said that at least the first problem is "a risk we're willing to take," but the fact is it's not a risk. It's a known, predictable consequence of the way that at least the Express integration* is built. It would take some investigation to find out of the data we need for the final transaction name is available any earlier, but as it currently stands, any Express route handler which causes an outgoing HTTP request (to the database, say) will have its tracestate calculated too early and the value will be wrong.

*Quite possibly others as well. There are definitely other ones which change transaction name mid-stream, but it would take some work to figure out when they do that (relative to things which would cause the tracestate to be calculated and frozen) and whether or not that's fixable if the current timing is too late.

IMHO, if we want this feature to work reliably, we will have to invest resources to audit the SDKs to figure out which ones have this problem (and then of course actually do the work of fixing it). The only other options are to have an only-partially-functional feature, or to drop transaction name from the tracestate.

rhcarvalho commented 3 years ago

Good summary, @lobsterkatie! In our current model transaction names are mutable, would be hard to enforce immutability. Perhaps possible if we compute the tracestate only as late as when the transaction envelope is assembled? That would add an extra conceptual responsibility that is at odds with having custom user-provided transports.

lobsterkatie commented 3 years ago

In our current model transaction names are mutable, would be hard to enforce immutability. Perhaps possible if we compute the tracestate only as late as when the transaction envelope is assembled?

If no HTTP requests have gone out before the transaction is packed into an envelope, this is in fact when tracestate is calculated. But given that route handlers very often have to retrieve data from somewhere, and that "somewhere" is very often only accessible through HTTP...

rhcarvalho commented 3 years ago

Right, we most likely propagate before we capture the transaction.

dcramer commented 3 years ago

In all cases I'm familiar with the transaction name would be semi-final before any external span would be created.

The typical case I have in mind is as simple as:

There may be others, but this is a VERY common case, and solving this is good enough for me. We can always educate users either via docs or via the UI if behaviors change. I think its probably more important to recognize there will be multiple transaction names in a trace and handling that is likely more of a concern to the end-user.

lobsterkatie commented 2 years ago

The typical case I have in mind is as simple as:

  • WSGI/middleware captures PATH from general routing info
  • Framework level handler rewrites to abstract path
  • Normal routines begin

It's not clear that it's always possible to have it run in that order, though, and, as far as our Express integration goes, it does not currently run in that order - the transaction name isn't parameterized until the end of the transaction's lifecycle. Further, we generally tell people that our middleware should be outermost, meaning (depending on the framework) routing may or may not have yet happened. Even if it has, the parameterized route name isn't always exposed in a place we can get to it until later in the request/response cycle.

Unless that can be solved across the board, we're going to end up having a transaction-name filter which is only compatible with some SDKs/integrations and not others. (And at least right now, Express, which is by far and away the most common node backend, would fall into that "other" category.)