Add debuggability feature for specific request with `debug` key

erickhun commented 4 years ago

Adding a "debuggability" key to make it easier to engineers to see what happen with specific requests. Discussion here

Why?

Engineers will be able to add , and no matter the level of verbosity set (with LOG_LEVEL) in the services. The logs will appear in datadog logs with the value engineers arbitrarily give. For example:

with https://service.com?debug=A_RANDOM_STRING

Engineers will be able to see all logs across services if they search A_RANDOM_STRING in datadog logs , so they can isolate that given request, and remove noise.

This feature will need to be implemented in the BuffLog javacript library too to make sure to make the most of it.

Review

Up to have your eyes on this @esclapes @colinscape ? ( and cc @josem @karelm since I think you'll certainly make a use of that feature with the publish-api x) )

note: This is will be merge into #13

karelm commented 4 years ago

Heya @erickhun, I think that Colin and Edu both raised questions around this that resonated with me, but I thought I'd ask if there was any intentions to move this forward, or whether this was just going to remain paused for the time being?

erickhun commented 4 years ago

Since it is showing interest, let's try to see if that worth pushing ! @colinscape @esclapes @karelm

When do we want that feature?

I think we can prfioritize that feature if we see the need of it. We left it in pause since we didn't really see anyone expressing they needed it. Is it something that will be useful now? Any example on how it will make things easier for you?

Security aspect:

I'm unsure about the best solution solve the risk of "attacker" using that key, and us ending up with tons of unwanted logs. 1 way would be to add a variable environment defining the "debug-ability key". Something such as LOG_DEBUG_KEY=my_secret_debug_key and that feature will be trigger by https://service.com?my_secret_debug_key=value. But we would end up with all services that will need that key to show be able to show the log? The other way @colinscape mentioned, is a VPN check. The challenge I'm seeing is that we will need to specify the IP somewhere , and know how that IP is sent from service to services inside the cluster?

If anyone has a better idea :)

RPC calls: where will be this parameter ?

Thinking about this, I was considering our buffer-rpc calls that use HTTP query params for method and args. In this case, where do you see the new debug query parameter being included?:

inside args with the rest of method arguments,

or a top level, next to method and args

@esclapes I was thinking inside args with the rest of method arguments, so it consistent if that will be if we trigger that from a browser. I'm not too sure how RPC calls are made for other service, open to change that if that make more sense .

harder to set a custom header when using the browser.

As you said I think it's harder to set an HTTP header, me assuming that most of the case will be trigger in the browser?

colinscape commented 4 years ago

Realistically, I don't think the security aspect is too important - the chances of someone using that are pretty minimal, and if it does happen I'd say it should be pretty straightforward to add the secret like you mention @erickhun - so that to debug a single request you'd need to know the secret (and that would be relatively simple to change ideally with a single change in kubesecret and corresponding change in the DD indexing rules).

I terms of the usefulness, I'd defer to the product engineers. I can also see value in a mechanism like this (once we have fuller logging in place) to be able to see potential areas for improvement. For example, it might be enlightening to see how many database calls are made on a page load 😅 . Or how many external http requests are made.

karelm commented 4 years ago

Thanks for this @erickhun I think that the secret key approach would be nice, the only small issue we have with publish is the RPC request in the first place... but i can imagine with the new graphql endpoint this could be really useful to trace how a single request fans out from the service. I'm not too bothered about rpc events relative to other more useful features in future. If we can make those rpc events also behave the same way, then great, like Colin said it would be really interesting to see how many db call's are made just to render a page. This will hopefully lead to easy win's for us to improve performance overall for the website + show us our bottlenecks more clearly.

esclapes commented 4 years ago

Hey @erickhun @colinscape answering to some of the questions:

I don't think the security aspect is too important - the chances of someone using that are pretty minimal

I agree. :100: I see the main issue on how to really trigger this properly, like @karelm shared:

I think that the secret key approach would be nice, the only small issue we have with publish is the RPC request in the first place

For instance, the first GET request we make is the one that we would trigger by using the browser's URL bar and, thus, the one that you could modify with this extra parameter. However, this request would only load the Single Page App (HTML with React app) while most of the "interesting" requests are triggered from the app as AJAX requests (i.e. RPC to getFeatures).

To get this extra parameter sent in the "interesting" requests, you would need the app (or the underlying http client) to check the URL parameters in the browser and decide if it needs to include it in every subsequent request. If this is a secret key, how does the single page app know it? It would probably need some pattern matching and then let the back-end double-check it.

@esclapes I was thinking inside args with the rest of method arguments, so it consistent if that will be if we trigger that from a browser.

I would approach this as two separate questions:

how we trigger the debug mode
how we transmit the debug mode between services

For transmitting, I think a header would be the most semantic.

Regarding triggering, I have a thought experiment for you (if you have the time :smirk: )

I'd love for us to do a thought experiment and look at this from a different angle: Instead of making this an on-demand feature, we turn it around and make it our default: Instead of linking requests across services when a magic parameter is included, we could just do it all the time and use a convention that would work for our different layers

We already have several examples provided by our third party tools that we can use as inspiration:

DataDog offers a trace_id which has some magic baked in as the it is magically shared across services and builds our APM traces. However, traces are sampled and the client is not aware of the trace ID corresponding to their request, so it wouldn't fully solve this issue.
Cloudflare, on the other hand, generates a cf-request-id which is provided to our services as a header and it's returned to the client as a reponse header

The basic steps are to build our own would be:

Generate a x-buffer-request-id at the edge (api-gateway, rpc server, publish-api etc) and include it as a header to upstream services. We could easily use cf-request-id if available and only generate our own whenever we don't have it
Update our rpc-client and other libraries used for inter-service communication to take this header and relay it upstream in every request. Also update Bufflog to include this header.
Include this x-buffer-request-id as a response header so that we can just use it to search for our request in logs.

I think that the harder part is the second bullet (updating internal communication libraries), but reading through our options above it feels to me that we would need to do that anyways with the ad-hoc request identification and this always-on approach would have some extra benefits:

No need to handle secrets and magic parameter. Being always on means that as a developer you can just use the app as usual and be able to search for any request.
The x-buffer-request-id could be shown to customers on some specific errors to make it easier for them to report issues and for us to debug them. I have already seen advocates using c-xray shared by customer in their reports.

If I am understanding the original goal of this we would miss a key part of it, the always-on approach would not solve for increasing verbosity level in all services, to address this I would consider the following:

Increasing the verbosity may not always be necessary: in many cases the issues will be logged as errors or warnings, which are logged and indexed by default, so being able to search for a request could be enough.
We would still need a solution when more verbosity level is required. Maybe a magic parameter like proposed above would be the most appropriate but maybe a command via kubectl could also work to avoid it being exposed to the public. In any case, I think this would complement the always-on request tracing instead of being an alternative.

I am sorry for the long post :sweat_smile: and specially if this does not match the current direction. I just wanted to share this idea as I feel it could solve some of the use cases and I would personally find it very useful.

Thank you for reading and feel free to go ahead with the current approach :+1: :hugs:

colinscape commented 4 years ago

Amazing thoughts, @esclapes - thanks so much for taking the time to share them! 🤗

I do like the idea of customers being able to give some sort of trace id for us to use (even potentially create a tool for advocates to 'download' the full trace for the customer and save for later analysis in a bug report, for example).

Another possibility is that maybe production doesn't provide that ability to increase the log level on a per-request basis. Instead perhaps a different ('logging'?) environment has that built in and logs at a more verbose level by default. 🤔 We would then only make the request in the different environment when we want a deluge of logs for a particular request. 😀

bufferapp / php-bufflog