MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Improve tracing filter support for platform #19843

Closed guswynn closed 1 year ago

guswynn commented 1 year ago

Two main upgrades we should perform (see https://materializeinc.slack.com/archives/CM7ATT65S/p1686343764202809?thread_ts=1686336740.198459&cid=CM7ATT65S for context):

pH14 commented 1 year ago

Change the dynamic filtering to use LaunchDarkly, instead of http requests.

In my PoC using LD/ServerVars to manage the filtering works great, except for the awkward interim where tracing is initialized before the catalog, so we don't have access to the default value immediately on startup. It's also awkward to keep the existing --log-filter param when it can now be set via --system-parameter-default / clusterd also takes in a --log-filter... not quite sure how to handle those situations. Open to any suggestions!

cc @benesch

guswynn commented 1 year ago

@pH14 I think translating the --log-filter into a --system-parameter-default is fine, and if both are passed, ignore the --log-filter

pH14 commented 1 year ago

Hm.. that's also turning out to have some tricky/subtle behavior because --system-parameter-default sets the default value for a SystemVar, but if the SystemVar is already set to a non-default value, changing the default won't be picked up until the var is reset which feels non-obvious.

I kind of wonder if we want to deprecate --log-filter to be a real in-case-of-emergency-break-glass type knob that overrides the filter on startup (giving us the desired log level between init and building the catalog), but may be overwritten by LD /ServerVar updates at any time? Not sure...

pH14 commented 1 year ago

Eh... talking myself into it, I think --log-filter will realistically only be used in local dev, and setting the initial SystemVar default + being able to toggle it dynamically in SQL should be sufficient for most use cases. There can be weird edges (you set a dynamic value, restart envd with --log-filter and briefly see the right log lines before it re-syncs with the catalog and undoes the change) but hopefully few and far between.

guswynn commented 1 year ago

deprecating --log-filter seems totally reasonable to me!

benesch commented 1 year ago

I like that plan quite a bit, too. If you really need to set the log filter on startup and you can't get startup to proceed to the point where you can run ALTER SYSTEM or sync with LD and then restart, you can always edit the stash with the stash debug tool.

pH14 commented 1 year ago

What I'm leaning towards here is splitting the filtering into two components:

  1. mz_log_filter / mz_opentelemetry_filter system vars
  2. --startup-log-filter / --startup-opentelemetry-filter that apply before (1) is available. For envd this is the time until the catalog is accessed, for clusterd this is the time before it receives a compute/storage command with the updated filter system vars.

(1) makes it easy to manage the filters in the steady state, and (2) is for emergencies / folks who need to debug pre-catalog/rpc startup issues. We then deprecate and/or remove --log-filter, since it's an awkward middle ground that's not quite (1) nor (2) -- my vote is to just rip off the bandaid and remove --log-filter with an informative error message, but open to other softer approaches here.

I think this approach makes the nuances of filtering clear, and it also solves some tricky implementation-detail questions I've run into when trying to keep --log-filter around and when trying to only use the system vars. Open to any feedback on this setup!

benesch commented 1 year ago

That's an absolutely great idea. I love it.

On Thu, Jul 6, 2023 at 3:17 PM Paul Hemberger @.***> wrote:

What I'm leaning towards here is splitting the filtering into two components:

  1. mz_log_filter / mz_opentelemetry_filter system vars
  2. --startup-log-filter / --startup-opentelemetry-filter that apply before (1) is available. For envd this is the time until the catalog is accessed, for clusterd this is the time before it receives a compute/storage command with the updated filter system vars.

(1) makes it easy to manage the filters in the steady state, and (2) is for emergencies / folks who need to debug pre-catalog/rpc startup issues. We then deprecate and/or remove --log-filter, since it's an awkward middle ground that's not quite (1) nor (2) -- my vote is to just rip off the bandaid and remove --log-filter with an informative error message, but open to other softer approaches here.

I think this approach makes the nuances of filtering clear, and it also solves some tricky implementation-detail questions I've run into when trying to keep --log-filter around. Open to any feedback on this setup!

— Reply to this email directly, view it on GitHub https://github.com/MaterializeInc/materialize/issues/19843#issuecomment-1624191636, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXSIFLGSQUZ475BGTFH2TXO4FMDANCNFSM6AAAAAAZBDRYWE . You are receiving this because you were mentioned.Message ID: @.***>

pH14 commented 1 year ago

Rolled out EnvFilters + system vars, and https://github.com/MaterializeInc/materialize/pull/19973 to make sure our spans are well-behaved. All set here!

guswynn commented 1 year ago

thanks for pushing this through @pH14 !!