estuary / flow

🌊 Continuously synchronize the systems where your data lives, to the systems where you _want_ it to live, with Estuary Flow. 🌊
https://estuary.dev
Other
584 stars 47 forks source link

Connector Log Levels #299

Open saterus opened 2 years ago

saterus commented 2 years ago

I've been doing a lot of debugging with our new logging system. This has exposed a gap in the Airbyte spec related to log level. There's no way to tell a connector what its log level should be. This means that changing a task's log level does not actually cause the connector log level to change. They are stuck at the default level.

Given that there's no way specified in the Airbyte spec, we are left in an uncertain position. We could try and raise an issue in the Airbyte repo with them (to which they may not respond or implement), or we can address this only in our own connectors (which means Airbyte-authored connectors are different, again).

Either way, it would be useful to allow setting the log level. We have a few options for doing this available to us:

  1. Add another extension attribute, estuary.dev/log-level, and allow setting it in a connector config file.
  2. Decide on a well-known environment variable to read in each connector.

In either case, the runtime could launch the connector with the appropriate log level. Changing the shard's log level would cause the connector to restart regardless, so I don't think we'd need to change it on the fly.

psFried commented 2 years ago

@saterus do you have an opinion on which of those two options we should use?

My own 2 cents is that a well-known environment variable (CONNECTOR_LOG_LEVEL) is something that we could both start using now, and also propose adding to the spec. Looking at the current spec, I'm not really seeing a benefit to putting in any of those files, especially since some applications may want to configure logging before attempting to parse them.

jgraettinger commented 2 years ago

A third option would be to use a field that's part of the connector's specific specification? This would maintain compatibility with Airbyte, makes it explicit whether a specific connector is going to even do anything with the value, and builds in flexible semantics and verification of what values may look like (are the INFO vs info? do they allow separate levels by module ? JSON schema can both describe and verify allowed forms) -- which could otherwise be confusing in practice given that it's not a day one built-in.

To be clear, this is a convention only, not a spec extension. It wouldn't be part of the ConfiguredCatalog definition or require a estuary.dev/log-level extension. Connectors would include it in their generated spec under a presumably conventional name, but even that isn't required.

saterus commented 2 years ago

A third option would be to use a field that's part of the connector's specific specification? [ ... ] To be clear, this is a convention only, not a spec extension.

This seems tricky to automate. If we want to control log levels at a shard level, we wouldn't necessarily want/expect the user to add these fields to a catalog yaml.

I'd expect to edit the shard spec directly to modify the log levels (like we do now, but this is another opportunity for a shortcut command) and have that value passed appropriately to the connector. That mechanism could be a well known config value or an env var, but avoiding standardization means we don't know how to pass it. Since any connector may use a number of variations on how they specify the log level ("log_level"/"log.level"/"CONNECTOR_LOG_LEVEL"/etc and INFO vs info), we force the user to manipulate the log levels via the catalog yaml. It isn't something we can easily thread through from the shard spec's level.

do you have an opinion on which of those two options we should use?

No, not a strong one. Either 1 or 2 seem equivalent to me. 1 is more explicit about when a connector would support log levels and can automatically enforce useful values, while 2 will neatly hide the logging concern from the user until they care about it (it won't show up in flowctl discover).

psFried commented 2 years ago

A third option ...

The problem that I see with this is that it adds complexity to setting the log level because you'd actually need to set it in two places. For example, if you set the connector's log level to debug, but left the shard's log level at info, then the connector's log messages would, perhaps unexpectedly, be filtered out.

But I think you could also look at that as a feature, because it enables you to set a more verbose log level on the shard without necessarily getting all that output from the connector. You could also see it as a feature to allow connectors to use different values, such as per-module log filters like moduleA=debug,moduleB=info,warn.

TBH, I'm feeling a little ambivalent about which way we go here. Standardizing on an env variable seems like it would make things easier for most users, who will likely be served just fine by the simpler modeling of log levels. But there's little hope of having all connectors actually respect that env var, and so I could see it being better to keep it explicit in the config. And there's always the option of using both methods, and allowing the log level in the config to override the one from the "standardized" value.

Overall, I'm slightly disposed toward setting a standard env var based on the log level of the shard, but I don't feel very strongly about it.

willdonnelly commented 2 years ago

A bit of context that I thought had already been mentioned here but apparently not: our Go Airbyte connector boilerplate already has some annotations which make the environment variable setting LOG_LEVEL=<whatever> work as desired, so many of our connectors (basically all of the captures, AFAIK) already support that.