airbytehq / PyAirbyte

PyAirbyte brings the power of Airbyte to every Python developer.
https://docs.airbyte.com/pyairbyte
Other
176 stars 20 forks source link

💡 Feature Request: Enable live logging when reading data #229

Open johanznsdn opened 1 month ago

johanznsdn commented 1 month ago

As discussed here: https://airbytehq.slack.com/archives/C06FZ238P8W/p1715109561242409

It would be very useful to have live logs for easy debugging. Deciding to enable them or not via a parameter, for example (source.read(cache=cache, logs=True)).

aaronsteers commented 1 month ago

@johanznsdn - Thanks for raising! I'll put some thoughts here for discussion...

I think we're going to want a few different logging "levers":

Distinct Loggers

I'm not set on these specifically, but it does seem like we'd want some granularity, such as could be provided by the below.

  1. airbyte - A top-level logger for PyAirbyte. Setting defaults here applies them also to the below.
  2. airbyte.source - Passthrough of STDERR log messages from the source.
  3. airbyte.read - Messages about the status of a read operation.
  4. airbyte.cloud - Messages about cloud functionality or API interop.
  5. airbyte.core - Anything else not covered in the above.

Log Levels

The standard logging levels are:

  1. ERROR
  2. WARNING
  3. INFO
  4. DEBUG

Log Writers / Handlers

The two classic options are:

  1. STDERR writer (terminal)
  2. Log file writer

Reasonable Defaults

On-By-Default Behaviors

Warnings in the Terminal (STDERR)

By default, I think we can turn everything on for the STDERR handler at the WARNING and above level. This is more-or-less how we have PyAirbyte configured today.

File-Per-Read covering airbyte.source (STDERR passthrough)

We don't capture or print the Source's log output as of now. We could add a file-based logger today without much downside. Basically, every time we invoke a source connector, we'd create a new file to store its log output. If anything fails, we can print the path to this file for the user's convenience.

Global Log File per Session

It might be worthwhile to consider a global file logger - for everything except the source log passthrough described above. This could be configured at the INFO level by default.

New Logged Events for read() processing

Everything up until now is relatively boring/obvious in terms of providing needed visibility and basic user control for logging operations. But now, a hard question is: "What should we log during read() operations?"

I'll start with events that are a more obvious "yes", followed by those that ware "maybes":

  1. The start of important events and phases, such as:
    1. When writing our first and last record from the source. ("Read started" and "Read complete".)
    2. When completing the loading of the stream to the cache. ("Write completed.)
  2. Per-stream start and completion events:
    1. Start and completion of "read" operations for each stream. (Not tracked as of now, but could be added.)
    2. Start and completion of cache "write" operations for each stream. (Tracked but not logged.)
  3. Metrics on # batches or records written. (This is harder to get right. Some sources commit only a few hundred batches over their duration and others may generate 100+ batches per second. If we could keep this info as DEBUG level and maybe not needed as of yet, it avoids having to calibrate noise-vs-signal ratio.)

Of the above, only 3 seems controversial or difficult to implement. Unfortunately, 1.1 is not currently available, but all others in bullets 1 and 2 would be pretty easy additions.

Path forward

@johanznsdn - Since you logged this issue, I would greatly appreciate your thoughts on what is most beneficial to you.

johanznsdn commented 1 month ago

@aaronsteers Thank you so much! Bullets 1 and 2 seem great to me!

aaronsteers commented 1 month ago

Great, thanks for confirming @johanznsdn. In that case, we'll focus on just those first two logging events, and (nicely) we don't need to worry (yet!) about 10s of thousands of log lines being generated. 😄

I can't commit to a timeline as of now, but I'm adding labels Accepting Pull Requests and Good First Issue in case anyone wants to pick this up. It's a frequent request so we'll try to prioritize it sooner than later...

cc @bindipankhudi for visibility

aaronsteers commented 4 days ago

@vspanxcode - You mentioned you'd like to pick this up for Hackathon. Drop a comment here and I'll go ahead and assign it to you.

vspanxcode commented 4 days ago

@aaronsteers please assign me this one

aaronsteers commented 4 days ago

✅ Done - thanks!

Lmk if you run into any issues or if you have any questions.