Closed jlaneve closed 1 month ago
This is small, but a simple starting point would be to move to the LoggingMixin instead of a global logger. I believe this would still be backwards compatible and wouldn't change any behaviors, but it at least gives users the opportunity to override logging behavior, and is overall more idiomatic in the Airflow world.
This issue was reported again by the community: https://apache-airflow.slack.com/archives/C059CC42E9W/p1715739144301439
This was the Slack message body in case we lose the above thread data
Hi everyone,
I'm currently using astronomer-cosmos==0.6.2 and the dbt models run correctly in Astro Airflow. The only problem is that the scheduler logs keep getting bigger and bigger, and almost all of the log lines contain the same info (check the sample below). As I mentioned, the process executes as expected and the diagram renders correctly. Does anyone know how to avoid or fix this? The scheduler log size is almost ~5 GB per day.
[2024-05-14T16:56:57.780+0000] {logging_mixin.py:154} INFO - [2024-05-14T16:56:57.780+0000] {render.py:214} ERROR - Dependency stg_talent_companies not found for model DbtModel(name='stg_goal_processes', type='DbtModelType.DBT_MODEL', path='/usr/local/airflow/dbt/ch_analytics/models/staging/stg_talent/stg_goal_processes.sql', config=DbtModelConfig(config_selectors=set(), upstream_models={'stg_talent_companies', 'talent_innpulsa'}))
Tentatively adding this to the 1.5 milestone
Big post incoming
Users have been seeing duplicate logs and other logging-related issues (see #639, #747, #906) for a while.
In my experience, logging issues have gotten a little out of hand. When running the dev container, I noticed that the logs were being tripled. That was new to me; I didn't recall that happening the last time I worked with Cosmos. After running the main branch of Cosmos with a completely blank slate, here is what my logs look like:
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
There are a handful of issues causing this, but I think the simplest way to make progress toward the root problem is to first note how logging is normally done in Python.
Let's say you do this:
# a/b/c.py
import logging

logger = logging.getLogger(__name__)
logger.info("Hello, world!")
The logger's namespace is a.b.c, the logger has no handlers attached, and propagate = True.

This means that the logger.info() call does basically nothing, since there are no handlers.

... Except, what I just said is not entirely true! Because of propagate = True, the logs get passed up the chain of command. The logs in a.b.c get passed up to a.b, then to a, and finally to the root logger, which has no namespace at all (i.e. its name is None), with each ancestor's handlers getting a chance to emit the record. This assumes all the loggers are propagating, which is true by default.
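To make that chain concrete, here is a small self-contained sketch of propagation (the a.b.c name and the StringIO capture are just for illustration):

```python
import io
import logging

# Attach a handler only to the root logger, capturing output in a StringIO.
stream = io.StringIO()
root = logging.getLogger()
root.addHandler(logging.StreamHandler(stream))
root.setLevel(logging.INFO)

# A namespaced logger with no handlers of its own.
logger = logging.getLogger("a.b.c")
logger.info("Hello, world!")

# The record propagated from a.b.c up to the root logger's handler.
print(stream.getvalue())  # -> Hello, world!
```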
The reason logging.getLogger()'s propagate attribute is True by default is that the expected setup looks like this: a root logger (without a namespace) has a StreamHandler() or some other handler attached, and a bunch of loggers with no handlers at all pass their logs up the chain to that root logger. For example, the httpx library makes a bunch of logging.info() calls, but you don't see them until you set up your own root logger.
Full demo:
>>> import logging
>>> import httpx
>>> root = logging.getLogger()
>>> root.info("Hello world")
>>> # Nothing happens: the root logger has no handlers yet
>>> res = httpx.get("https://google.com/")
>>> # Still nothing
>>> root.addHandler(logging.StreamHandler())
>>> root.info("Hello world")
>>> # Still nothing: the root logger's default level is WARNING
>>> root.setLevel(logging.INFO)
>>> root.info("Hello world")
Hello world
>>> # httpx uses a logging.Logger w/o a handler, so it relies on our root logger's handler to print
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
>>> # logging.getLogger() with no name returns the *same* root logger,
>>> # so this attaches a second StreamHandler to it, doubling the logs
>>> root_v2 = logging.getLogger()
>>> root_v2.addHandler(logging.StreamHandler())
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
>>> # Note that httpx logs don't _triple_ here: a logger named "a" is not an ancestor of "httpx", so its handler never sees them.
>>> namespaced_logger = logging.getLogger("a")
>>> namespaced_logger.addHandler(logging.StreamHandler())
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
>>> # However, a logger under "a.b" does triple: its records hit the handler on "a" plus the two handlers on root.
>>> namespaced_logger_v2 = logging.getLogger("a.b")
>>> namespaced_logger_v2.info("Hello world")
Hello world
Hello world
Hello world
For context: the justification for the way Cosmos's logging is currently set up is to add a colored (astronomer-cosmos) prefix in front of the logs inside the formatter object, which requires a StreamHandler object. This helps make the scheduler logs a little easier to parse. That said, it has caused a lot of issues for users.
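The pattern described can be sketched roughly like this (the helper name, color code, and format string are illustrative, not the actual Cosmos implementation):

```python
import logging

# Illustrative colored prefix: magenta "(astronomer-cosmos)".
PREFIX = "\x1b[35m(astronomer-cosmos)\x1b[0m"

def get_prefixed_logger(name: str) -> logging.Logger:
    """Hypothetical sketch of a helper that prefixes every record."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(f"{PREFIX} - %(levelname)s - %(message)s"))
    # Each call attaches *another* handler, which is exactly how records
    # end up emitted more than once when combined with propagation.
    logger.addHandler(handler)
    return logger
```

Because the handler is added unconditionally, repeated calls for the same name accumulate handlers, and propagation to ancestor loggers multiplies the output further.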
Under the hood, I would argue the reason Cosmos's logging is a little strange is that it doesn't adhere to the idioms above in a few ways:

- get_logger() creates a StreamHandler for every logger. It would make more sense to create either zero StreamHandlers (the typical approach, e.g. how httpx approaches logging) or at most exactly one, and rely on propagation to the root logger for all the Cosmos sub-modules.
- Some loggers lack a namespace, e.g. in cosmos/__init__.py and cosmos/cache.py. This is actually the reason for the tripling of the logs: without a namespace, you have anywhere up to 3 different loggers that will log from any namespace. Each log in Cosmos should have a namespace.
- A formatter such as logging.Formatter("%(name)s.%(module)s - %(asctime)s - %(name)s.%(module)s:%(lineno)d - %(levelname)s - %(message)s") would expose the fact that Cosmos is the entity sending out particular logging messages.

I'm sorting the proposals below from least controversial to most controversial. They're also ordered in terms of "steps".
Step 1: Loggers must be namespaced, and only the root "cosmos" logger gets a StreamHandler.

There are a few ways to do this, but the simplest is probably to just replace each instance of this:
from cosmos.log import get_logger
logger = get_logger(__name__)
with:
import logging
logger = logging.getLogger(__name__)
Another option would be to do something like if name != "cosmos": ... inside get_logger(), but this is too different from the typical approach of just doing logging.getLogger(__name__).
Thus, the only logger that would get a StreamHandler is the one inside cosmos/__init__.py, or alternatively wherever else you'd like to configure logger = logging.getLogger("cosmos").
This is essentially a refactor, but it makes for a nicer pattern going forward.
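A sketch of that single configuration point, under the assumption that the handler and formatter details stay roughly as they are today:

```python
# cosmos/__init__.py (sketch, not the actual file)
import logging

logger = logging.getLogger("cosmos")
if not logger.handlers:  # configure the package's top-level logger exactly once
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("(astronomer-cosmos) - %(name)s - %(message)s"))
    logger.addHandler(handler)

# Submodules then rely purely on propagation:
sub_logger = logging.getLogger("cosmos.dbt.graph")  # no handler; propagates up to "cosmos"
```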
Step 2: Use the LoggingMixin when appropriate.

Basically, inside task instance executions, the LoggingMixin makes more sense to use than the module-scope logger objects.
Because of how namespacing works, just like step 1, this will ultimately just end up being mostly a refactor: each LoggingMixin's log name is self.__class__.__module__ + '.' + self.__class__.__name__, so e.g. instead of cosmos.operators.local, it becomes cosmos.operators.local.DbtLocalBaseOperator. The logs then propagate up to the cosmos logger, which is the one that ultimately holds the StreamHandler.
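A stdlib-only stand-in shows the naming scheme (Airflow's real LoggingMixin also caches and wraps the logger, but the name follows this shape):

```python
import logging

class LoggingMixinSketch:
    """Minimal stand-in that mimics LoggingMixin's logger naming."""

    @property
    def log(self) -> logging.Logger:
        cls = self.__class__
        return logging.getLogger(f"{cls.__module__}.{cls.__name__}")

class DbtLocalBaseOperator(LoggingMixinSketch):
    """Illustrative subclass; in Cosmos this lives in cosmos.operators.local."""

op = DbtLocalBaseOperator()
# In cosmos.operators.local this name would be
# "cosmos.operators.local.DbtLocalBaseOperator"; here the module prefix differs.
print(op.log.name)
```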
This is where things get a little spicy. The above changes do nothing to fix the core issue of duplicate logs; they just refactor things to be cleaner and a teeny bit more idiomatic. Logs will still duplicate.

The issue ultimately is that we want the scheduler logs to look nicer, but we don't want them to duplicate, and we also don't want the task instance logs to be impacted.
You have two options for this as far as I am concerned:
One way we can achieve this is, for any logs that run as part of the scheduler's work:

- use getLogger("cosmos.dbt")
- set propagate = False on the cosmos.dbt logger
- attach the StreamHandler to the cosmos.dbt logger

This ensures that logs do not get duplicated, but they do get written, and the custom logging stuff is isolated to the scheduler only.
As part of this, you'd also remove the StreamHandler from the getLogger("cosmos") logger.
There are a few other places in the code where the "scheduler" stuff occurs, in addition to cosmos.dbt. There's also cosmos.converter, for example. It's a little all over the place.
Each one of these places where logging is used by the scheduler should either get its own specific namespace, or alternatively each one should just use __name__ and get a StreamHandler:
# cosmos/dbt/__init__.py
from cosmos.log import get_scheduler_logger

logger = get_scheduler_logger(__name__)

# cosmos/dbt/graph.py
import logging

logger = logging.getLogger(__name__)  # <- note: logs propagate to `cosmos.dbt`, so this is fine.
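For illustration, a hypothetical get_scheduler_logger along these lines (the name comes from the snippet above; the body is my sketch, not an existing Cosmos API) could look like:

```python
import logging

def get_scheduler_logger(name: str) -> logging.Logger:
    """Hypothetical helper: a non-propagating logger with its own StreamHandler,
    so scheduler-side logs are written exactly once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("(astronomer-cosmos) - %(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep these records out of root/task-instance handlers
    return logger
```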
Lastly, assertions need to be made (e.g. in tests) that logs are not being duplicated.
This is a lot of extra complexity just to prepend (astronomer-cosmos) to each log line. It's not unreasonable to remove the feature entirely and handle logging more "typically". It's not my place to make that call; my role is mainly to provide feedback and then implement the repository owners' and community's vision. But I'll throw it out there as a suggestion regardless.
Do you have a suggested workaround or fix we can implement in the meantime? We managed to suppress the node_converter warnings flooding the logs, but every scheduler log line is still repeated 3 times, which makes debugging quite inconvenient. Also, some of the scheduler logs from Cosmos seem to show up in unrelated task execution logs, creating confusion.
@gbatiz #1047 puts a band-aid on the issue. Once the next alpha release of 1.5.0 comes out, setting cosmos.propagate_logs = False in the config should get you all set.
That said, I feel like it is still a band-aid. The more I think about it, the more I'm in favor of scrapping the custom logging entirely and ditching the whole (astronomer-cosmos) prefix. But that needs to happen with community input.
Thanks! Personally I also think it would be best to get rid of the whole thing. The small nicety is not worth the complexity.
I'd +1 removing it entirely too.
Closed in #1108
We add colors and other formatting to the logs that make them very difficult to read. This came up in the #airflow-dbt channel on the Airflow Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1711624722823329
We should look at how we do logging today and potentially think about an overhaul. We should be able to remove the colors, duplicate timestamps, etc.