Closed jlaneve closed 1 month ago
This is small, but a simple starting point would be to move to the LoggingMixin instead of a global logger. I believe this would still be backwards compatible and wouldn't change any behaviors, but it at least gives users the opportunity to override logging behavior, and is overall more idiomatic in the Airflow world.
This issue was reported again by the community: https://apache-airflow.slack.com/archives/C059CC42E9W/p1715739144301439
This was the Slack message body in case we lose the above thread data
Hi everyone,
I'm currently using astronomer-cosmos==0.6.2 and the dbt models run correctly in Astro Airflow. The only problem is that the scheduler logs keep getting bigger and bigger, and almost all of the log lines contain the same info (check the sample below). As I mentioned, the process executes as expected and the diagram renders correctly. Does anyone know how to avoid or fix this? The scheduler log size is almost ~5 GB per day.
[2024-05-14T16:56:57.780+0000] {logging_mixin.py:154} INFO - [2024-05-14T16:56:57.780+0000] {render.py:214} ERROR - Dependency stg_talent_companies not found for model DbtModel(name='stg_goal_processes', type='DbtModelType.DBT_MODEL', path='/usr/local/airflow/dbt/ch_analytics/models/staging/stg_talent/stg_goal_processes.sql', config=DbtModelConfig(config_selectors=set(), upstream_models={'stg_talent_companies', 'talent_innpulsa'}))
Tentatively adding this to the 1.5 milestone
Big post incoming
Users have been seeing duplicate logs and other logging-related issues (see #639, #747, #906) for a while.
In my experience, logging issues have gotten a little out of hand. When running the dev container, I noticed that the logs were being tripled. That was new to me; I didn't recall that happening the last time I worked with Cosmos. After running the main branch of Cosmos with a completely blank slate, here is what my logs look like:
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:26.879-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt deps --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
[2024-05-27T15:13:31.086-0400] {graph.py:76} INFO - (astronomer-cosmos) - Running command: `/Users/daniel/projects/astronomer-cosmos/.venv/bin/dbt ls --output json --project-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp2qtqepf2 --profiles-dir /var/folders/t1/bd17h0253k38syfz01875yf40000gn/T/tmp3dwu7_qs --profile default --target dev`
There are a handful of issues causing this, but I think the simplest way to make progress toward the root problem is to first note how logging is normally done in Python.
Let's say you do this:
# a/b/c.py
import logging

logger = logging.getLogger(__name__)
logger.info("Hello, world!")
The logger's namespace is a.b.c, the logger has no handlers attached, and propagate = True.

This means that the logger.info() call does basically nothing, since there are no handlers.

... Except, what I just said is not entirely true! Because of propagate = True, the logs get passed up the chain of command. The logs in a.b.c get passed up to a.b, then to a, and finally to the root logger, which has no namespace at all (i.e. its name is None), with each ancestor's handlers getting a chance to emit the record. This assumes all the loggers are propagating, which is true by default.
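To make that chain concrete, here is a small self-contained sketch of propagation (the a.b.c name and the StringIO capture are just for illustration):

```python
import io
import logging

# Attach a handler only to the root logger, capturing output in a StringIO.
stream = io.StringIO()
root = logging.getLogger()
root.addHandler(logging.StreamHandler(stream))
root.setLevel(logging.INFO)

# A namespaced logger with no handlers of its own.
logger = logging.getLogger("a.b.c")
logger.info("Hello, world!")

# The record propagated from a.b.c up to the root logger's handler.
print(stream.getvalue())  # -> Hello, world!
```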
The reason logging.getLogger()'s propagate attribute is True by default is that the expected setup looks like this: a root logger (without a namespace) has a StreamHandler() or some other handler attached, and a bunch of loggers with no handlers at all pass their logs up the chain to that root logger. For example, the httpx library makes a bunch of logging.info() calls, but you don't see them until you set up your own root logger.
Full demo:
>>> import logging
>>> import httpx
>>> root = logging.getLogger()
>>> root.info("Hello world")
>>> # Nothing happens: the root logger has no handlers yet
>>> res = httpx.get("https://google.com/")
>>> # Still nothing
>>> root.addHandler(logging.StreamHandler())
>>> root.info("Hello world")
>>> # Still nothing: the root logger's default level is WARNING
>>> root.setLevel(logging.INFO)
>>> root.info("Hello world")
Hello world
>>> # httpx uses a logging.Logger w/o a handler, so it relies on our root logger's handler to print
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
>>> # logging.getLogger() with no name returns the *same* root logger,
>>> # so this attaches a second StreamHandler to it, doubling the logs
>>> root_v2 = logging.getLogger()
>>> root_v2.addHandler(logging.StreamHandler())
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
>>> # Note that httpx logs don't _triple_ here: a logger named "a" is not an ancestor of "httpx", so its handler never sees them.
>>> namespaced_logger = logging.getLogger("a")
>>> namespaced_logger.addHandler(logging.StreamHandler())
>>> res = httpx.get("https://google.com/")
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
HTTP Request: GET https://google.com/ "HTTP/1.1 301 Moved Permanently"
>>> # However, a logger under "a.b" does triple: its records hit the handler on "a" plus the two handlers on root.
>>> namespaced_logger_v2 = logging.getLogger("a.b")
>>> namespaced_logger_v2.info("Hello world")
Hello world
Hello world
Hello world
For context: the justification for the way Cosmos's logging is currently set up is to add a colored (astronomer-cosmos) prefix in front of the logs inside the formatter object, which requires a StreamHandler object. This helps make the scheduler logs a little easier to parse. That said, it has caused a lot of issues for users.
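The pattern described can be sketched roughly like this (the helper name, color code, and format string are illustrative, not the actual Cosmos implementation):

```python
import logging

# Illustrative colored prefix: magenta "(astronomer-cosmos)".
PREFIX = "\x1b[35m(astronomer-cosmos)\x1b[0m"

def get_prefixed_logger(name: str) -> logging.Logger:
    """Hypothetical sketch of a helper that prefixes every record."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(f"{PREFIX} - %(levelname)s - %(message)s"))
    # Each call attaches *another* handler, which is exactly how records
    # end up emitted more than once when combined with propagation.
    logger.addHandler(handler)
    return logger
```

Because the handler is added unconditionally, repeated calls for the same name accumulate handlers, and propagation to ancestor loggers multiplies the output further.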
Under the hood, I would argue the reason Cosmos's logging is a little strange is that it doesn't adhere to the idioms above in a few ways:

- get_logger() creates a StreamHandler for every logger. It would make more sense to create either zero StreamHandlers (the typical approach, e.g. how httpx approaches logging) or at most exactly one, and rely on propagation to the root logger for all the Cosmos sub-modules.
- Some loggers lack a namespace, e.g. in cosmos/__init__.py and cosmos/cache.py. This is actually the reason for the tripling of the logs: without a namespace, you have anywhere up to 3 different loggers that will log from any namespace. Each log in Cosmos should have a namespace.
- A formatter such as logging.Formatter("%(name)s.%(module)s - %(asctime)s - %(name)s.%(module)s:%(lineno)d - %(levelname)s - %(message)s") would expose the fact that Cosmos is the entity sending out particular logging messages.

I'm sorting the proposals below from least controversial to most controversial. They're also ordered in terms of "steps".
Step 1: Loggers must be namespaced, and only the root "cosmos" logger gets a StreamHandler.

There are a few ways to do this, but the simplest is probably to just replace each instance of this:
from cosmos.log import get_logger
logger = get_logger(__name__)
with:
import logging
logger = logging.getLogger(__name__)
Another option would be to do something like if name != "cosmos": ... inside get_logger(), but this is too different from the typical approach of just doing logging.getLogger(__name__).
Thus, the only logger that would get a StreamHandler is the one inside cosmos/__init__.py, or alternatively wherever else you'd like to configure logger = logging.getLogger("cosmos").
This is essentially a refactor, but it makes for a nicer pattern going forward.
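A sketch of that single configuration point, under the assumption that the handler and formatter details stay roughly as they are today:

```python
# cosmos/__init__.py (sketch, not the actual file)
import logging

logger = logging.getLogger("cosmos")
if not logger.handlers:  # configure the package's top-level logger exactly once
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("(astronomer-cosmos) - %(name)s - %(message)s"))
    logger.addHandler(handler)

# Submodules then rely purely on propagation:
sub_logger = logging.getLogger("cosmos.dbt.graph")  # no handler; propagates up to "cosmos"
```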
Step 2: Use the LoggingMixin when appropriate.

Basically, inside task instance executions, the LoggingMixin makes more sense to use than the module-scope logger objects.
Because of how namespacing works, just like step 1, this will ultimately just end up being mostly a refactor: each LoggingMixin's log name is self.__class__.__module__ + '.' + self.__class__.__name__, so e.g. instead of cosmos.operators.local, it becomes cosmos.operators.local.DbtLocalBaseOperator. The logs then propagate up to the cosmos logger, which is the one that ultimately holds the StreamHandler.
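A stdlib-only stand-in shows the naming scheme (Airflow's real LoggingMixin also caches and wraps the logger, but the name follows this shape):

```python
import logging

class LoggingMixinSketch:
    """Minimal stand-in that mimics LoggingMixin's logger naming."""

    @property
    def log(self) -> logging.Logger:
        cls = self.__class__
        return logging.getLogger(f"{cls.__module__}.{cls.__name__}")

class DbtLocalBaseOperator(LoggingMixinSketch):
    """Illustrative subclass; in Cosmos this lives in cosmos.operators.local."""

op = DbtLocalBaseOperator()
# In cosmos.operators.local this name would be
# "cosmos.operators.local.DbtLocalBaseOperator"; here the module prefix differs.
print(op.log.name)
```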
This is where things get a little spicy. The above changes do nothing to fix the core issue of duplicate logs; they just refactor things to be cleaner and a teeny bit more idiomatic. Logs will still duplicate.

The issue ultimately is that we want the scheduler logs to look nicer, but we don't want them to duplicate, and we also don't want the task instance logs to be impacted.
You have two options for this as far as I am concerned:
One way we can achieve this is, for any logs that run as part of the scheduler's work:

- use getLogger("cosmos.dbt")
- set propagate = False on the cosmos.dbt logger
- attach the StreamHandler to the cosmos.dbt logger

This ensures that logs do not get duplicated, but they do get written, and the custom logging stuff is isolated to the scheduler only.
As part of this, you'd also remove the StreamHandler from the getLogger("cosmos") logger.
There are a few other places in the code where the "scheduler" stuff occurs, in addition to cosmos.dbt. There's also cosmos.converter, for example. It's a little all over the place.
Each one of these places where logging is used by the scheduler should either get its own specific namespace, or alternatively each one should just use __name__ and get a StreamHandler:
# cosmos/dbt/__init__.py
from cosmos.log import get_scheduler_logger

logger = get_scheduler_logger(__name__)

# cosmos/dbt/graph.py
import logging

logger = logging.getLogger(__name__)  # <- note: logs propagate to `cosmos.dbt`, so this is fine.
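For illustration, a hypothetical get_scheduler_logger along these lines (the name comes from the snippet above; the body is my sketch, not an existing Cosmos API) could look like:

```python
import logging

def get_scheduler_logger(name: str) -> logging.Logger:
    """Hypothetical helper: a non-propagating logger with its own StreamHandler,
    so scheduler-side logs are written exactly once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("(astronomer-cosmos) - %(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep these records out of root/task-instance handlers
    return logger
```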
Lastly, assertions need to be made (e.g. in tests) that logs are not being duplicated.
This is a lot of extra complexity just to prepend (astronomer-cosmos) to each log line. It's not unreasonable to remove the feature entirely and handle logging more "typically". It's not my place to make that call; my role is mainly to provide feedback and then implement the repository owners' and community's vision. But I'll throw it out there as a suggestion regardless.
Do you have a suggested workaround or fix we can implement in the meantime? We managed to suppress the node_converter warnings flooding the logs, but every scheduler log line is still repeated 3 times, which makes debugging quite inconvenient. Also, some of the scheduler logs from Cosmos seem to show up in unrelated task execution logs, creating confusion.
@gbatiz #1047 puts a band-aid on the issue. Once the next alpha release of 1.5.0 comes out, setting cosmos.propagate_logs = False in the config should get you all set.
That said, I feel like it is still a band-aid. The more I think about it, the more I'm in favor of scrapping the custom logging entirely and ditching the whole (astronomer-cosmos) prefix. But that needs to happen with community input.
Thanks! Personally I also think it would be best to get rid of the whole thing. The small nicety is not worth the complexity.
I'd +1 removing it entirely too.
Closed in #1108
We add colors and other formatting to the logs that make them very difficult to read. This came up in the #airflow-dbt channel on the Airflow Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1711624722823329
We should look at how we do logging today and potentially think about an overhaul. We should be able to remove the colors, duplicate timestamps, etc.