Open 1cadumagalhaes opened 4 months ago
Switching to `KubernetesExecutor` might help resolve the issue of tasks being marked as zombie tasks, especially since your Airflow instance is running in a Kubernetes cluster.

Here's an example of how to define a DAG using `DbtRunKubernetesOperator`:
```python
from airflow import DAG
from airflow.utils.dates import days_ago
from cosmos.operators.kubernetes import DbtRunKubernetesOperator

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
}

with DAG(
    dag_id='example_dbt_kubernetes_dag',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,
) as dag:
    run_dbt_task = DbtRunKubernetesOperator(
        task_id='run_dbt_task',
        profile_config=None,      # Add your profile config if needed
        namespace='default',      # Kubernetes namespace
        image='your-dbt-image',   # Docker image for dbt
        cmds=["dbt", "run"],      # dbt command
        name='dbt-run-pod',
        in_cluster=True,
        is_delete_operator_pod=True,
        get_logs=True,
    )

    run_dbt_task
```
Update your `airflow.cfg` to use `KubernetesExecutor`:

```ini
[core]
executor = KubernetesExecutor
```
Restart your Airflow services after making these changes. This should help mitigate the issue of tasks being marked as zombie tasks [1].
Which version of Cosmos are you using?
@1cadumagalhaes could you also confirm the Cosmos `LoadMethod` you're using?
Sorry! I'm using Cosmos 1.3.2 @pankajastro
And I haven't set the `LoadMethod`, so it's automatic. I think it is rendering using `dbt_ls`, based on the logs @tatiana:

```
Unable to do partial parsing because saved manifest not found. Starting full parse.
```
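If you'd rather pin the load method instead of relying on automatic detection, you can set it explicitly via `RenderConfig`. A minimal sketch, assuming a recent Cosmos version where the enum is called `LoadMode` (in `cosmos.constants`) and the `RenderConfig` parameter is `load_method`; the manifest path below is a placeholder:

```python
from cosmos import RenderConfig
from cosmos.constants import LoadMode

# Forcing DBT_LS makes the rendering behavior explicit rather than automatic.
# Alternatively, LoadMode.DBT_MANIFEST parses a pre-compiled manifest.json,
# which avoids invoking `dbt ls` during DAG parsing.
render_config = RenderConfig(
    load_method=LoadMode.DBT_LS,
)
```

Pass this `render_config` to your `DbtDag` / `DbtTaskGroup` so the logs no longer depend on which method Cosmos auto-selects.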
> Sometimes it is marked as zombie task even if it did execute correctly
Since this is not persistent, I feel the DAG is being converted correctly, so there might be an issue with the scheduler, as you have pointed out. But do you see any unusual task instance (ti) config for the zombie tasks? Also, we have released 1.4.3, which introduces some performance improvements. I'm not sure whether it will address this issue, but would you like to give it a try?
Yeah I believe this is a memory issue.
"Zombie tasks" is the error message you get on Astronomer when your Celery worker is SIGKILLed.
This happens because Cosmos spins up subprocesses on Celery workers, and each of those subprocesses can exceed 100 MiB. On top of the overhead of the Celery worker itself, this adds up quickly, and SIGKILLs occur very frequently.
IMO, my 2 cents: we really need to add a more opinionated guide on this. Especially now that there is an invocation mode that runs dbt directly in the Python process instead of spawning a new subprocess, and now that dbt-core's dependency pins are no longer incompatible with Airflow 2.9's, Astronomer users can easily avoid this. But there is no guide that tells people how to do that, let alone one that opinionatedly steers them toward it. Cosmos is confusing, and users tend to do what they are told, so we should tell them to do the things that don't cause these sorts of issues.
@dwreeves, any (opinionated) suggestions in terms of required resources per worker / tasks per worker / ...?
@w0ut0 My most opinionated suggestion is to use the `DBT_RUNNER` invocation mode. I'm sure the Astronomer people would like it if I suggested using bigger workers, since that's more money for them 😂😅 (although real talk, if any Astronomer people are reading this, "high memory" workers would be a great feature!), but you can avoid a ton of overhead by not spawning multiple subprocesses.
Thanks, will test if it makes a significant improvement in our tasks without increasing the worker resources!
Not sure if this should be here, but it happens mostly with Cosmos DAGs.
My Airflow instance is in a Kubernetes cluster, so this might be a problem with the scheduler, not with Cosmos itself. I'm using `CeleryExecutor` right now; I might switch these DAGs to `KubernetesExecutor` to see if that works.
But anyway, if anyone knows how to help, it would be great.