apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Using the @task.external_python() to call a task in a venv fails due to attempting to import airflow and other packages not in the external venv #40760

Closed · taranlu-houzz closed this issue 2 months ago

taranlu-houzz commented 2 months ago

Apache Airflow version

2.9.2

If "Other Airflow 2 version" selected, which one?

No response

What happened?

It seems like Airflow is trying to run the full DAG script inside the virtualenv, which does not have airflow or pendulum installed in it.

What do you think should happen instead?

Based on the documentation and other examples I have found online, I would expect only the code inside the decorated function to run in the virtualenv.
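For reference, this is a minimal sketch of the pattern I understand the docs to describe (the venv path here is hypothetical, just for illustration): only the body of the decorated function should execute in the external interpreter, so its imports live inside the function, while the top-level airflow/pendulum imports should only ever be evaluated by the parsing interpreter.

```python
import pendulum
from airflow.decorators import dag, task

# Hypothetical interpreter path; in my setup this is HZ_VENV_PYTHON_PATH
# from the Dockerfile below.
VENV_PYTHON = "/opt/venv/bin/python"


@dag(schedule=None, start_date=pendulum.today("UTC"), catchup=False)
def expected_behavior_sketch():
    @task.external_python(task_id="run_in_venv", python=VENV_PYTHON)
    def run_in_venv(name: str) -> str:
        # Expectation: only this function body runs under VENV_PYTHON, so
        # only imports made here need to exist in the venv. The top-level
        # airflow/pendulum imports should stay in the worker interpreter.
        return f"hello, {name}"

    run_in_venv("world")


expected_behavior_sketch()
```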

How to reproduce

Operating System

macOS: 13.6.7 (22G720)

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

Customized Dockerfile:

```dockerfile
FROM apache/airflow:2.9.2

ENV PATH="/root/.local/bin:${PATH}"
ENV TZ="America/Los_Angeles"

ARG DEBIAN_FRONTEND="noninteractive"
ARG HZ_WORKDIR="/home/airflow"

WORKDIR ${HZ_WORKDIR}

# ------------------------------------------------------------------------------------------------ #

# NOTE: Need to use root due to how the airflow base is configured.
USER root

RUN apt update

# Install base deps
# NOTE: The `build-essential` lib has some .so that are needed by `bpy`.
RUN apt install -y \
    build-essential \
    neovim

# Install Python and pipx (default system python version: 3.10.6)
RUN apt install -y \
    pipx \
    python3-venv

# Install Blender runtime dependencies
RUN apt install -y \
    libegl1 \
    libgl1-mesa-glx \
    libsm6 \
    libxfixes3 \
    libxi-dev \
    libxkbcommon0 \
    libxrender1 \
    libxxf86vm-dev

USER airflow

# ------------------------------------------------------------------------------------------------ #

# NOTE: Each version of `bpy` supports a specific version of Python.
ENV HZ_BPY_VERSION=4.1.0
ENV HZ_PYTHON_VERSION=3.11

# Install pipx, pdm, and the Blender compatible version of Python
RUN pipx install pdm
RUN pdm python install cpython@${HZ_PYTHON_VERSION}

# Create Blender venv
ENV HZ_VENV_PATH="${HZ_WORKDIR}/blender_venv"
ENV HZ_VENV_PYTHON_PATH="${HZ_VENV_PATH}/bin/python"
RUN \
    python_path="$(pdm python list | sed -n 's/.*(\(.*\))/\1/p' | head -n 1)" && \
    "${python_path}" -m venv "${HZ_VENV_PATH}"
RUN \
    python_path="$(pdm python list | sed -n 's/.*(\(.*\))/\1/p' | head -n 1)" && \
    "${HZ_VENV_PYTHON_PATH}" -m pip install --upgrade pip setuptools
RUN \
    python_path="$(pdm python list | sed -n 's/.*(\(.*\))/\1/p' | head -n 1)" && \
    "${HZ_VENV_PYTHON_PATH}" -m pip install bpy==${HZ_BPY_VERSION}

# ------------------------------------------------------------------------------------------------ #

# Silly macOS Docker workaround for incorrect /proc/cpuinfo
COPY ./fakefopen.c ${HZ_WORKDIR}/
RUN cat /proc/cpuinfo >> fake_cpuinfo
RUN echo "cpu MHz : 2345.678" >> fake_cpuinfo
RUN gcc -Wall -fPIC -shared -o fakefopen.so fakefopen.c -ldl
ENV LD_PRELOAD=${HZ_WORKDIR}/fakefopen.so
```

The fakefopen.c workaround (which wouldn't be needed in an actual deployment):

```c
#define _GNU_SOURCE
#define FAKE "/home/airflow/fake_cpuinfo"

#include <stdio.h>
#include <string.h>
#include <dlfcn.h>

/* Interpose fopen() so that reads of /proc/cpuinfo are redirected to the
 * fake file baked into the image (Docker on macOS reports it incorrectly). */
FILE *fopen(const char *path, const char *mode)
{
    FILE *(*original_fopen)(const char *, const char *);
    original_fopen = dlsym(RTLD_NEXT, "fopen");

    if (strcmp(path, "/proc/cpuinfo") == 0) {
        return (*original_fopen)(FAKE, mode);
    } else {
        return (*original_fopen)(path, mode);
    }
}
```
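A quick way to confirm the shim is active (a hedged sketch, not part of the deployment): call libc's fopen through ctypes from a process started with LD_PRELOAD set and look for the fake `cpu MHz` line.

```python
# Sketch: verify the fopen() interposer from inside the container.
import ctypes

libc = ctypes.CDLL(None)
libc.fopen.restype = ctypes.c_void_p
libc.fopen.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libc.fgets.restype = ctypes.c_char_p
libc.fgets.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.c_void_p]
libc.fclose.argtypes = [ctypes.c_void_p]

fp = libc.fopen(b"/proc/cpuinfo", b"r")
buf = ctypes.create_string_buffer(256)
while libc.fgets(buf, 256, fp):
    if buf.value.startswith(b"cpu MHz"):
        # With the shim active, this should print the fake "2345.678" entry.
        print(buf.value.decode().strip())
libc.fclose(fp)
```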

The test dag:

```python
import os

import pendulum

from airflow.decorators import (
    dag,
    task,
)

HZ_VENV_PYTHON_PATH: str = os.environ.get("HZ_VENV_PYTHON_PATH")


@dag(
    schedule=None,
    start_date=pendulum.today("UTC"),
    # start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["blender", "test"],
)
def blender_test():
    """A basic test to use Blender via a virtualenv with bpy."""

    @task.external_python(
        task_id="create_blend_file",
        python=HZ_VENV_PYTHON_PATH,
    )
    def create_blend_file() -> str:
        """Create and save a simple blend file."""
        import bpy

        out_file_path = "/tmp/monkey.blend"
        bpy.ops.mesh.primitive_monkey_add()
        bpy.ops.wm.save_as_mainfile(filepath=out_file_path)
        return out_file_path

    @task.external_python(
        task_id="read_blend_file_and_render",
        python=HZ_VENV_PYTHON_PATH,
    )
    def read_blend_file_and_render(blend_file_path: str) -> str:
        """Read the blend file and render it."""
        import bpy

        bpy.ops.wm.open_mainfile(filepath=blend_file_path)
        bpy.context.scene.render.image_settings.file_format = "PNG"
        output_file_path = "/tmp/monkey.png"
        bpy.context.scene.render.filepath = output_file_path
        bpy.ops.render.render(write_still=True)
        return output_file_path

    @task.bash
    def rename_render(render_file_path: str) -> str:
        """Use bash to rename the rendered png file."""
        return f"mv {render_file_path} /tmp/monkey_renamed.png"

    blend_file_path = create_blend_file()
    render_file_path = read_blend_file_and_render(blend_file_path)
    rename_render(render_file_path)


blender_test()
```

This is the log from the Airflow worker container that shows the error:

```
BACKEND=redis
DB_HOST=redis
DB_PORT=6379

[2024-07-12T14:19:48.342-0700] {configuration.py:2087} INFO - Creating new FAB webserver config file in: /opt/airflow/webserver_config.py

 -------------- celery@fdf54261b43e v5.4.0 (opalescent)
--- ***** -----
-- ******* ---- Linux-6.6.31-linuxkit-x86_64-with-glibc2.36 2024-07-12 14:19:58
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         airflow.providers.celery.executors.celery_executor:0x2aaab7d9d490
- ** ---------- .> transport:   redis://redis:6379/0
- ** ---------- .> results:     postgresql://airflow:**@postgres/airflow
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> default          exchange=default(direct) key=default

[tasks]
  . airflow.providers.celery.executors.celery_executor_utils.execute_command

[2024-07-12 14:19:53 -0700] [72] [INFO] Starting gunicorn 22.0.0
[2024-07-12 14:19:53 -0700] [72] [INFO] Listening at: http://[::]:8793 (72)
[2024-07-12 14:19:53 -0700] [72] [INFO] Using worker: sync
[2024-07-12 14:19:53 -0700] [74] [INFO] Booting worker with pid: 74
[2024-07-12 14:19:53 -0700] [76] [INFO] Booting worker with pid: 76
[2024-07-12 14:20:02,573: WARNING/MainProcess] /home/airflow/.local/lib/python3.12/site-packages/celery/worker/consumer/consumer.py:508: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine whether broker connection retries are made during startup in Celery 6.0 and above. If you wish to retain the existing behavior for retrying connections on startup, you should set broker_connection_retry_on_startup to True.
  warnings.warn(
[2024-07-12 14:20:02,642: INFO/MainProcess] Connected to redis://redis:6379/0
[2024-07-12 14:20:02,648: WARNING/MainProcess] /home/airflow/.local/lib/python3.12/site-packages/celery/worker/consumer/consumer.py:508: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine whether broker connection retries are made during startup in Celery 6.0 and above. If you wish to retain the existing behavior for retrying connections on startup, you should set broker_connection_retry_on_startup to True.
  warnings.warn(
[2024-07-12 14:20:02,657: INFO/MainProcess] mingle: searching for neighbors
[2024-07-12 14:20:03,697: INFO/MainProcess] mingle: all alone
[2024-07-12 14:20:03,787: INFO/MainProcess] celery@fdf54261b43e ready.
[2024-07-12 14:23:34,221: INFO/MainProcess] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[6ec4e79c-3488-4a10-b99f-1c2b47bcbb35] received
[2024-07-12 14:23:34,506: INFO/ForkPoolWorker-15] [6ec4e79c-3488-4a10-b99f-1c2b47bcbb35] Executing command in Celery: ['airflow', 'tasks', 'run', 'blender_test', 'create_blend_file', 'manual__2024-07-12T21:23:31.163038+00:00', '--local', '--subdir', 'DAGS_FOLDER/blender_test.py']
[2024-07-12 14:23:35,814: INFO/ForkPoolWorker-15] Filling up the DagBag from /opt/airflow/dags/blender_test.py
[2024-07-12 14:23:50,109: INFO/ForkPoolWorker-15] Running on host fdf54261b43e
Traceback (most recent call last):
  File "/home/airflow/.local/share/pdm/python/cpython@3.11.9/lib/python3.11/importlib/metadata/__init__.py", line 563, in from_name
    return next(cls.discover(name=name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 6, in <module>
  File "/home/airflow/.local/share/pdm/python/cpython@3.11.9/lib/python3.11/importlib/metadata/__init__.py", line 1009, in version
    return distribution(distribution_name).version
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/share/pdm/python/cpython@3.11.9/lib/python3.11/importlib/metadata/__init__.py", line 982, in distribution
    return Distribution.from_name(distribution_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/share/pdm/python/cpython@3.11.9/lib/python3.11/importlib/metadata/__init__.py", line 565, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for apache-airflow
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pendulum'
[2024-07-12 14:23:52,860: INFO/ForkPoolWorker-15] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[6ec4e79c-3488-4a10-b99f-1c2b47bcbb35] succeeded in 18.6186027990002s: None
```
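As a further check, I think a probe task along these lines (a sketch reusing HZ_VENV_PYTHON_PATH from the test dag; the find_spec checks are just for illustration) would show which interpreter actually executes the function body and what resolves there:

```python
import os

from airflow.decorators import task

HZ_VENV_PYTHON_PATH: str = os.environ.get("HZ_VENV_PYTHON_PATH")


@task.external_python(task_id="probe_env", python=HZ_VENV_PYTHON_PATH)
def probe_env() -> str:
    """Report which interpreter runs the task body and what imports resolve."""
    import importlib.util
    import sys

    found = {
        name: importlib.util.find_spec(name) is not None
        for name in ("bpy", "pendulum", "airflow")
    }
    return f"{sys.executable} {found}"
```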

Anything else?

I am new to Airflow and am exploring it as an option for integration with our 3D pipeline. I imagine that I am just doing something incorrectly, but I haven't been able to figure out what is wrong, and as far as I can tell, I am doing things pretty much the same way as the other examples I have seen.

Are you willing to submit a PR?

Code of Conduct

boring-cyborg[bot] commented 2 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.

eladkal commented 2 months ago

> It seems like Airflow is trying to run the full DAG script inside the virtualenv, which does not have airflow or pendulum installed in it.

It doesn't make much sense to use Airflow imports inside a virtualenv created to run a task. As for pendulum, if you need it, add it to the requirements of the env.
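For example, with a dynamically created environment the dependency can be declared inline, as in this sketch; for a pre-built venv like yours, the equivalent is pip-installing pendulum into that venv in the Dockerfile:

```python
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.today("UTC"), catchup=False)
def venv_requirements_example():
    @task.virtualenv(task_id="use_pendulum", requirements=["pendulum"])
    def use_pendulum() -> str:
        # The requirements list is installed into a venv created for the
        # task run, so pendulum is importable inside the function body.
        import pendulum

        return pendulum.now("UTC").isoformat()

    use_pendulum()


venv_requirements_example()
```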

Converting to discussion as this is a technical question rather than a bug report.