apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

ObjectStorage jinja template is converting GCS to a different path type #42528

Open nolan-redox opened 4 hours ago

nolan-redox commented 4 hours ago

Apache Airflow version

2.10.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

When using the new Jinja templating in ObjectStorage, the protocol of the string appears to be misinterpreted: a forward slash "/" is prepended to my GCS bucket name. Perhaps the protocol is evaluated before the Jinja string is rendered.

What you think should happen instead?

When using Jinja templating for a Variable in ObjectStorage, I'm getting the following: /gs:/redox-n-airflow-workspace-us-central1

rather than the correct variable value: gs://redox-n-airflow-workspace-us-central1

How to reproduce

import logging
import os

import pendulum

from airflow.decorators import dag, task
from airflow.io.path import ObjectStoragePath
from airflow.models import Variable

logger = logging.getLogger(__name__)

@dag(
    dag_id='objectstorage_jinja_variable',
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["testing"]
)
def dag_example():

    @task
    def print_variable(path: ObjectStoragePath):
        # Read the same Variable directly, for comparison with the templated path.
        variable_path = Variable.get("gs_analytics_bucket")
        # Built inside the task body, so this template string is never rendered.
        template_in_task_path = ObjectStoragePath("{{var.value.gs_analytics_bucket}}", conn_id="google_cloud_default")

        logger.debug(f"Jinja Path: {path}")
        logger.debug(f"Variable Path: {variable_path}")
        logger.debug(f"Template in Task Path: {template_in_task_path}")
        logger.debug(f"ENV VAR: {os.getenv('AIRFLOW_VAR_GS_ANALYTICS_BUCKET')}")

    # Templated path passed as a task argument; this is the case that renders incorrectly.
    base = ObjectStoragePath("{{var.value.gs_analytics_bucket}}", conn_id="google_cloud_default")
    print_variable(base)

dag_example()

Logging prints the following:

DEBUG - Jinja Path: /gs:/redox-n-airflow-workspace-us-central1 
DEBUG - Variable Path: gs://redox-n-airflow-workspace-us-central1
DEBUG - Template in Task Path: /{{var.value.gs_analytics_bucket}}
DEBUG - ENV VAR: gs://redox-n-airflow-workspace-us-central1
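
For anyone trying to follow the hypothesis, here is a minimal standard-library sketch of how the first log line could arise if the protocol is decided before the template is rendered. It does not use Airflow's real code path: `infer_protocol`, `render`, `parse_then_render` and `render_then_parse` are hypothetical stand-ins, and treating a scheme-less string as an absolute local path that is later normalized is an assumption.

```python
import posixpath
from urllib.parse import urlsplit

BUCKET = "gs://redox-n-airflow-workspace-us-central1"
TEMPLATE = "{{var.value.gs_analytics_bucket}}"


def infer_protocol(raw: str) -> str:
    """Hypothetical stand-in: return the URL scheme, or '' if there is none."""
    return urlsplit(raw).scheme


def render(raw: str) -> str:
    """Hypothetical stand-in for Jinja rendering of the single variable used here."""
    return raw.replace(TEMPLATE, BUCKET)


def parse_then_render(raw: str) -> str:
    # Assumption: the protocol is inferred from the unrendered string.
    if not infer_protocol(raw):
        # No scheme is visible yet, so the string is kept as an absolute local path.
        local_path = "/" + raw
        # Rendering afterwards only substitutes text; normalizing the result
        # collapses the doubled slash that came from "gs://".
        return posixpath.normpath(render(local_path))
    return render(raw)


def render_then_parse(raw: str) -> str:
    # Rendering first means the scheme is visible when the protocol is inferred.
    rendered = render(raw)
    if infer_protocol(rendered) != "gs":
        raise ValueError(f"unexpected protocol in {rendered!r}")
    return rendered


if __name__ == "__main__":
    print(parse_then_render(TEMPLATE))
    print(render_then_parse(TEMPLATE))
```

Under these assumptions the first call reproduces the "/gs:/..." form seen in the Jinja Path log line, while the second returns the value unchanged.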

EDIT: I tested the theory that the protocol is interpreted before rendering, and the output below suggests that's the case. It wasn't the output I expected, though, because it added google_cloud_default@; I'm guessing based on this str

hardcode_protocol = ObjectStoragePath("gs://{{var.value.gs_analytics_bucket}}", conn_id="google_cloud_default")
print_variable(hardcode_protocol)

DEBUG - Jinja Path: gs://google_cloud_default@gs://redox-n-airflow-workspace-us-central1/
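
The same ordering would also explain this output. A rough sketch, again with only the standard library: `parse_before_render` is a hypothetical helper, and the `protocol://conn_id@bucket/` string layout is an assumption chosen to match the log line, not a confirmed description of how ObjectStoragePath builds its string form.

```python
from urllib.parse import urlsplit

BUCKET_VALUE = "gs://redox-n-airflow-workspace-us-central1"
BUCKET_TEMPLATE = "{{var.value.gs_analytics_bucket}}"


def parse_before_render(raw: str, conn_id: str) -> str:
    """Hypothetical: split the URL first, then render the template inside it."""
    parts = urlsplit(raw)
    protocol = parts.scheme  # "gs", taken from the hardcoded prefix
    bucket = parts.netloc    # still the unrendered template at this point
    # Rendering now places the full "gs://..." value where only a bucket
    # name was expected.
    rendered_bucket = bucket.replace(BUCKET_TEMPLATE, BUCKET_VALUE)
    # Assumed string layout: protocol://conn_id@bucket/
    return f"{protocol}://{conn_id}@{rendered_bucket}/"


if __name__ == "__main__":
    print(parse_before_render("gs://" + BUCKET_TEMPLATE, "google_cloud_default"))
```

If that is what happens, the Variable would need to hold only the bucket name for the hardcoded-protocol form to render cleanly, which is not what I want.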

Operating System

linux amd64

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

Open to it, but have never contributed. The jinja templating is a bit magical to me and if it's happening in the wrong order, I'm unsure it's an easy enough fix for someone of my skill level.

boring-cyborg[bot] commented 4 hours ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

eladkal commented 1 hour ago

cc @bolkedebruin