apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

airflow jinja template render error #33694

Closed jaegwonseo closed 10 months ago

jaegwonseo commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

version 2.6.2

An error occurs when *.json is included in the parameters of BigQueryInsertJobOperator.

to_gcs_task = BigQueryInsertJobOperator(
        dag=dag,
        task_id='to_gcs',
        gcp_conn_id='xxxx',
        configuration={
            "extract": {
                # The error occurs here.
                "destinationUris": ['gs://xxx/yyy/*.json'],
                "sourceTable": {
                    "projectId": "abc",
                    "datasetId": "def",
                    "tableId": "ghi"
                },
                "destinationFormat": "NEWLINE_DELIMITED_JSON"
            }
        }
    )

error log

jinja2.exceptions.TemplateNotFound: gs://xxx/yyy/*.json

What you think should happen instead

According to the airflow.template.templater source: https://github.com/apache/airflow/blob/main/airflow/template/templater.py#L152

if isinstance(value, str):
    if any(value.endswith(ext) for ext in self.template_ext):  # A filepath.
        template = jinja_env.get_template(value)
    else:
        template = jinja_env.from_string(value)
    return self._render(template, context)

In the templater source, if a string value ends with .json or .sql (the operator's template extensions), Airflow attempts to read it as a template file by calling jinja_env.get_template.
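The failure can be reproduced with plain jinja2, since the lookup goes through a file-system loader; the loader path below is an arbitrary stand-in for the DAG folder:

import jinja2

# Airflow builds a Jinja environment whose loader searches the DAG folder
# (plus any template_searchpath). A value ending in a template extension
# such as ".json" is passed to get_template(), i.e. treated as a file path.
env = jinja2.Environment(loader=jinja2.FileSystemLoader("/path/to/dags"))

try:
    env.get_template("gs://xxx/yyy/*.json")  # interpreted as a file, not a URI
except jinja2.exceptions.TemplateNotFound as exc:
    print("TemplateNotFound:", exc)  # TemplateNotFound: gs://xxx/yyy/*.json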

How to reproduce

Call BigQueryInsertJobOperator with the configuration shown above.

Operating System

m2 mac

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

nathadfield commented 1 year ago

@jaegwonseo Thanks for logging this; I have been able to replicate it. Basically this means that we cannot extract a BigQuery table to newline-delimited JSON files with the extension .json. As a workaround, I think you should be able to extract the table if you don't specify the extension.
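For instance, a wildcard destination without the .json suffix (bucket and table names are placeholders from the issue) no longer matches the operator's template extensions:

configuration = {
    "extract": {
        # No ".json" suffix, so the value is rendered inline instead of being
        # looked up as a template file; BigQuery still shards the export
        # across files matching the wildcard.
        "destinationUris": ["gs://xxx/yyy/part-*"],
        "sourceTable": {"projectId": "abc", "datasetId": "def", "tableId": "ghi"},
        "destinationFormat": "NEWLINE_DELIMITED_JSON",
    }
}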

SamWheating commented 1 year ago

This is expected behaviour - the BigQueryInsertJobOperator will treat anything in one of the template fields (configuration, job_id, impersonation_chain, project_id) ending with .sql or .json as a reference to a local file, which it will then try to load.

This is defined here: https://github.com/apache/airflow/blob/46fa5a2743c0c864f5282abd6055c5418585955b/airflow/providers/google/cloud/operators/bigquery.py#L2697-L2706
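For context, a minimal sketch of what this file-loading behaviour is meant to support; the query file and task details are assumptions for illustration:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# A string value ending in ".sql" is read from the DAG folder (or a configured
# template_searchpath) and its contents are rendered by Jinja before the job runs.
query_task = BigQueryInsertJobOperator(
    task_id="run_query",
    gcp_conn_id="xxxx",
    configuration={
        "query": {
            "query": "queries/my_query.sql",  # hypothetical file next to the DAG
            "useLegacySql": False,
        }
    },
)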

I'm not sure if there's a preferred way around this, but I would just subclass the operator and remove the extension:

class BigQueryInsertJobOperatorNoTemplateExt(BigQueryInsertJobOperator):
    # An empty template_ext stops values ending in .sql/.json from being loaded as files
    template_ext = []

nathadfield commented 1 year ago

@SamWheating I wonder if it might be possible to prevent the renderer from attempting to load a file for named keys in configuration; in this case destinationUris.

jaegwonseo commented 1 year ago

How about adding a no_template_fields parameter to the BaseOperator constructor?

In this case, no_template_fields = ("configuration.extract.destinationUris",)

class BaseOperator(AbstractOperator, metaclass=BaseOperatorMeta):
    def __init__(
        self,
        ...,  # existing parameters
        no_template_fields: Sequence[str] = (),
    ):
        ...

and then skip rendering for those fields in https://github.com/apache/airflow/blob/main/airflow/template/templater.py#L152

nathadfield commented 1 year ago

@jaegwonseo Why not give it a try? I think we would still want the filenames in destinationUris to be rendered in case they look something like gs://my-project/my-bucket/{{ ds }}/*.json
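For illustration, with plain jinja2 (outside Airflow's templater, which would first try to load the .json-suffixed value as a file), such a URI would render like this, assuming an example ds value:

from jinja2 import Environment

uri = Environment().from_string("gs://my-project/my-bucket/{{ ds }}/*.json")
print(uri.render(ds="2023-08-24"))  # gs://my-project/my-bucket/2023-08-24/*.json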

SamWheating commented 1 year ago

I wonder if it might be possible to prevent the renderer from attempting to load a file for named keys in configuration; in this case destinationUris.

I think that this would get pretty complicated pretty fast, as a dict argument like configuration has no enforced structure, so we'd need some jq-like way of indicating which maybe-present fields should and should not be subjected to templating.

Additionally, we'd have to be really specific and only disable this on fields in which we know a user would never want to apply this sort of templating. I think in the case of destinationUris (which accepts a list of URIs) there is a possible use case for using a .json file rather than an inlined list of strings, so I wouldn't want to disable this templating feature for everyone.

Thoughts?

potiuk commented 1 year ago

I think the simplest approach would be to add a feature to disable the templating engine - either for all fields, for specific fields, or maybe just for specific extensions. I think we could couple this with something we should have had for quite some time - i.e. the ability to dynamically change the list of templated fields and template extensions by overriding them on the task instance.

There is fundamentally no problem with disabling or enabling it selectively "per task". It could even be as simple as template_fields and template_ext specified in the constructor of BaseOperator.

It's a little bit more complex than just overriding it in classes; it would also have to be overridable in the serialized form of the DAG - i.e. when template_fields are rendered in the UI. But I do not think there is anything that would prevent us from doing it.

SamWheating commented 1 year ago

Also, one more suggestion (not sure why I didn't think of this earlier today): I think this simple fix allows you to selectively disable template fields and extensions on a task-by-task basis, since template_ext is an attribute on the operator class that is not exposed in the __init__ method:

to_gcs_task = BigQueryInsertJobOperator(
        dag=dag,
        task_id='to_gcs',
        gcp_conn_id='xxxx',
        configuration={
            "extract": {
                "destinationUris": ['gs://xxx/yyy/*.json'],
                "sourceTable": {
                    "projectId": "abc",
                    "datasetId": "def",
                    "tableId": "ghi"
                },
                "destinationFormat": "NEWLINE_DELIMITED_JSON"
            }
        }
    )

to_gcs_task.template_ext = ()  # Don't try to load up .json or .sql files 

I tested it locally and it appeared to work just fine, both at run-time and when displaying the rendered-task-instances in the UI:

[screenshot of the rendered task instance in the UI]

With this in mind, is there still a need to surface overrides in the __init__ method? Or can we just recommend this as a workaround (and document it as such)?

potiuk commented 1 year ago

Yeah, interesting approach. I can't think of a bad side effect - this should work not only for parsing and execution of tasks but should also serialize well for the UI.

It appears to be a solution that we never thought of, but was always possible. And it's super appealing to have a problem that can be solved by documenting it. Let me summon a few people to see what they think: @dstandish @uranusjr @hussein-awala @eladkal - WDYT?

I find it deceptively simple but also - quite surprisingly - pretty workable. We had a number of discussions on whether we should make all fields templatable by default etc., but it seems that this simple trick might do the job and we could make it "official".

See the comment by @SamWheating: https://github.com/apache/airflow/issues/33694#issuecomment-1692508666

potiuk commented 1 year ago

BTW, no wonder you have not thought about it, @SamWheating. It seems no one did for the last few years I've been around. I kept explaining to users how easy it is to just extend existing operators and modify template_fields and template_ext, but it never occurred to me that we could modify them on the tasks during DAG construction. Apparently, we can.

SamWheating commented 1 year ago

Any objection to just adding this to the documentation as a footnote under the Template Rendering section?

potiuk commented 1 year ago

I have absolutely no objections; even more, I would love that to become an "official" way of modifying templated fields/template_ext on the fly - as long as we get a few more people to say "well, indeed that does not seem to have any side effects" :D

eladkal commented 1 year ago

Didn't we have the exact same bug when we added .json as a template extension for KubernetesPodOperator? https://github.com/apache/airflow/issues/16922

uranusjr commented 1 year ago

Maybe the easiest way out would be to add a Literally wrapper that disables rendering? So you could write:

BigQueryInsertJobOperator(
    ...,
    configuration={
        "extract": {
            "destinationUris": [Literally("gs://xxx/yyy/*.json")],
            ...,
        },
    },
)

Please do bikeshed on the name. I’m intentionally avoiding Literal to avoid confusion with typing.Literal.

potiuk commented 1 year ago

Please do bikeshed on the name. I’m intentionally avoiding Literal to avoid confusion with typing.Literal.

I like the idea.

michalsosn commented 11 months ago

How about LiteralValue? It appears that I have successfully implemented it (link). However, the true challenge is in making its usage clear to users.
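For reference, a rough sketch of what such a wrapper could look like; this is illustrative only, not the actual Airflow implementation, and the class and method names are assumptions:

from typing import Any

class LiteralValue:
    """Hypothetical marker: wrap a value so the templater passes it through unrendered."""

    def __init__(self, value: Any) -> None:
        self.value = value

    def resolve(self, context: dict) -> Any:
        # Return the wrapped value untouched instead of rendering it with Jinja.
        return self.value

# The templater's render logic would then need a branch ahead of the string
# checks, conceptually:
#     if isinstance(value, LiteralValue):
#         return value.resolve(context)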