Closed by @jaegwonseo 10 months ago
@jaegwonseo Thanks for logging this; I have been able to replicate it. Basically this means that we cannot extract a BigQuery table to JSON-delimited files with the extension `.json`. As a workaround, I think you should be able to extract the table if you don't specify the extension.
This is expected behaviour: the `BigQueryInsertJobOperator` will treat anything in one of the template fields (`configuration`, `job_id`, `impersonation_chain`, `project_id`) ending with `.sql` or `.json` as a reference to a local file, which it will then try to load.
This is defined here: https://github.com/apache/airflow/blob/46fa5a2743c0c864f5282abd6055c5418585955b/airflow/providers/google/cloud/operators/bigquery.py#L2697-L2706
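Conceptually, the check works like this (a minimal stand-alone sketch of the behaviour described above, not Airflow's actual implementation):

```python
# Minimal sketch: a templated string ending in one of the operator's
# template_ext suffixes is treated as the path of a template file to load.
TEMPLATE_EXT = (".sql", ".json")

def looks_like_template_file(value) -> bool:
    """Would the renderer try to open this value as a local file?"""
    return isinstance(value, str) and value.endswith(TEMPLATE_EXT)

# A wildcard destination URI trips the check, so the renderer tries to
# load a local file named "gs://xxx/yyy/*.json" and fails.
print(looks_like_template_file("gs://xxx/yyy/*.json"))  # True
print(looks_like_template_file("gs://xxx/yyy/*"))       # False
```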
I'm not sure if there's a preferred way around this, but I would just subclass the operator and remove the extensions:

```python
class BigQueryInsertJobOperatorNoTemplateExt(BigQueryInsertJobOperator):
    template_ext = []
```
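This works because `template_ext` is an ordinary class attribute, so a subclass can replace it wholesale. A library-free illustration (class names here are stand-ins for the Airflow operators):

```python
# Stand-in classes illustrating the subclass override: the child class
# shadows the parent's template_ext with an empty sequence.
class FakeOperator:
    template_ext = (".sql", ".json")

class FakeOperatorNoTemplateExt(FakeOperator):
    template_ext = ()

print(FakeOperator.template_ext)               # ('.sql', '.json')
print(FakeOperatorNoTemplateExt.template_ext)  # ()
```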
@SamWheating I wonder if it might be possible to prevent the renderer from attempting to load a file for named keys in `configuration`; in this case `destinationUris`.
How about adding a `no_template_fields` parameter to the `BaseOperator` constructor? In this case, `no_template_fields = ("configuration.extract.destinationUris",)`:

```python
class BaseOperator(AbstractOperator, metaclass=BaseOperatorMeta):
    def __init__(
        xxx,
        yyy,
        no_template_fields: Sequence[str] = (),
    ):
```

and skip rendering from https://github.com/apache/airflow/blob/main/airflow/template/templater.py#L152
@jaegwonseo Why not give it a try? I think we would still want the filenames in `destinationUris` to be rendered in case they look something like `gs://my-project/my-bucket/{{ ds }}/*.json`.
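For illustration, rendering such a URI with `jinja2` directly (a stand-alone sketch, not Airflow's rendering pipeline; the `ds` value is made up):

```python
# Why templating destinationUris is still useful: macros like {{ ds }}
# must be substituted at run time. Here we render with plain jinja2.
from jinja2 import Environment

uri = "gs://my-project/my-bucket/{{ ds }}/*.json"
rendered = Environment().from_string(uri).render(ds="2023-08-24")
print(rendered)  # gs://my-project/my-bucket/2023-08-24/*.json
```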
> I wonder if it might be possible to prevent the renderer from attempting to load a file for named keys in `configuration`; in this case `destinationUris`.
I think that this would get pretty complicated pretty fast, as a dict argument like `configuration` has no enforced structure, so we'd need some jq-like way of indicating which maybe-present fields should and should not be subjected to templating.

Additionally, we'd have to be really specific to only disable this on fields in which we know a user would never ever want to apply this sort of formatting. I think in the case of `destinationUris` (which accepts a list of URIs) there is a possible use case for using a `.json` file rather than an inlined list of strings, so I wouldn't want to disable this templating feature for everyone.
Thoughts?
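A rough sketch of what such a jq-like exclusion could look like; all names here (`should_template`, `walk`, the dotted-path convention) are hypothetical, not an existing Airflow API:

```python
# Match dotted paths against nested dict keys and skip templating for
# excluded leaves. Lists share their parent's path, matching the
# destinationUris case (a list of URI strings).
def should_template(path, no_template_paths):
    return path not in no_template_paths

def walk(obj, path="", excluded=frozenset()):
    """Yield (path, value) for each leaf that would still be templated."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from walk(val, f"{path}.{key}" if path else key, excluded)
    elif isinstance(obj, list):
        for item in obj:
            yield from walk(item, path, excluded)
    elif should_template(path, excluded):
        yield path, obj

config = {"extract": {"destinationUris": ["gs://xxx/yyy/*.json"],
                      "destinationFormat": "NEWLINE_DELIMITED_JSON"}}
leaves = dict(walk(config, excluded={"extract.destinationUris"}))
print(leaves)  # {'extract.destinationFormat': 'NEWLINE_DELIMITED_JSON'}
```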
I think the simplest approach would be to add a feature to disable the templating engine, either for all fields, for specific fields, or maybe just for specific extensions. I think we could couple this with something that we should have had for quite some time, i.e. the ability to dynamically change the list of templated fields and template extensions by overriding them on the instance of the task.

There is fundamentally no problem with disabling or enabling it selectively per task. It could even be as simple as `template_fields` and `template_ext` specified in the constructor of `BaseOperator`.

It's a little bit more complex than just overriding it in classes; it would also have to be overridable in the serialized form of the DAG, i.e. when template_fields are rendered in the UI. But I do not think there is anything that would prevent us from doing it.
Also one more suggestion (not sure why I didn't think of this earlier today): I think that this simple fix allows you to selectively disable template fields and extensions on a task-by-task basis, since `template_ext` is an attribute on the operator class but not exposed in the `__init__` method:
```python
to_gcs_task = BigQueryInsertJobOperator(
    dag=dag,
    task_id='to_gcs',
    gcp_conn_id='xxxx',
    configuration={
        "extract": {
            "destinationUris": ['gs://xxx/yyy/*.json'],
            "sourceTable": {
                "projectId": "abc",
                "datasetId": "def",
                "tableId": "ghi"
            },
            "destinationFormat": "NEWLINE_DELIMITED_JSON"
        }
    }
)
to_gcs_task.template_ext = ()  # Don't try to load up .json or .sql files
```
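The reason this is per-task is plain Python attribute shadowing: assigning `template_ext` on the instance hides the class attribute for that object only (sketch with a stand-in class, not the real operator):

```python
# Instance attributes shadow class attributes, so only the task that
# received the assignment loses its template extensions.
class FakeOperator:
    template_ext = (".sql", ".json")

task_a = FakeOperator()
task_b = FakeOperator()
task_a.template_ext = ()    # per-instance override

print(task_a.template_ext)  # ()
print(task_b.template_ext)  # ('.sql', '.json') -- class default untouched
```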
I tested it locally and it appeared to work just fine, both at run-time and when displaying the rendered-task-instances in the UI.
With this in mind, is there still a need to surface overrides in the `__init__` method? Or can we just recommend this as a workaround (and document it as such)?
Yeah, interesting approach. I can't think of a bad side effect of it: this should not only work for parsing and execution of tasks, but it should also serialize well for the UI and appears to have no unintended side effects.

It appears to be a solution that we never thought of, but was always possible. And it's super appealing to have a problem that can be solved by documenting it. Let me summon a few people to see what they think: @dstandish @uranusjr @hussein-awala @eladkal, WDYT?

I find it deceptively simple, but also, quite surprisingly, pretty workable. We had a number of discussions on whether we should make all fields templateable by default, etc., but it seems that this simple trick might do the job and we could make it "official".
See https://github.com/apache/airflow/issues/33694#issuecomment-1692508666 comment by @SamWheating
BTW, no wonder @SamWheating you had not thought about it. It seems no one did for the last few years I was around. I kept explaining to users how easy it is to just extend existing operators and modify `template_fields` and `template_ext`, but it never occurred to me that we could modify them on the tasks during DAG construction. Apparently, we can.
Any objection to just adding this to the documentation as a footnote under the Template Rendering section?
I have absolutely no objections; even more, I would love that as an "official" way of modifying `template_fields`/`template_ext` on the fly, as long as we get a few more people to say "well, indeed that does not seem to have side effects" :D
Didn't we have the exact same bug when we added `.json` as a templated extension for `KubernetesPodOperator`? https://github.com/apache/airflow/issues/16922
Maybe the easiest way out would be to add a `Literally` wrapper that disables rendering? So you can write:

```python
BigQueryInsertJobOperator(
    ...,
    configuration={
        "extract": {
            "destinationUris": [Literally("gs://xxx/yyy/*.json")],
            ...,
        },
    },
)
```
Please do bikeshed on the name. I'm intentionally avoiding `Literal` to avoid confusion with `typing.Literal`.
> Please do bikeshed on the name. I'm intentionally avoiding `Literal` to avoid confusion with `typing.Literal`.
I like the idea.
How about `LiteralValue`?
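For illustration, a hypothetical sketch of the wrapper idea (not the implementation that actually landed in Airflow): the renderer would unwrap the value and skip all further templating for it.

```python
# Hypothetical LiteralValue wrapper: wrapped values bypass the
# extension check and are passed through untouched.
class LiteralValue:
    def __init__(self, value):
        self.value = value

TEMPLATE_EXT = (".sql", ".json")

def render(value):
    if isinstance(value, LiteralValue):
        return value.value  # skip templating entirely
    if isinstance(value, str) and value.endswith(TEMPLATE_EXT):
        raise FileNotFoundError(f"would try to load template file {value!r}")
    return value

print(render(LiteralValue("gs://xxx/yyy/*.json")))  # gs://xxx/yyy/*.json
```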
It appears that I have successfully implemented it (link).
However, the true challenge is in ensuring its clear usage for users.
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
version 2.6.2
An error occurs when *.json is included in the parameters of BigQueryInsertJobOperator.
error log
What you think should happen instead
According to the airflow.template.templater source: https://github.com/apache/airflow/blob/main/airflow/template/templater.py#L152
In the Jinja template source, if a value ends with `.json` or `.sql`, an attempt is made to read a resource file by calling `jinja_env.get_template`.
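A stand-alone sketch of that failure mode using `jinja2` directly: the URI string is passed to `get_template()` as if it were a template filename, and the loader raises `TemplateNotFound` because no such local file exists.

```python
# Passing a GCS URI to get_template() treats it as a local template
# path; since no file with that name exists, TemplateNotFound is raised.
from jinja2 import Environment, FileSystemLoader, TemplateNotFound

env = Environment(loader=FileSystemLoader("."))
try:
    env.get_template("gs://xxx/yyy/*.json")
except TemplateNotFound as exc:
    print("TemplateNotFound:", exc)
```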
How to reproduce
Just call `BigQueryInsertJobOperator` with the configuration I added.
Operating System
m2 mac
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response