Closed muscovitebob closed 4 years ago
Thanks for reporting this issue. We are aware of this limitation and we believe it is already solved on master. If you wish to use the new operators, please check https://github.com/apache/airflow#backport-packages
Here's a guide for the new operators: https://airflow.readthedocs.io/en/latest/howto/operator/google/cloud/bigquery.html
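For reference, on Airflow 1.10.x the package is installed from PyPI, after which the new import path should resolve (a quick check, assuming a working install):
pip install apache-airflow-backport-providers-google
python -c "from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator"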
Thanks for getting back to me so quickly @turbaszek. I will try installing the backport package for the Google operators on my Cloud Composer instance.
I seem to be unable to import from the backports package after installing it for local testing.
(.venv) user@IMB dir % pip freeze
airflow-plugins==0.0.0
alembic==1.4.2
apache-airflow==1.10.11
apache-airflow-backport-providers-google==2020.6.24
...
(.venv) user@IMB dir % python
Python 3.7.7 (default, Mar 10 2020, 15:43:33)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'airflow.providers'
Is there a trick or step I am missing here?
I've installed the packages via python setup.py install; my setup.py looks like:
import setuptools

dependencies = [
    "apache-airflow[gcp]~=1.10.6",
    "pymongo~=3.10.1",
    "google-cloud-bigquery~=1.25.0",
    "google-cloud-storage~=1.25.0",
    "apache-airflow-backport-providers-google==2020.6.24",
]

setuptools.setup(
    name="airflow-plugins",  # the package name shown in the pip freeze above
    install_requires=dependencies,
    packages=setuptools.find_packages(),
    python_requires=">=3.7",
    version="0.0.0",
)
I see this is mentioned in the readme, actually. Looking into it.
I fixed this error by switching to a requirements.txt file and installing these dependencies via pip install -r requirements.txt, for now.
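For anyone following along, the equivalent requirements.txt (same pins as in the setup.py above) would be:
apache-airflow[gcp]~=1.10.6
pymongo~=3.10.1
google-cloud-bigquery~=1.25.0
google-cloud-storage~=1.25.0
apache-airflow-backport-providers-google==2020.6.24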
I see that the location attribute is indeed correctly obeyed with the plug-and-play replacement for BigQueryOperator in the backports package, BigQueryExecuteQueryOperator, in my local testing.
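For illustration, the swap looks roughly like this (a minimal sketch; the DAG id, task_id, and table names are placeholders of mine, not from my actual DAGs):
from airflow import DAG
from airflow.utils.dates import days_ago
# old, location-ignoring import:
#   from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

with DAG("bq_location_test", start_date=days_ago(1), schedule_interval=None) as dag:
    query = BigQueryExecuteQueryOperator(
        task_id="run_query",
        sql="SELECT * FROM my_dataset.my_table",
        use_legacy_sql=False,
        location="EU",  # honoured by the backport operator
    )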
And I am facing the same ModuleNotFoundError after installing the backports package onto Cloud Composer via the standard PyPI package installation mechanism and attempting to import it in a DAG. Looking into how to perform the symlink from the readme on the Composer environment.
At the moment I have a scheduler pod and several Celery executor worker pods running in Composer. I have created a symlink inside the scheduler pod to Google's custom mounted Airflow version:
airflow@airflow-scheduler-7bd674b5d9-sb68h:/usr/local/lib/airflow/airflow$ pip freeze | grep airflow
# Editable install with no version control (apache-airflow===1.10.6-composer)
-e /usr/local/lib/airflow
apache-airflow-backport-providers-google==2020.6.24
airflow@airflow-scheduler-7bd674b5d9-sb68h:~$ cd /usr/local/lib/airflow/airflow
airflow@airflow-scheduler-7bd674b5d9-sb68h:/usr/local/lib/airflow/airflow$ sudo ln -s /opt/python3.6/lib/python3.6/site-packages/airflow/providers providers
This gets rid of the import warning, but I now get "DAG seems to be missing" in the GUI when I try to navigate to the DAG page. In the scheduler logs I see instances of the following uninformative error:
[2020-07-27 14:56:55,216] {dagbag.py:246} ERROR - Failed to import: /home/airflow/gcs/dags/dag.py
File "/home/airflow/gcs/dags/dag.py", line 6, in <module>
[2020-07-27 14:57:07,527] {dagbag.py:407} INFO - Filling up the DagBag from /home/airflow/gcs/dags/dag.py
There are no errors specific to the DAG in question in the worker logs, as far as I can see. I suspect that I also need to create the same symlink on the workers for this to work.
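If it comes to that, something along these lines should work per worker (a sketch; the namespace and pod names are placeholders, and the paths are the ones from my scheduler session above):
kubectl get pods --namespace=<composer-namespace> | grep airflow-worker
kubectl exec -it <airflow-worker-pod> --namespace=<composer-namespace> -- \
    sudo ln -s /opt/python3.6/lib/python3.6/site-packages/airflow/providers \
               /usr/local/lib/airflow/airflow/providers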
Of course, this setup is rather brittle, as the symlink will be destroyed with each pod rotation, which is especially common for the workers.
I know this is now a Cloud Composer-specific issue, so I will migrate further exploration of making this work to the Cloud Composer User Group. Please see here for the discussion.
Closing as the reported problem itself seems to be solved.
@muscovitebob please upgrade Cloud Composer to the latest version.
Old environments do not support these packages.
June 24, 2020: Airflow Providers can now be installed inside Cloud Composer.
Thanks for the heads up @mik-laj, I indeed had some success after upgrading as I mentioned in the linked Composer User Group post :)
Apache Airflow version: composer-1.10.4-airflow-1.10.6
Kubernetes version (if you are using kubernetes) (use kubectl version):

What happened and what you think went wrong:
BigQueryOperator does not use the location parameter to specify the query job location. Instead, it retrieves the automatically determined location from the HTTP request. This happens because of the following code:
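(The code block from the original report has not survived on this page. The following is a paraphrased sketch of the hook behaviour being described, written against the google-api-python-client style the 1.10.x hook is built on; it is not the verbatim Airflow source.)
from googleapiclient.discovery import build

def run_query_sketch(project_id, sql, use_legacy_sql=False):
    # Paraphrase of the 1.10.x BigQuery hook's query submission path.
    service = build("bigquery", "v2")
    configuration = {
        "query": {
            "query": sql,
            "useLegacySql": use_legacy_sql,
            # Note: no "location" key is ever added to this block.
        }
    }
    job_data = {"configuration": configuration}
    # Since neither the configuration nor a jobReference carries a location,
    # the BigQuery service auto-detects one; query_reply then reflects
    # whatever location the backend chose.
    query_reply = service.jobs().insert(projectId=project_id, body=job_data).execute()
    return query_reply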
The configuration block does not contain a location. The subsequent call in query_reply apparently triggers some internal BigQuery logic to detect the location. In practice this falls back to US more often than not, causing the job to quit with an error saying that the datasets/tables referenced in the query do not exist. Specifying the location argument, e.g. location='EU', in the operator is thus not obeyed.

What you expected to happen: Specifying location as a BigQueryOperator argument leads to execution of the query job in the correct location.

How to reproduce it: Set up a project and dataset in EU containing an example table.
Then, with an initialised local Airflow (airflow initdb) that has been supplied with GCP/BigQuery default connection details, you may run the following code:
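(The original code block is missing here; the following is a minimal repro sketch under the same assumptions, with a hypothetical DAG id, dataset, and table name of mine.)
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.utils.dates import days_ago

with DAG("bq_location_repro", start_date=days_ago(1), schedule_interval=None) as dag:
    # my_dataset.my_table stands in for the EU dataset and table
    # created in the setup step above.
    query = BigQueryOperator(
        task_id="run_query",
        sql="SELECT * FROM my_dataset.my_table",
        use_legacy_sql=False,
        location="EU",  # the argument that is not obeyed
    )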
The location parameter will probably not be respected; instead, your job will execute in EU. Occasionally, regardless of the location specified, your job will execute in US. This is difficult to reliably reproduce, as it appears to be flaky and to depend on which location the BigQuery service itself decides the query should run in.