apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Azure Data Lake connection will not work for blob.core.windows.net domain #44228

Open vince-vanh opened 1 day ago

vince-vanh commented 1 day ago

Apache Airflow Provider(s)

microsoft-azure

Versions of Apache Airflow Providers

apache-airflow-providers-microsoft-azure 11.1.0

Apache Airflow version

2.9.2

Operating System

Ubuntu 22.04.4

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

Scenario: need to leverage Azure storage for Airflow remote logging.
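For context, Azure remote logging in Airflow is normally enabled through the WASB task handler rather than the Data Lake operators; a minimal sketch of the relevant airflow.cfg section, assuming a container named airflow-logs and a connection id wasb_default (both placeholders):

```ini
[logging]
remote_logging = True
# The "wasb" prefix tells Airflow to use the WASB (Azure Blob) task handler
remote_base_log_folder = wasb-airflow-logs
remote_log_conn_id = wasb_default
```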

Step 1 is verifying that the connection works, so I'm using the ADLSListOperator operator as a test case. On the connection I have set the following properties:

  - Azure Client ID:
  - Azure Client Secret:
  - Azure Tenant ID:
  - Azure DataLake Store Name: <e.g. mystorageaccount>

The store name's fully qualified url is https://mystorageaccount.blob.core.windows.net/

I know the client ID, secret, and tenant ID are all valid: the same credentials successfully authenticate against the storage account when used with the PythonOperator and the azure.storage.blob library. If I try to use the ADLS connection with ADLSListOperator from apache-airflow-providers-microsoft-azure (11.1.0), it fails. The error log indicates it is trying to connect to the wrong domain, e.g. ConnectionError(MaxRetryError("HTTPSConnectionPool(host='none.azuredatalakestore.net'
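A guess at where the "none" in that hostname could come from (purely a hypothesis, not the provider's actual code): if the store-name extra is never read from the connection, formatting a missing value into the legacy host template would produce exactly the host seen in the error log.

```python
# Hypothetical illustration: formatting a missing store name into the
# legacy ADLS Gen1 host template yields the host from the error log.
def legacy_adls_host(store_name):
    # str(None) is "None"; lowercased it becomes "none"
    return f"{str(store_name).lower()}.azuredatalakestore.net"

print(legacy_adls_host(None))           # none.azuredatalakestore.net
print(legacy_adls_host("mystoreacct"))  # mystoreacct.azuredatalakestore.net
```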

The domain azuredatalakestore.net is used only by legacy (Gen1) Azure Data Lake storage accounts. New storage accounts cannot use this domain; they are all served from blob.core.windows.net.

If anyone has successfully used the operator ADLSListOperator against a storage account hosted at blob.core.windows.net, I'd be curious to know the configuration used. The documentation and examples I've found are very sparse or inconsistent.
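For reference, here is the shape of the direct check that does work for me with the same credentials (the baseline the provider fails to match). All names are placeholders, and the SDK imports are deferred so the sketch can be loaded without the Azure packages installed:

```python
def list_containers(account: str, tenant_id: str, client_id: str, client_secret: str):
    """Baseline check using azure.storage.blob directly, which succeeds
    with the same credentials. All argument values are placeholders."""
    # Imported lazily so the sketch loads without the Azure SDKs installed.
    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    credential = ClientSecretCredential(
        tenant_id=tenant_id,
        client_id=client_id,
        client_secret=client_secret,
    )
    # Note the account resolves under blob.core.windows.net, not
    # azuredatalakestore.net.
    account_url = f"https://{account}.blob.core.windows.net"
    client = BlobServiceClient(account_url=account_url, credential=credential)
    return [c.name for c in client.list_containers()]
```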

I've tried the connection types azure_data_lake (as described above) as well as adls and wasb.

What you think should happen instead

I would expect ADLSListOperator to list files; instead it times out, presumably because it is trying to connect to the wrong domain.

How to reproduce

  1. Create a valid Azure storage account that uses the blob.core.windows.net domain (which all new storage accounts on Azure do).
  2. Set up an azure_data_lake connection using a valid client ID, client secret, tenant ID, and account name.
  3. Write a DAG that leverages the ADLSListOperator.

Anything else

Always reproducible; it hasn't worked once.

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 day ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.