apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

ComputeEngineSSHHook on parallel runs in Composer gives banner Error reading SSH protocol banner #29258

Closed vasu2809 closed 1 year ago

vasu2809 commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

We are using ComputeEngineSSHHook for some of our Airflow DAGs in Cloud Composer.

Everything works fine when DAGs run one by one

But when we enable parallelism, with multiple tasks trying to connect to our GCE instance using ComputeEngineSSHHook at the same time, we experience intermittent errors like the one given below.

Since Cloud Composer has 3 retries by default, the issue sometimes resolves itself on the second or third attempt, but we would like to understand why it happens in the first place when multiple operators are generating keys and SSHing into the GCE instance.

We have tried setting the banner_timeout and expire_time parameters on the DAG task, but we still see this issue.

create_transfer_run_directory = SSHOperator(
    task_id="create_transfer_run_directory",
    ssh_hook=ComputeEngineSSHHook(
        instance_name=GCE_INSTANCE,
        zone=GCE_ZONE,
        use_oslogin=True,
        use_iap_tunnel=False,
        use_internal_ip=True,
    ),
    conn_timeout=120,
    cmd_timeout=120,
    banner_timeout=120.0,
    command=f"sudo mkdir -p {transfer_run_directory}/"
    '{{ ti.xcom_pull(task_ids="load_config", key="transfer_id") }}',
    dag=dag,
)

[2023-01-31, 03:30:39 UTC] {compute_ssh.py:286} INFO - Importing SSH public key using OSLogin: user=edw-sa-gcc@pso-e2e-sql.iam.gserviceaccount.com
[2023-01-31, 03:30:39 UTC] {compute_ssh.py:236} INFO - Opening remote connection to host: username=sa_115585236623848451866, hostname=10.128.0.29
[2023-01-31, 03:30:41 UTC] {transport.py:1874} ERROR - Exception (client): Error reading SSH protocol banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most recent call last):
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2271, in _check_banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     buf = self.packetizer.readline(timeout)
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 380, in readline
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     buf += self._read_timeout(timeout)
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 609, in _read_timeout
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     raise EOFError()
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - EOFError
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - During handling of the above exception, another exception occurred:
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most recent call last):
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2094, in run
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     self._check_banner()
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2275, in _check_banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     raise SSHException(
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 0s to retry
[2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 03:30:43 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 1s to retry
[2023-01-31, 03:30:47 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:50 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 03:30:50 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 6s to retry
[2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 03:30:58 UTC] {taskinstance.py:1904} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 157, in execute
    with self.get_ssh_client() as ssh_client:
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 124, in get_ssh_client
    return self.get_hook().get_conn()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 232, in get_conn
    sshclient = self._connect_to_instance(user, hostname, privkey, proxy_command)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 245, in _connect_to_instance
    client.connect(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 50, in connect
    return super().connect(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 450, in connect
    self._auth(
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 781, in _auth
    raise saved_exception
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 681, in _auth
    self._transport.auth_publickey(username, pkey)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 1635, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/auth_handler.py", line 259, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.
[2023-01-31, 03:30:58 UTC] {taskinstance.py:1408} INFO - Marking task as UP_FOR_RETRY. dag_id=run_data_transfer_configs_dag, task_id=create_transfer_run_directory, execution_date=20230131T033002, start_date=20230131T033035, end_date=20230131T033058
[2023-01-31, 03:30:58 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 1418 for task create_transfer_run_directory (Authentication failed.; 21885)

What you think should happen instead

The SSH Hook operator should be able to seamlessly SSH into the GCE instance without any intermittent authentication issues

How to reproduce

No response

Operating System

Composer Kubernetes Cluster

Versions of Apache Airflow Providers

Composer version - 2.1.3
Airflow version - 2.3.4

Deployment

Composer

Deployment details

Kubernetes cluster, GCE Compute Engine VM (Ubuntu)

Anything else

Very random and intermittent

Are you willing to submit PR?

Code of Conduct

vasu2809 commented 1 year ago

Edit :

Some steps that have been tried to resolve this issue when parallel SSH hook operators are running:

  1. Increase MaxSessions and MaxStartups in the sshd_config file

  2. Provide banner timeout, connection timeout, and command timeout values in the SSH operator

    conn_timeout = 120,
    cmd_timeout = 120,
    banner_timeout = 90.0

  3. Set the expire_time value on the ComputeEngineSSHHook, e.g. expire_time = 120 (see the combined sketch right after this list)
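
To make items 2 and 3 concrete, the combined configuration we tried looks roughly like the sketch below. This is only an illustration of the parameters already shown in the operator call above; the values are the ones we experimented with and may need tuning, and the command is elided:

    create_transfer_run_directory = SSHOperator(
        task_id="create_transfer_run_directory",
        ssh_hook=ComputeEngineSSHHook(
            instance_name=GCE_INSTANCE,
            zone=GCE_ZONE,
            use_oslogin=True,
            use_iap_tunnel=False,
            use_internal_ip=True,
            expire_time=120,  # item 3: expiry for the OS Login public key on the hook
        ),
        conn_timeout=120,     # item 2: connection timeout on the operator
        cmd_timeout=120,      # item 2: command timeout on the operator
        banner_timeout=90.0,  # item 2: SSH banner timeout on the operator
        command="...",        # elided; same templated mkdir command as above
        dag=dag,
    )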

There are no errors observed in auth.log, but the Airflow DAGs show issues similar to those described in

https://github.com/paramiko/paramiko/issues/1135

Is there a need to provide disabled_algorithms in the operator with Paramiko >= 2.9, since that seems to be one of the solutions listed there?
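
For context, the workaround discussed in that paramiko issue, applied directly at the paramiko level, looks roughly like the sketch below. This is a standalone illustration, not the hook's actual code: the hostname, username, and key path are placeholders, and disabling the rsa-sha2-* public key algorithms is the mitigation suggested there for Paramiko >= 2.9:

    import paramiko

    # Placeholder values for illustration only.
    HOSTNAME = "10.128.0.29"
    USERNAME = "sa_115585236623848451866"
    KEY_PATH = "/path/to/private_key"

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        hostname=HOSTNAME,
        username=USERNAME,
        pkey=paramiko.RSAKey.from_private_key_file(KEY_PATH),
        banner_timeout=120,
        # Skip the RSA SHA-2 signature algorithms, as suggested in
        # paramiko/paramiko#1135 for Paramiko >= 2.9.
        disabled_algorithms={"pubkeys": ["rsa-sha2-256", "rsa-sha2-512"]},
    )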

vasu2809 commented 1 year ago

[2023-01-31, 05:13:35 UTC] {compute_ssh.py:236} INFO - Opening remote connection to host: username=sa_115585236623848451866, hostname=10.128.0.29
[2023-01-31, 05:13:37 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:37 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:37 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 1s to retry
[2023-01-31, 05:13:40 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:40 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:40 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 1s to retry
[2023-01-31, 05:13:43 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:43 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:43 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 3s to retry
[2023-01-31, 05:13:48 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:48 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:48 UTC] {taskinstance.py:1904} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 157, in execute
    with self.get_ssh_client() as ssh_client:
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 124, in get_ssh_client
    return self.get_hook().get_conn()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 232, in get_conn
    sshclient = self._connect_to_instance(user, hostname, privkey, proxy_command)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 245, in _connect_to_instance
    client.connect(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 50, in connect
    return super().connect(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 450, in connect
    self._auth(
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 781, in _auth
    raise saved_exception
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 681, in _auth
    self._transport.auth_publickey(username, pkey)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 1635, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/auth_handler.py", line 259, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.
[2023-01-31, 05:13:48 UTC] {taskinstance.py:1408} INFO - Marking task as UP_FOR_RETRY. dag_id=run_data_transfer_configs_dag, task_id=create_transfer_run_directory, execution_date=20230131T051306, start_date=20230131T051332, end_date=20230131T051348
[2023-01-31, 05:13:48 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 1984 for task create_transfer_run_directory (Authentication failed.; 29045)
[2023-01-31, 05:13:49 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2023-01-31, 05:13:49 UTC] {local_task_job.py:279} INFO - 0 downstream tasks scheduled from follow-on schedule check

vasu2809 commented 1 year ago

@Taragolis Can we have somebody assist with the banner timeout errors? They keep occurring in spite of setting the banner timeout parameters on the Compute Engine SSH operator.

Taragolis commented 1 year ago

@vasu2809 greetings!

Apache Airflow is an OSS project, so there are no specific people who can provide support or assistance, only people who have run into the same problem or who know about a specific part of Airflow or a specific provider.

This mostly relates to the Google provider and how it is integrated with the SSH provider. Most likely this hook doesn't propagate all settings to the SSHOperator.

I would recommend investigating this first. A lot of things in providers can be fixed locally by creating your own hooks/operators and overriding the buggy parts or extending missing capabilities. That's basically what I did in the past. The long-term solution, however, is to contribute the fix back to Airflow.
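
To make that approach concrete, a locally patched hook could look something like the sketch below. It is only a sketch under assumptions: it relies on the internal _connect_to_instance method that appears in the traceback earlier in this thread, the class name is made up, and the retry budget and backoff are arbitrary; it is not the provider's actual fix:

    import time

    from airflow.providers.google.cloud.hooks.compute_ssh import ComputeEngineSSHHook


    class RetryingComputeEngineSSHHook(ComputeEngineSSHHook):
        """Hypothetical local override adding extra retries around the connect step."""

        def _connect_to_instance(self, *args, **kwargs):
            # Retry the whole connect a few extra times with backoff to ride out
            # the intermittent banner/authentication failures seen under parallel runs.
            last_exc = None
            for attempt in range(5):  # arbitrary retry budget
                try:
                    return super()._connect_to_instance(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    self.log.info("Connect attempt %d failed: %s", attempt + 1, exc)
                    time.sleep(2 ** attempt)
            raise last_exc

Such a hook could then be passed as ssh_hook=RetryingComputeEngineSSHHook(...) in the existing SSHOperator call while a proper fix lands upstream.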

I don't feel confident with GCP and the Google provider; all I can do from my side is mark this issue as 'good first issue' and maybe someone will pick it up.

potiuk commented 1 year ago

@Taragolis Can we have somebody to assist on the banner timeout errors? Its coming inspite of maintaining banner timeout parameters in Compute Engine SSH Operator

Also, if you want assistance, consider hiring paid support. Airflow (and this public forum) is run by volunteers who help when they have time.

VladaZakharova commented 1 year ago

Hi @vasu2809! Could you please also share the details of the Compute Engine instance you are using? You also mentioned that you were trying to connect to your Compute Engine instance in parallel with other ComputeEngineSSHHook tasks. Could you please share what values you are using for the use_oslogin, use_iap_tunnel, and use_internal_ip parameters in those tasks? That would be helpful here. Thanks!