Closed vasu2809 closed 1 year ago
Edit:
Some steps that have been tried to resolve this issue when parallel SSH hook operators are running:
- Increased MaxSessions and MaxStartups in the sshd_config file
- Provided banner timeout, connection timeout, and command timeout values in the SSH operator:
  conn_timeout = 120,
  cmd_timeout = 120,
  banner_timeout = 90.0
- Set the expire_time value in the ComputeEngineSSHHook (expire_time = 120)
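For context on the MaxStartups setting mentioned above: when it is given in the `start:rate:full` form, sshd starts dropping new unauthenticated connections at random once `start` of them are pending, and drops all of them at `full`. A dropped connection is closed before the banner is sent, which a client may report as a banner read error. The following is a minimal sketch of that calculation, assuming the OpenSSH default of 10:30:100; the function name is hypothetical and the values on the actual VM are not known from this thread.

```python
def drop_probability(unauthenticated: int, start: int = 10,
                     rate: int = 30, full: int = 100) -> float:
    """Approximate chance that sshd drops a new connection.

    Models MaxStartups start:rate:full. The defaults (10:30:100) are the
    OpenSSH defaults and are an assumption about the target VM.
    """
    if unauthenticated < start:
        return 0.0  # below the threshold nothing is dropped
    if unauthenticated >= full or rate == 100:
        return 1.0  # at or above "full" every new connection is dropped
    # Between start and full the percentage grows linearly from rate to 100.
    percent = rate + (100 - rate) * (unauthenticated - start) // (full - start)
    return percent / 100.0


if __name__ == "__main__":
    for pending in (5, 10, 55, 100):
        print(pending, drop_probability(pending))
```

If many tasks open connections at the same moment, raising `start` and `full` (or lowering the number of simultaneous tasks) reduces the chance of a random drop under this model.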
No errors are observed in auth.log, but the Airflow DAGs show issues similar to those described in
https://github.com/paramiko/paramiko/issues/1135
Do we need to pass disabled algorithms to the operator with Paramiko >= 2.9? That seems to be one of the solutions listed there.
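For reference, the workaround discussed for Paramiko >= 2.9 is to pass `disabled_algorithms` to `SSHClient.connect` so that the client falls back from the `rsa-sha2-*` public key signature algorithms. The sketch below only shows what that call could look like with plain Paramiko; the helper names are hypothetical, and whether ComputeEngineSSHHook exposes this option is not confirmed in this thread. Since the logs show the server as OpenSSH_8.9p1, which supports `rsa-sha2-*`, this may well not be the cause here.

```python
from typing import Any, Dict, List

# Signature algorithms that the workaround disables for public key auth.
RSA_SHA2_PUBKEY_ALGORITHMS: List[str] = ["rsa-sha2-256", "rsa-sha2-512"]


def build_connect_kwargs(hostname: str, username: str,
                         banner_timeout: float = 90.0,
                         disable_rsa_sha2: bool = True) -> Dict[str, Any]:
    """Build keyword arguments for paramiko.SSHClient.connect (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "hostname": hostname,
        "username": username,
        "banner_timeout": banner_timeout,
        "look_for_keys": False,  # only use the key passed explicitly
    }
    if disable_rsa_sha2:
        kwargs["disabled_algorithms"] = {
            "pubkeys": list(RSA_SHA2_PUBKEY_ALGORITHMS)
        }
    return kwargs


def connect(pkey: Any, **kwargs: Any) -> Any:
    """Open a connection; paramiko is imported lazily so the helper above
    can be inspected without it installed."""
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(pkey=pkey, **kwargs)
    return client
```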
[2023-01-31, 05:13:35 UTC] {compute_ssh.py:236} INFO - Opening remote connection to host: username=sa_115585236623848451866, hostname=10.128.0.29
[2023-01-31, 05:13:37 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:37 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:37 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 1s to retry
[2023-01-31, 05:13:40 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:40 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:40 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 1s to retry
[2023-01-31, 05:13:43 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:43 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:43 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 3s to retry
[2023-01-31, 05:13:48 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 05:13:48 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 05:13:48 UTC] {taskinstance.py:1904} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 157, in execute
    with self.get_ssh_client() as ssh_client:
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 124, in get_ssh_client
    return self.get_hook().get_conn()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 232, in get_conn
    sshclient = self._connect_to_instance(user, hostname, privkey, proxy_command)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 245, in _connect_to_instance
    client.connect(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 50, in connect
    return super().connect(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 450, in connect
    self._auth(
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 781, in _auth
    raise saved_exception
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 681, in _auth
    self._transport.auth_publickey(username, pkey)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 1635, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/auth_handler.py", line 259, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.
[2023-01-31, 05:13:48 UTC] {taskinstance.py:1408} INFO - Marking task as UP_FOR_RETRY. dag_id=run_data_transfer_configs_dag, task_id=create_transfer_run_directory, execution_date=20230131T051306, start_date=20230131T051332, end_date=20230131T051348
[2023-01-31, 05:13:48 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 1984 for task create_transfer_run_directory (Authentication failed.; 29045)
[2023-01-31, 05:13:49 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2023-01-31, 05:13:49 UTC] {local_task_job.py:279} INFO - 0 downstream tasks scheduled from follow-on schedule check
@Taragolis Can we have somebody assist with the banner timeout errors? They occur despite setting the banner timeout parameters in the Compute Engine SSH operator.
@vasu2809 greetings!
Apache Airflow is an OSS project, so there are no specific people who could support or assist, only people who run into the same problem or who know a specific part of Airflow or a specific provider.
This mostly looks like an issue with the Google provider and how it integrates with the SSH provider. Most likely this hook doesn't propagate all settings to SSHOperator.
I would recommend investigating this first. A lot of things in providers can be fixed locally by creating your own hooks/operators and overriding the buggy parts or adding the missing capabilities. That's basically what I did in the past. However, the long-term solution is to contribute the fix back to Airflow.
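As one illustration of the kind of local fix meant here, a custom hook could wrap its connection attempt in a retry with exponential backoff and jitter, so that parallel tasks do not all retry at the same moment. This is only a sketch: `connect_with_backoff` and its parameters are hypothetical names, not part of Airflow or the Google provider, and the callable passed in stands in for whatever your own hook uses to open the connection.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call connect() until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, scaled by a random
            # factor so concurrent tasks spread out their retries.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * rng())
    raise AssertionError("unreachable")
```

In a subclassed hook, the overridden connection method could call `connect_with_backoff(lambda: client.connect(...))`, with `retry_on` narrowed to the SSH exceptions you actually want to retry.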
I'm not confident with GCP and the Google provider, so all I can do from my side is mark this issue as 'good first issue'; maybe someone will pick it up.
Also, if you want assistance, consider hiring paid support. Airflow (and this public forum) is run by volunteers, when they have time.
Hi @vasu2809! Could you please also share the details of the Compute Engine instance you are using? You also mentioned that you were trying to connect to your Compute Engine instance in parallel with other ComputeEngineSSHHook tasks. Could you please share the values you are using for the use_oslogin, use_iap_tunnel and use_internal_ip parameters in those tasks? That would be helpful here. Thanks!
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
We are using ComputeEngineSSHHook for some of our Airflow DAGs in Cloud Composer.
Everything works fine when the DAGs run one by one.
But when we run in parallel, with multiple tasks trying to connect to our GCE instance using ComputeEngineSSHHook at the same time,
we experience intermittent errors like the one given below.
Since Cloud Composer has 3 retries by default, the issue sometimes resolves itself on the second or third attempt, but we would like to understand why it occurs in the first place when multiple operators are trying to generate keys and SSH into the GCE instance.
We have tried setting the banner_timeout and expire_time parameters on the DAG task, but we still see this issue.
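from airflow.providers.google.cloud.hooks.compute_ssh import ComputeEngineSSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

# GCE_INSTANCE, GCE_ZONE, transfer_run_directory and dag are defined elsewhere in the DAG file.
create_transfer_run_directory = SSHOperator(
    task_id="create_transfer_run_directory",
    ssh_hook=ComputeEngineSSHHook(
        instance_name=GCE_INSTANCE,
        zone=GCE_ZONE,
        use_oslogin=True,
        use_iap_tunnel=False,
        use_internal_ip=True,
    ),
    conn_timeout=120,
    cmd_timeout=120,
    banner_timeout=120.0,
    command=f"sudo mkdir -p {transfer_run_directory}/"
    '{{ ti.xcom_pull(task_ids="load_config", key="transfer_id") }}',
    dag=dag,
)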
[2023-01-31, 03:30:39 UTC] {compute_ssh.py:286} INFO - Importing SSH public key using OSLogin: user=edw-sa-gcc@pso-e2e-sql.iam.gserviceaccount.com
[2023-01-31, 03:30:39 UTC] {compute_ssh.py:236} INFO - Opening remote connection to host: username=sa_115585236623848451866, hostname=10.128.0.29
[2023-01-31, 03:30:41 UTC] {transport.py:1874} ERROR - Exception (client): Error reading SSH protocol banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most recent call last):
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2271, in _check_banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     buf = self.packetizer.readline(timeout)
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 380, in readline
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     buf += self._read_timeout(timeout)
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/packet.py", line 609, in _read_timeout
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     raise EOFError()
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - EOFError
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - During handling of the above exception, another exception occurred:
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - Traceback (most recent call last):
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2094, in run
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     self._check_banner()
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -   File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 2275, in _check_banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -     raise SSHException(
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR - paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
[2023-01-31, 03:30:41 UTC] {transport.py:1872} ERROR -
[2023-01-31, 03:30:41 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 0s to retry
[2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:43 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 03:30:43 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 1s to retry
[2023-01-31, 03:30:47 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:50 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 03:30:50 UTC] {compute_ssh.py:258} INFO - Failed to connect. Waiting 6s to retry
[2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2023-01-31, 03:30:58 UTC] {transport.py:1874} INFO - Authentication (publickey) failed.
[2023-01-31, 03:30:58 UTC] {taskinstance.py:1904} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 157, in execute
    with self.get_ssh_client() as ssh_client:
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 124, in get_ssh_client
    return self.get_hook().get_conn()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 232, in get_conn
    sshclient = self._connect_to_instance(user, hostname, privkey, proxy_command)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 245, in _connect_to_instance
    client.connect(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/compute_ssh.py", line 50, in connect
    return super().connect(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 450, in connect
    self._auth(
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 781, in _auth
    raise saved_exception
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/client.py", line 681, in _auth
    self._transport.auth_publickey(username, pkey)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/transport.py", line 1635, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
  File "/opt/python3.8/lib/python3.8/site-packages/paramiko/auth_handler.py", line 259, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.
[2023-01-31, 03:30:58 UTC] {taskinstance.py:1408} INFO - Marking task as UP_FOR_RETRY. dag_id=run_data_transfer_configs_dag, task_id=create_transfer_run_directory, execution_date=20230131T033002, start_date=20230131T033035, end_date=20230131T033058
[2023-01-31, 03:30:58 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 1418 for task create_transfer_run_directory (Authentication failed.; 21885)
What you think should happen instead
The SSH hook/operator should be able to SSH into the GCE instance without intermittent authentication failures, even when several tasks connect at the same time.
How to reproduce
No response
Operating System
Composer Kubernetes Cluster
Versions of Apache Airflow Providers
Composer Version - 2.1.3 Airflow version - 2.3.4
Deployment
Composer
Deployment details
Kubernetes Cluster GCE Compute Engine VM (Ubuntu)
Anything else
The failures are random and intermittent.
Are you willing to submit PR?
Code of Conduct