astronomer / astronomer-providers

Airflow Providers containing Deferrable Operators & Sensors from Astronomer
https://astronomer-providers.rtfd.io/
Apache License 2.0

Redshift cluster management DAG is failing #941

Closed: pankajkoti closed this 1 year ago

pankajkoti commented 1 year ago

Redshift cluster management DAG is failing with the below errors:


[2023-04-03, 00:23:41 UTC] {taskinstance.py:1769} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/amazon/aws/operators/redshift_cluster.py", line 327, in execute
    self.redshift_hook.get_conn().get_waiter("snapshot_available").wait(
  File "/usr/local/lib/python3.9/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/waiter.py", line 388, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter SnapshotAvailable failed: Max attempts exceeded
[2023-04-03, 00:25:46 UTC] {taskinstance.py:1769} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/amazon/aws/operators/redshift_cluster.py", line 320, in execute
    self.redshift_hook.create_cluster_snapshot(
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/amazon/aws/hooks/redshift_cluster.py", line 171, in create_cluster_snapshot
    response = self.get_conn().create_cluster_snapshot(
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ClusterSnapshotAlreadyExistsFault: An error occurred (ClusterSnapshotAlreadyExists) when calling the CreateClusterSnapshot operation: Cannot create the snapshot because a snapshot with the identifier astro-providers-cluster-snapshot already exists.

pankajkoti commented 1 year ago

In one of the recent runs of the DAG, the preceding redshift_sensor task got stuck in the deferred state and did not respect the timeout parameter available on Airflow sensors. As a result, the subsequent delete_snapshot and delete_cluster tasks were never triggered, which presumably left the snapshot from that run in place. I have now marked the stuck task as failed, and the DAG is running fine again. This looks like a one-off infra issue, so I am closing the ticket. If we observe it again, we can reopen this ticket and also file a Zendesk ticket.
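
For reference, here is a minimal sketch of how a snapshot step could tolerate a snapshot left over from a run whose cleanup tasks never fired. It is not what the example DAG or the Amazon provider does. The helper name `ensure_snapshot` and the `delay` and `max_attempts` defaults are made up for illustration. The `client` argument is assumed to behave like a boto3 Redshift client, and the error code is the one shown in the traceback above.

```python
def ensure_snapshot(client, cluster_id, snapshot_id, delay=15, max_attempts=40):
    """Create a Redshift snapshot unless the identifier is already taken,
    then block until the snapshot is reported as available.

    Returns True if this call created the snapshot, False if it already existed.
    `ensure_snapshot` is a hypothetical helper, not part of any provider package.
    """
    try:
        client.create_cluster_snapshot(
            SnapshotIdentifier=snapshot_id,
            ClusterIdentifier=cluster_id,
        )
        created = True
    except Exception as exc:
        # botocore client errors carry the AWS error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code != "ClusterSnapshotAlreadyExists":
            raise  # anything else is a real failure
        created = False

    # Explicit WaiterConfig so the polling budget is visible and tunable;
    # the values used here are illustrative, not recommendations.
    client.get_waiter("snapshot_available").wait(
        SnapshotIdentifier=snapshot_id,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    return created
```

With a real client this would be called as `ensure_snapshot(boto3.client("redshift"), cluster_id, snapshot_id)`. Reusing an existing snapshot is only appropriate if a stale snapshot is acceptable for the workflow; otherwise the leftover snapshot should be deleted first.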