NASA-PDS / nucleus

Nucleus is a software platform used to create workflows for the Planetary Data System (PDS).
https://nasa-pds.github.io/nucleus
Apache License 2.0

Nucleus Airflow ECS Operator failed to launch ECS task with error "capacity is unavailable at this time" #86

Closed ramesh-maddegoda closed 2 months ago

ramesh-maddegoda commented 4 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When Nucleus tried to process a large amount of data (a PDS data directory several terabytes in size), the Airflow ECS Operator of Nucleus failed with the following error.

airflow.providers.amazon.aws.exceptions.EcsOperatorError: {'tasks': [], 'failures': [{'reason': 'Capacity is unavailable at this time. Please try again later or in a different availability zone'}], 'ResponseMetadata': {'RequestId': 'bac90ba4-ca6d-4305-a04e-a08d95750de8', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'bac90ba4-ca6d-4305-a04e-a08d95750de8', 'content-type': 'application/x-amz-json-1.1', 'content-length': '135', 'date': 'Thu, 22 Feb 2024 08:03:23 GMT'}, 'RetryAttempts': 0}}
[2024-02-22, 08:03:24 UTC] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=PDS_Basic_Registry_Use_Case_Messenger-jplaws, task_id=Validate_Products, execution_date=20240222T080314, start_date=20240222T080323, end_date=20240222T080324
[2024-02-22, 08:03:24 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 76966 for task Validate_Products ({'tasks': [], 'failures': [{'reason': 'Capacity is unavailable at this time. Please try again later or in a different availability zone'}], 'ResponseMetadata': {'RequestId': 'bac90ba4-ca6d-4305-a04e-a08d95750de8', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'bac90ba4-ca6d-4305-a04e-a08d95750de8', 'content-type': 'application/x-amz-json-1.1', 'content-length': '135', 'date': 'Thu, 22 Feb 2024 08:03:23 GMT'}, 'RetryAttempts': 0}}; 15455)
[2024-02-22, 08:03:24 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-02-22, 08:03:24 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
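
The failure reason in the response above asks the caller to try again later, so one possible mitigation is to retry the launch with exponential backoff. The following is only a minimal sketch under stated assumptions, not what Nucleus does: `run_with_capacity_retry`, `CAPACITY_MARKER`, and the injected `run_task` callable are hypothetical names introduced for illustration. The response shape (`tasks`, `failures`, `reason`) is taken from the log above.

```python
import time
from typing import Any, Callable, Dict

# Substring of the failure reason seen in the log above.
CAPACITY_MARKER = "Capacity is unavailable"


def run_with_capacity_retry(
    run_task: Callable[[], Dict[str, Any]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call run_task, retrying with exponential backoff on capacity failures.

    run_task is a hypothetical zero-argument callable that returns a
    RunTask-style response dict. Responses with no capacity failure
    (including other kinds of failure) are returned to the caller unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        response = run_task()
        failures = response.get("failures", [])
        capacity_failure = any(
            CAPACITY_MARKER in failure.get("reason", "") for failure in failures
        )
        if not capacity_failure:
            return response
        if attempt < max_attempts:
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"ECS capacity still unavailable after {max_attempts} attempts"
    )
```

With boto3, `run_task` could wrap a call such as `ecs_client.run_task(cluster=..., taskDefinition=...)`. Inside Airflow, a simpler alternative may be the standard `retries` and `retry_delay` operator arguments, which re-run the whole task instead of only the launch call.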

🕵️ Expected behavior

I expected Nucleus to process a large amount of data (a PDS data directory several terabytes in size) without failing.

📜 To Reproduce

Trigger the Nucleus workflow PDS_Basic_Registry_Use_Case_Messenger-jplaws more than 1000 times in a short period.

🖥 Environment Info

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

jordanpadams commented 2 months ago

No longer applicable due to change in architecture.