ZettaAI / zetta_utils

Outcome queue deleted before Flow completion, but no exit. #620

Open nkemnitz opened 9 months ago

nkemnitz commented 9 months ago

exec-dazzling-rose-gecko-of-perception had completed 6057/6058 invert_field (subchunkable) tasks this morning when the workers started failing because the outcome queue was missing. Edit: The scheduler was frozen and did not react to Ctrl+C; I had to kill it.

What should happen in this case: the run should fail immediately when the pods keep dying. There is no need to waste resources on an unrecoverable error.

What should also happen: the outcome queue should not be deleted before the last task is actually processed, but I can't replicate the early deletion.
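
To make the "fail immediately" idea concrete, here is a minimal sketch of a failure budget that a scheduler loop could consult. It is not taken from zetta_utils: `WorkerFailureBudget`, `UnrecoverableRunError`, and the threshold of 3 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


class UnrecoverableRunError(RuntimeError):
    """Hypothetical error: workers keep failing and retrying cannot help."""


@dataclass
class WorkerFailureBudget:
    """Counts consecutive worker failures and aborts the run past a limit."""

    max_consecutive_failures: int = 3  # assumed threshold, tune per deployment
    consecutive_failures: int = 0

    def record_success(self) -> None:
        # Any successful outcome resets the streak.
        self.consecutive_failures = 0

    def record_failure(self, reason: str) -> None:
        # Abort the whole run once the streak reaches the limit.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise UnrecoverableRunError(
                f"{self.consecutive_failures} consecutive worker failures, "
                f"last reason: {reason}"
            )
```

The scheduler would call `record_failure(...)` each time a pod dies and `record_success()` on each completed task, so a run where every restart hits the same error stops after a few attempts instead of looping.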

2024-01-24T08:50:01Z  ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
2024-01-24T08:50:01Z  │ /opt/conda/bin/zetta:8 in <module> │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ 5 from zetta_utils.cli.main import cli │
2024-01-24T08:50:01Z  │ 6 if __name__ == '__main__': │
2024-01-24T08:50:01Z  │ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
2024-01-24T08:50:01Z  │ ❱ 8 │ sys.exit(cli()) │
2024-01-24T08:50:01Z  │ 9 │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/click/core.py:1137 in __call__ │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/click/core.py:1062 in main │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/click/core.py:1668 in invoke │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/click/core.py:1404 in invoke │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/click/core.py:763 in invoke │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/cli/main.py:106 in run │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ 103 │ if parallel_builder: │
2024-01-24T08:50:01Z  │ 104 │ │ zetta_utils.builder.PARALLEL_BUILD_ALLOWED = True │
2024-01-24T08:50:01Z  │ 105 │ │
2024-01-24T08:50:01Z  │ ❱ 106 │ result = zetta_utils.builder.build(spec, parallel=parallel_builder │
2024-01-24T08:50:01Z  │ 107 │ logger.debug(f"Outcome: {pprint.pformat(result, indent=4)}") │
2024-01-24T08:50:01Z  │ 108 │ if pdb: │
2024-01-24T08:50:01Z  │ 109 │ │ breakpoint() # pylint: disable=forgotten-debug-statement # pr │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/builder/build.py:53 in build │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/builder/build.py:62 in _build │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/builder/build.py:115 in _execute_build_stages │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/builder/build.py:93 in _build_object │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/builder/build.py:83 in _build_object │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/mazepa/worker.py:63 in run_worker │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ 60 │ │ │ │ return_value=None, │
2024-01-24T08:50:01Z  │ 61 │ │ │ ) │
2024-01-24T08:50:01Z  │ 62 │ │ │ outcome_report = OutcomeReport(task_id=constants.UNKNOWN_T │
2024-01-24T08:50:01Z  │ ❱ 63 │ │ │ outcome_queue.push([outcome_report]) │
2024-01-24T08:50:01Z  │ 64 │ │ │ raise e │
2024-01-24T08:50:01Z  │ 65 │ │ │
2024-01-24T08:50:01Z  │ 66 │ │ logger.info(f"Got {len(task_msgs)} tasks.") │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/message_queues/sqs/queue.py:64 in push │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ 61 │ │ │ for e in payloads: │
2024-01-24T08:50:01Z  │ 62 │ │ │ │ tq_task = TQTask(serialization.serialize(e)) │
2024-01-24T08:50:01Z  │ 63 │ │ │ │ tq_tasks.append(tq_task) │
2024-01-24T08:50:01Z  │ ❱ 64 │ │ │ self._get_tq_queue().insert(tq_tasks, parallel=self.insert │
2024-01-24T08:50:01Z  │ 65 │ │
2024-01-24T08:50:01Z  │ 66 │ def _extend_msg_lease(self, duration_sec: int, msg: utils.SQSRecei │
2024-01-24T08:50:01Z  │ 67 │ │ utils.change_message_visibility( │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/zetta_utils/zetta_utils/message_queues/sqs/queue.py:49 in _get_tq_queue │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ 46 │ │
2024-01-24T08:50:01Z  │ 47 │ def _get_tq_queue(self) -> Any: │
2024-01-24T08:50:01Z  │ 48 │ │ if self._queue is None: │
2024-01-24T08:50:01Z  │ ❱ 49 │ │ │ self._queue = taskqueue.TaskQueue( │
2024-01-24T08:50:01Z  │ 50 │ │ │ │ self.name, │
2024-01-24T08:50:01Z  │ 51 │ │ │ │ region_name=self.region_name, │
2024-01-24T08:50:01Z  │ 52 │ │ │ │ endpoint_url=self.endpoint_url, │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/taskqueue/taskqueue.py:69 in │
2024-01-24T08:50:01Z  │ __init__ │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/taskqueue/taskqueue.py:90 in │
2024-01-24T08:50:01Z  │ initialize_api │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/taskqueue/aws_queue_api.py:58 in │
2024-01-24T08:50:01Z  │ __init__ │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:326 in │
2024-01-24T08:50:01Z  │ wrapped_f │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:406 in __call__ │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:362 in iter │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:195 in reraise │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/concurrent/futures/_base.py:451 in result │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/concurrent/futures/_base.py:403 in __get_result │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/tenacity/__init__.py:409 in __call__ │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/taskqueue/aws_queue_api.py:65 in │
2024-01-24T08:50:01Z  │ _get_qurl │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/botocore/client.py:534 in _api_call │
2024-01-24T08:50:01Z  │ │
2024-01-24T08:50:01Z  │ /opt/conda/lib/python3.10/site-packages/botocore/client.py:976 in │
2024-01-24T08:50:01Z  │ _make_api_call │
2024-01-24T08:50:01Z  ╰──────────────────────────────────────────────────────────────────────────────╯
2024-01-24T08:50:01Z  QueueDoesNotExist: An error occurred (AWS.SimpleQueueService.NonExistentQueue)
2024-01-24T08:50:01Z  when calling the GetQueueUrl operation: The specified queue does not exist for
2024-01-24T08:50:01Z  this wsdl version.
2024-01-24T08:50:01Z  Exception occured while building "spec" (mapped to "run_worker" from module
2024-01-24T08:50:01Z  "zetta_utils.mazepa.worker")
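
The traceback shows the worker's error handler itself failing while pushing an outcome report to the missing queue. One possible way to handle that, sketched below under assumptions, is to detect the missing-queue error and exit with a dedicated code. `push_or_exit`, `is_missing_queue_error`, and `EXIT_UNRECOVERABLE` are hypothetical and not part of zetta_utils; the error code string is the one printed in this log, and the check relies on botocore's `ClientError` exposing `response["Error"]["Code"]`.

```python
import sys
from typing import Any, Callable

# Error code as it appears in the log above.
NONEXISTENT_QUEUE_CODE = "AWS.SimpleQueueService.NonExistentQueue"
# Hypothetical exit code that a supervisor could treat as "do not restart".
EXIT_UNRECOVERABLE = 78


def is_missing_queue_error(exc: BaseException) -> bool:
    """Return True if exc looks like a ClientError for a missing SQS queue."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == NONEXISTENT_QUEUE_CODE


def push_or_exit(push: Callable[[list], Any], reports: list) -> None:
    """Push outcome reports; exit with a dedicated code if the queue is gone."""
    try:
        push(reports)
    except Exception as exc:
        if is_missing_queue_error(exc):
            # Retrying cannot bring the queue back, so stop this worker.
            sys.exit(EXIT_UNRECOVERABLE)
        raise
```

Whether the pods then stop restarting depends on how the cluster is configured to interpret that exit code, which this sketch does not cover.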

nkemnitz commented 9 months ago

The "no deletion" part is resolved: GC was not enabled for that project. The cause of the frozen scheduler is still unknown.