flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.42k stars, 581 forks

[BUG] AWS batch job failed and plugin report failed but flyte console shows task still running #2979

Open jw0515 opened 1 year ago

jw0515 commented 1 year ago

Describe the bug

When an exception is raised inside a task, Flyte catches the error, so the AWS Batch job ends in the SUCCEEDED state while the Flyte AWS Batch plugin reports the caught error back. As a result, clicking the running task on the execution page and opening its "Map Execution" tab shows that the AWS Batch job failed. On the execution page itself, however, the task's status stays "running" and never changes.

The only way to stop the execution is to abort it.

Expected behavior

Once the exception is raised, even though Flyte catches it, the AWS Batch job status should go to "FAILED", the Flyte task should fail, and the execution should stop.
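To illustrate the expectation above: AWS Batch marks a job FAILED when its container exits with a non-zero code, so a wrapper that catches and only logs the error ends up reporting success. The following is a minimal sketch of that difference, not Flyte's actual entrypoint; `failing_task`, `run_and_swallow`, and `run_and_propagate` are hypothetical names used only for illustration.

```python
import logging

logger = logging.getLogger("entrypoint_sketch")


def failing_task() -> None:
    """Hypothetical stand-in for user code that raises."""
    raise ValueError("inference failed")


def run_and_swallow(fn) -> int:
    """Catch and log the error, then still report exit code 0.

    With this pattern the container exits cleanly, so the batch job
    would end up SUCCEEDED even though the task failed.
    """
    try:
        fn()
    except Exception:
        logger.exception("task raised")
    return 0


def run_and_propagate(fn) -> int:
    """Log the error and report a non-zero exit code.

    A non-zero exit code is what lets the batch job end up FAILED.
    """
    try:
        fn()
    except Exception:
        logger.exception("task raised")
        return 1
    return 0


if __name__ == "__main__":
    print("swallow:", run_and_swallow(failing_task))
    print("propagate:", run_and_propagate(failing_task))
```

In a real entrypoint the returned code would be passed to `sys.exit`, so the second variant is the behavior this report asks for.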

Additional context to reproduce

from typing import List

from flytekit import Resources, task, workflow
from flytekitplugins.awsbatch import AWSBatchConfig

# InferenceInput, prepare_inference_inputs, and inference are defined elsewhere (not shown).

config = AWSBatchConfig(
    platformCapabilities="EC2",
)


@task(requests=Resources(mem="16Gi", cpu="8"), task_config=config)
def batch_inference(inference_inputs: List[InferenceInput]) -> int:
    # pool = multiprocessing.Pool()
    # pool.map(inference, inference_inputs)
    for inference_input in inference_inputs:
        inference(inference_input)
    return 0


@workflow
def batch_inference_pipeline(model_path: str, scaler_path: str) -> int:
    inference_inputs = prepare_inference_inputs(model_path=model_path, scaler_path=scaler_path)
    batch_inference(inference_inputs=inference_inputs)
    return 0

https://flyte-org.slack.com/archives/C01P3B761A6/p1664980783943249

Screenshots

image image

This shows that Flyte catches and logs the error but does not re-raise the exception in AWS Batch: image

welcome[bot] commented 1 year ago

Thank you for opening your first issue here! 🛠

jw0515 commented 1 year ago

OK, it seems the Flyte console hangs regardless of whether the batch job ends in the SUCCEEDED or FAILED state. This time the exception is thrown rather than caught inside the batch job, so the batch job goes to the "FAILED" state, yet the Flyte console still shows the task as running.

image

@pingsutw
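The expectation in the comment above can be stated as a small status mapping: both SUCCEEDED and FAILED are terminal AWS Batch job statuses, so either one should move the task out of "running". The sketch below is illustrative only and is not the plugin's code; `TaskPhase`, `phase_for`, and `is_terminal` are hypothetical names, and only the AWS Batch status strings are taken from AWS.

```python
from enum import Enum


class TaskPhase(Enum):
    """Hypothetical task phases, used only for this illustration."""
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# AWS Batch job statuses mapped to the hypothetical phases above.
_BATCH_TO_PHASE = {
    "SUBMITTED": TaskPhase.QUEUED,
    "PENDING": TaskPhase.QUEUED,
    "RUNNABLE": TaskPhase.QUEUED,
    "STARTING": TaskPhase.RUNNING,
    "RUNNING": TaskPhase.RUNNING,
    "SUCCEEDED": TaskPhase.SUCCEEDED,
    "FAILED": TaskPhase.FAILED,
}


def phase_for(batch_status: str) -> TaskPhase:
    """Translate an AWS Batch job status into a task phase."""
    try:
        return _BATCH_TO_PHASE[batch_status.upper()]
    except KeyError:
        raise ValueError(f"unknown AWS Batch status: {batch_status!r}") from None


def is_terminal(phase: TaskPhase) -> bool:
    """A terminal phase means the task must no longer be shown as running."""
    return phase in (TaskPhase.SUCCEEDED, TaskPhase.FAILED)
```

Under this mapping, a job reported as either "SUCCEEDED" or "FAILED" yields a terminal phase, which is the point at which the console would be expected to stop showing the task as running.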

pingsutw commented 1 year ago

@jw0515 Thanks, I'm looking into this issue. I'll get back to you once I know how to address it.

github-actions[bot] commented 10 months ago

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] commented 10 months ago

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] commented 1 month ago

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏