flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.77k stars, 658 forks

[BUG] Exception thrown when waiting for execution to finish #5349

Open ggydush opened 6 months ago

ggydush commented 6 months ago

Describe the bug

The following exception is sometimes thrown when executing a workflow that contains a dynamic workflow which spins up multiple tasks:

           elif e.code() == grpc.StatusCode.NOT_FOUND:
>               raise FlyteEntityNotExistException() from e
E               flytekit.exceptions.user.FlyteEntityNotExistException: USER:EntityNotExist: error=None, cause=<_InactiveRpcError of RPC that terminated with:
E                       status = StatusCode.NOT_FOUND
E                       details = "failed to fetch object: rpc error: code = NotFound desc = request failed with status code 404. Body: {"code":5,"message":"object 's3://union-compute-us-east-1-delvebio/metadata/propeller/clinical-pipeline-development-f76e3bd80dbf548ae8dc/n1/data/0/n3/0/outputs.pb' not found","details":[]}"
E                       debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-05-10T12:35:04.72984-07:00", grpc_status:5, grpc_message:"failed to fetch object: rpc error: code = NotFound desc = request failed with status code 404. Body: {\"code\":5,\"message\":\"object \'s3://union-compute-us-east-1-delvebio/metadata/propeller/clinical-pipeline-development-f76e3bd80dbf548ae8dc/n1/data/0/n3/0/outputs.pb\' not found\",\"details\":[]}"}"
E               >

Expected behavior

No exception should be raised while waiting for the execution to finish.

Additional context to reproduce

execution = remote.execute(
    name,
    inputs=inputs,
    version=version or None,
    **kwargs,
)
remote.wait(execution)

runllm[bot] commented 6 months ago

The FlyteEntityNotExistException you're encountering suggests that the system cannot locate a specific object, most likely because the dynamic task's outputs are not at the expected location in S3. The object may not have been created yet, or it may have been deleted before retrieval. This is similar to the problems discussed in Flyte GitHub issue #3639, where dynamic tasks sometimes lead to missing entities and cause failures when the system expects those entities to exist.
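
If the 404 is transient (for example, the outputs.pb object has not been written yet when the client asks for it), one possible client-side workaround is to retry the wait call. The following is only a minimal sketch under that assumption, not a fix for the underlying bug. `wait_with_retry` and its parameters are hypothetical names introduced for illustration and are not part of flytekit.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def wait_with_retry(
    wait_fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call wait_fn, retrying when it raises one of the retry_on exceptions.

    Hypothetical helper: it assumes the error is transient and that calling
    wait_fn again is safe. The delay grows linearly with the attempt number.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return wait_fn()
        except retry_on:
            if attempt == attempts:
                # Out of retries: surface the original exception unchanged.
                raise
            sleep(delay_seconds * attempt)
    raise AssertionError("unreachable")
```

With the snippet from the report, this could be used as `wait_with_retry(lambda: remote.wait(execution), retry_on=(FlyteEntityNotExistException,))`, importing the exception from `flytekit.exceptions.user` as shown in the traceback. If the object is never written, the retries are exhausted and the original exception is raised again.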

Would you like more detailed information on potential causes and solutions for this issue?

