flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.39k stars · 577 forks

[BUG] Flyte List objects should be offloaded or offloadable. Also In case of large outputs, propeller should return an error #381

Open kumare3 opened 4 years ago

kumare3 commented 4 years ago

Describe the bug

For an example workflow:

{"json":{"exec_id":"z4b5bbelyu","node":"write-task","ns":"...-development","res_ver":"1085530040","routine":"worker-357","src":"plugin_manager.go:257","tasktype":"spark","wf":"...."},"level":"info","msg":"Plugin [spark] returned no outputReader, assuming file based outputs","ts":"2020-07-08T21:11:03Z"}
{"json":{"exec_id":"z4b5bbelyu","node":"write-task","ns":"...-development","res_ver":"1085530040","routine":"worker-357","src":"pre_post_execution.go:113","tasktype":"spark","wf":".."},"level":"error","msg":"Failed to check if the output file exists. Error: error file @[s3://..../error.pb] is too large [13117404] bytes, max allowed [10485760] bytes","ts":"2020-07-08T21:11:03Z"}

The output here is actually too large, but instead of failing immediately, the workflow gets stuck, unable to load the file.

Expected behavior

The failure should be propagated to flyteadmin along with the size restriction.


Additional context

The workflow continues to run and eventually results in a system error, even though this is actually a user error.

yindia commented 3 years ago

@kumare3 Assign to me

kumare3 commented 3 years ago

I think this is resolved, but I will do a round of testing.

sonjaer commented 2 years ago

Also In case of large outputs, propeller should return an error

We experienced this and did not get any user-facing error message in flyteconsole; only after checking the propeller logs did we see what was wrong:

"Failed to check if the output file exists. Error: error file @[gs://flytepropeller-production-storage/metadata/propeller/styx-garbage-collection-production-gdswap6hpgrdprpcdm2i/n1/data/0/error.pb] is too large [39611420] bytes, max allowed [10485760] bytes"

The pod completed successfully, but flyteconsole was showing Running and [ContainersNotReady|ContainerCreating]:containers with unready status: [evvtse3zew5gmjd6pzxr-n1-0]|

We reduced the size of the task output, but this made the code a bit harder to read (we could also write the data to our own file in GCS from within the task). What are the guidelines/tradeoffs regarding the size of inputs/outputs? We currently have it set to 10MB in our config; is that the recommended size?
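One workaround, until propeller surfaces this as a user error, is to guard the output size inside the task so it fails fast with a readable message instead of hanging. This is a hypothetical sketch, not a Flyte API: `check_output_size` and `MAX_OUTPUT_BYTES` are names I made up, and the 10 MiB constant only mirrors propeller's default limit seen in the logs above. For genuinely large payloads, offloading to blob storage (e.g. returning a `FlyteFile` instead of an inline string) avoids the limit entirely.

```python
# Hypothetical guard against propeller's inline-output limit.
# 10485760 bytes (10 MiB) is the default seen in the error logs above.
MAX_OUTPUT_BYTES = 10 * 1024 * 1024

def check_output_size(payload: str, limit: int = MAX_OUTPUT_BYTES) -> str:
    """Raise a clear user error if the output is too large to emit inline."""
    size = len(payload.encode("utf-8"))
    if size > limit:
        raise ValueError(
            f"output is {size} bytes, which exceeds the {limit}-byte inline "
            "limit; offload it to blob storage (e.g. a FlyteFile) instead"
        )
    return payload
```

Calling this at the end of a task turns the silent hang into an immediate, attributable task failure.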

github-actions[bot] commented 1 year ago

Hello πŸ‘‹, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! πŸ™

github-actions[bot] commented 1 year ago

Hello πŸ‘‹, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! πŸ™

github-actions[bot] commented 1 month ago

Hello πŸ‘‹, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! πŸ™

nikp1172 commented 2 weeks ago

This issue still seems to be present in v1.13.0.

Any update on this?

nikp1172 commented 1 week ago

Adding a reproducible example:

from flytekit import task, workflow, ImageSpec

normal_image = ImageSpec(
    base_image="python:3.9-slim",
    packages=["flytekit==1.10.3"],
    registry="ttl.sh",
    name="skdjbKBJ1341-normal",
    source_root="..",
)

@task(container_image=normal_image)
def print_arrays(arr1: str) -> None:
    print(f"Array 1: {arr1}")

@task(container_image=normal_image)
def increase_size_of_of_arrays(n: int) -> str:
    # Build a string of n KiB
    arr1 = 'a' * n * 1024
    return arr1

# Workflow: orchestrate the tasks
@workflow
def simple_pipeline(n: int) -> int:
    arr1 = increase_size_of_of_arrays(n=n)
    print_arrays(arr1=arr1)
    return 2

# Runs the pipeline locally
if __name__ == "__main__":
    result = simple_pipeline(n=5)
I just verified this running in the Flyte sandbox as well. Register the above file:

pyflyte --pkgs limit_eg package -f --source .
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version 1

Then run it from the UI with n=11000 (the output size will be around 11 MB). The workflow is now stuck in Running forever.

I can only see the error in the propeller logs:

[failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: output file @[s3://my-s3-bucket/metadata/propeller/flytesnacks-development-at9h4tkhqx5fjbp6sbfm/n0/data/0/outputs.pb] is too large [11264029] bytes, max allowed [10485760] bytes]. Error Type[*errors.NodeErrorWithCause]","ts":"2024-08-25T10:53:09Z"}
E0825 10:53:09.978923       1 workers.go:103] error syncing 'flytesnacks-development/at9h4tkhqx5fjbp6sbfm': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: output file @[s3://my-s3-bucket/metadata/propeller/flytesnacks-development-at9h4tkhqx5fjbp6sbfm/n0/data/0/outputs.pb] is too large [11264029] bytes, max allowed [10485760] bytes

These are not propagated as an event to flyteadmin, and hence the execution is stuck forever without any error message in flyteconsole.
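For anyone hitting the size limit itself (as opposed to the stuck-state bug), the 10485760-byte threshold in these logs corresponds to propeller's `max-output-size-bytes` setting, which can be raised in the flytepropeller configuration. This is a sketch assuming that key name; verify it against the configuration reference for your deployed version. Note that raising the limit only postpones the problem:

```yaml
propeller:
  # Default is 10485760 (10 MiB). Raising it increases metadata-store
  # pressure, so offloading large outputs to blob storage is preferred.
  max-output-size-bytes: 20971520
```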

wild-endeavor commented 4 days ago

@nikp1172, please take a look at this RFC and the PRs linked from it. We're approaching this as a two-part project. The first part focuses solely on offloading large lists that are the output of a map task; the later parts will extend to all large values, even large scalars. Unfortunately, the issue reproduced above will not be addressed in the first part of the project, but please follow along if you'd like.

wild-endeavor commented 4 days ago

I didn't get a chance to run this locally, but the error message is being returned by this exists() call, invoked here, which calls that exists function. @pvditt, are you familiar with this part of the handler? Is there a quick and safe way to mark the error received here, inside that ValidateOutput() function, as a non-retriable error?