flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.47k stars 584 forks source link

[Core feature] UX improvement: `pyflyte run` includes imported local modules in the target workflow file #5492

Open cosmicBboy opened 3 months ago

cosmicBboy commented 3 months ago

Motivation: Why do you think this is important?

Say I have the following flytekit project folder structure:

/my_project
    /src
        __init__.py
        module_a.py
        ... # etc
    tasks.py  # imports modules from src
    workflows.py  # imports modules from task
    setup.py

Currently, for this workflow to run correctly with pyflyte run --remote, I have to pass in the --copy-all flag to make sure that all the source files are present in the container running the workflow.

This isn't a great beginner UX: I would expect a local pyflyte run and remote pyflyte run --remote to "just work".

Goal: What should the final outcome look like, ideally?

Today, pyflyte run --remote workflow.py will fast-register the workflow, meaning that the workflow.py file itself will be shipped as a tar file and injected into the container running in the remote pod.

Ideally, pyflyte run should also automatically detect all of the locally defined modules that are being used in workflow.py and include it in the tar file being fast registered.

Describe alternatives you've considered

Keep the status quo, which is to require users to provide the --copy-all flag. There's also another issue proposing a --copy flag (see here) which proposes a more targeted version of the --copy-all flag.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

runllm[bot] commented 3 months ago

Based on the available data sources, here is the issue title and body for the UX improvement:


Issue Title: [Core feature] UX improvement: pyflyte run includes imported local modules in the target workflow file

Issue Body:

Motivation: Why do you think this is important?

Say I have the following Flytekit project folder structure:

/my_project
    /src
        __init__.py
        module_a.py
        ... # etc
    tasks.py  # imports modules from src
    workflows.py  # imports modules from task
    setup.py

Currently, for this workflow to run correctly with pyflyte run --remote, I have to pass in the --copy-all flag to make sure that all the source files are present in the container running the workflow.

This isn't a great beginner UX: I would expect a local pyflyte run and remote pyflyte run --remote to "just work".

Goal: What should the final outcome look like, ideally?

Today, pyflyte run --remote workflow.py will fast-register the workflow, meaning that the workflow.py file itself will be shipped as a tar file and injected into the container running in the remote pod.

Ideally, pyflyte run should also automatically detect all of the locally defined modules that are being used in workflow.py and include them in the tar file being fast registered.

Describe alternatives you've considered

Keep the status quo, which is to require users to provide the --copy-all flag. There's also another issue proposing a --copy flag (see here) which proposes a more targeted version of the --copy-all flag.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?


Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!

thomasjpfan commented 3 months ago

If there is a setup.py + src directory, then I assume there is a package to be installed with the src layout. There are two ways to make the pyflyte run --remote UX work:

  1. Use ImageSpec and actually install the library
  2. If there is a setup.py assume we want to --copy-all and install the package during execution.
  3. Check the workflow to see what it imports, then automatically copy things that get imported with the correct folder structure. This only works for python only modules.

The solution depends on how task imports module_a. If its import module_a, then we need to install the package or manipulate the folder structure.. If it's import src.module_a, then src is kept as is. (Personally, I do not really like using import src.module_a, but I've see user code like this)