PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

GitHub submodule does not get cloned with Prefect 2.10.9 #9555

Closed akmukherjee closed 1 year ago

akmukherjee commented 1 year ago

Bug summary

I have a bug report for submodule initialization in Prefect projects. Currently my prefect.yaml looks like this:

# Generic metadata about this project
name: icdc-dataloader
prefect-version: 2.10.9

# build section allows you to manage and build docker images
build: null

# push section allows you to manage if and how this project is uploaded to remote locations
push: null

# pull section allows you to provide instructions for cloning this project in remote locations
pull:
- prefect.projects.steps.git_clone_project:
    repository: https://github.com/akmukherjee/prefect-test
    branch: main
    include_submodules: true

This setup does not pull the submodules associated with this repo. It is easy to reproduce: I have reproduced it in my code here, and it has been discussed here and merged here.

Reproduction

The prefect.yaml shown in the bug summary above reproduces the issue.

Error

Please see the screenshots here: https://github.com/PrefectHQ/prefect/issues/9462

Versions

Version:             2.10.9
API version:         0.8.4
Python version:      3.8.3
Git commit:          1655c1fa
Built:               Thu, May 11, 2023 2:29 PM
OS/Arch:             win32/AMD64
Profile:             default
Server type:         cloud

Additional context

No response

akmukherjee commented 1 year ago

@taylor-curran : Just created the bug report.

desertaxle commented 1 year ago

Thanks for the issue @akmukherjee, and thanks for sharing a project to use for reproduction! If I set the working directory for a process work pool, I can see that the git submodule is being cloned along with the base repository. However, the error where Python is not able to load your submodule is a puzzling one. I will continue investigating that error and report back with what I find!
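One way to confirm that a submodule was actually checked out in a working directory is to compare the entries in .gitmodules against what is on disk. The following is a minimal sketch, assuming the standard .gitmodules layout; submodule_checkout_status is a hypothetical helper written for illustration and is not part of Prefect.

```python
import configparser
from pathlib import Path


def submodule_checkout_status(repo_root):
    """Map each submodule path listed in .gitmodules to True if its
    directory exists and is non-empty, else False.

    Hypothetical helper for illustration only; not a Prefect API.
    """
    root = Path(repo_root)
    gitmodules = root / ".gitmodules"
    if not gitmodules.is_file():
        return {}

    # .gitmodules uses git's INI-like config syntax; interpolation is
    # disabled so that '%' characters in URLs cannot raise errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(gitmodules.read_text())

    status = {}
    for section in parser.sections():
        if not section.startswith("submodule"):
            continue
        rel_path = parser.get(section, "path", fallback=None)
        if rel_path is None:
            continue
        target = root / rel_path
        # A submodule that was never initialized is typically left
        # behind as an empty directory.
        status[rel_path] = target.is_dir() and any(target.iterdir())
    return status
```

Calling submodule_checkout_status on the directory the worker cloned into returns a dict such as {"some/submodule/path": True}; a False value suggests the submodule was not initialized.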

akmukherjee commented 1 year ago

Thanks Alex! Very Respectfully, Amit

desertaxle commented 1 year ago

If you import from a submodule inside your flow, I found that you'll get the import error you saw (it seems related to https://github.com/PrefectHQ/prefect/issues/9542). When I moved the import of your git submodule to the top of the file, your example flow started working. You can see the change I made on my fork of your repository.

The git submodule functionality of the git_clone_project step appears to work as expected. I will close this issue, but if you encounter any other issues with git submodules in projects, feel free to open a new issue or reopen this one!
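The difference between a top-level import and an import inside a function can be reproduced without Prefect. The sketch below rests on an assumption that is not confirmed in this thread: that the flow script is loaded while the cloned directory is temporarily on sys.path, and that the entry is removed again before the function body runs. load_script and all module names here are hypothetical stand-ins.

```python
import importlib.util
import sys
import tempfile
import textwrap
from pathlib import Path


def load_script(path):
    """Load a script with its directory on sys.path only during loading.

    Hypothetical loader; it mimics the assumed behavior, not Prefect's code.
    """
    parent = str(path.parent)
    sys.path.insert(0, parent)
    try:
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    finally:
        sys.path.remove(parent)


def run_import_demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Two stand-ins for code that lives in a git submodule.
        (root / "eager_submodule_demo.py").write_text("VALUE = 'eager'\n")
        (root / "lazy_submodule_demo.py").write_text("VALUE = 'lazy'\n")
        (root / "flow_script_demo.py").write_text(textwrap.dedent("""
            import eager_submodule_demo  # resolved while the script loads

            def top_level():
                return eager_submodule_demo.VALUE

            def deferred():
                import lazy_submodule_demo  # resolved only when called
                return lazy_submodule_demo.VALUE
        """))
        module = load_script(root / "flow_script_demo.py")

        eager = module.top_level()
        try:
            lazy = module.deferred()
        except ModuleNotFoundError as exc:
            lazy = f"ModuleNotFoundError: {exc.name}"
        return eager, lazy
```

Under these assumptions, run_import_demo() returns "eager" for the top-level import and a ModuleNotFoundError marker for the deferred one, because the deferred import is looked up after the directory has left sys.path.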

akmukherjee commented 1 year ago

Thanks @desertaxle. That does indeed work. I tested it out with your fork of the repo. Thanks a bunch!

MrChadMWood commented 9 months ago

This will also impact any implementation of lazy loading within submodules that point to other submodules.

For example, I tried building a Prefect flow (deployment.py) that simply wraps each function from main.py in a new function with the same name. Something like this:

# main.py

def foo(arg1, arg2):
    do_stuff()

# deployment.py
from prefect import task

@task
def foo(*args, **kwargs):
    # Deferred import: resolved only when the task runs
    from main import foo as wrapped_func
    return wrapped_func(*args, **kwargs)

This fails due to the issue described in this thread. I changed my approach to this instead:

# deployment.py
from prefect import task

from main import foo as _foo

@task
def foo(*args, **kwargs):
    return _foo(*args, **kwargs)

but then it fails because main.py also implements lazy loading:

prefect_worker  |   File "/tmp/tmpck2dkmalprefect/git_repo-branch_name/deployment.py", line 41, in datasource_extract
prefect_worker  |     return _datasource_extract(*args, **kwargs)
prefect_worker  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
prefect_worker  |   File "/tmp/tmpck2dkmalprefect/git_repo-branch_name/main.py", line 69, in datasource_extract
prefect_worker  |     from datasource.src.extract import get_session
prefect_worker  | ModuleNotFoundError: No module named 'datasource'

For reference, my project directory looks something like this:

./project/
├── deployment.py
├── main.py
├── datasource/
│   └── src/
│       ├── extract.py
│       └── transform.py
└── storage/
    └── src/
        └── load.py
I guess the next step would be to completely remove lazy loading of submodules from everywhere in the project.
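A possibly less invasive option than removing every lazy import is to pre-import the nested modules at the top of deployment.py. Python checks sys.modules before searching sys.path, so a later lazy import of an already loaded module should succeed even if the project directory is no longer on the path. This is a sketch I have not verified against Prefect itself; it assumes the task runs in the same process that loaded deployment.py, and load_script and the *_demo module names are hypothetical.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_script(path):
    """Hypothetical loader: the script's directory is on sys.path only
    while the script is being executed."""
    parent = str(path.parent)
    sys.path.insert(0, parent)
    try:
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    finally:
        sys.path.remove(parent)


MAIN_SRC = '''\
def datasource_extract():
    # Lazy import, similar to main.py in the traceback above
    from datasource_pkg_demo.src.extract import get_session
    return get_session()
'''

DEPLOYMENT_SRC = '''\
# Pre-import the nested module so that it is cached in sys.modules
import datasource_pkg_demo.src.extract  # noqa: F401
from main_lazy_demo import datasource_extract as _datasource_extract

def datasource_extract():
    return _datasource_extract()
'''


def run_preimport_demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "datasource_pkg_demo" / "src"
        src.mkdir(parents=True)
        (root / "datasource_pkg_demo" / "__init__.py").write_text("")
        (src / "__init__.py").write_text("")
        (src / "extract.py").write_text(
            "def get_session():\n    return 'session'\n"
        )
        (root / "main_lazy_demo.py").write_text(MAIN_SRC)
        (root / "deployment_preimport_demo.py").write_text(DEPLOYMENT_SRC)

        deployment = load_script(root / "deployment_preimport_demo.py")
        # The project directory is no longer on sys.path here, yet the
        # lazy import inside main_lazy_demo is served from sys.modules.
        return deployment.datasource_extract()
```

The trade-off is that deployment.py has to list every module that is lazily imported anywhere downstream, and the cache would not help if a task executes in a separate process that never ran those top-level imports.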