coiled / feedback

A place to provide Coiled feedback

Unable to import module from Python package #175

Closed: FabioRosado closed this issue 2 years ago

FabioRosado commented 2 years ago

Opening this issue for Christos Didachos. From Slack:

Hello everyone, I am afraid this question is more Dask-related, but any help would be welcome :) I use Dask distributed and Coiled to do some computations on a cloud cluster. When I run my code locally, it works correctly. When I run a simple function on the distributed system (AWS EC2 workers), it also works correctly. But if my function has dependencies and imports, i.e. uses other packages, then the workers cannot find them (ImportError). For example, the code below works fine:

from dask.distributed import Client
import coiled

def return_one():
    return 128

if __name__ == '__main__':
    # Spin up a two-worker Coiled cluster in eu-west-1 and connect a client to it
    cluster = coiled.Cluster(name="test", n_workers=2, backend_options={"region": "eu-west-1"})
    client = Client(cluster)
    print('Dashboard:', client.dashboard_link)

    # Submit the function twice and collect the futures
    remote_processes = []
    results_list = []
    for i in range(2):
        futures = client.submit(return_one)
        remote_processes.append(futures)

    # Block on each future and gather its result
    for pr in remote_processes:
        results = pr.result()
        results_list.append(results)

The function above executes correctly on the remote workers, and I gather the results from all of them. But if the function were something like this:

from dask.distributed import Client
from my_package import one_function  # a local package, not installed on the workers
import coiled

def return_one():
    x = one_function()
    return x

Then I receive an ImportError for the package "my_package". I suppose the workers do not have my local code and cannot import my packages. How can I get around this? All the examples I found in the tutorials use simple functions (sum, square, etc.) that do not depend on other packages. Thank you in advance!

My reply:

Hello @christos didachos, thank you for your question. When you create a cluster like the one you shared, it uses the Coiled runtime as its software environment. The Coiled runtime includes the most common packages you may need, such as dask, pyarrow, s3fs, scikit-learn, numba, etc. If you need a different package, you will need to create a software environment with:

import coiled

coiled.create_software_environment(
    name="my-custom-env",
    conda={"channels": ["conda-forge"], "dependencies": ["xyz"]},
)

You may want to include the Coiled runtime in your dependencies. If you create a software environment with just one or two packages, make sure dask is one of them. If your package lives in a private repository, you may want to read how to create software environments from private repositories: https://docs.coiled.io/user_guide/software_environment_creation.html#private-repositories. It's always a good idea to pin the same package versions you have installed locally (or create a fresh conda environment) so the cluster doesn't run into version mismatches.
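For example, a minimal sketch of a pinned environment (the environment name, version numbers, and extra package here are hypothetical; match them to your local environment):

import coiled

# Hypothetical pins; replace with the versions from your local
# environment (e.g. from `conda list` or `pip freeze`).
coiled.create_software_environment(
    name="my-pinned-env",
    conda={
        "channels": ["conda-forge"],
        "dependencies": ["python=3.9", "dask=2022.6.0", "distributed=2022.6.0"],
    },
)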

Christos' reply:

Hi Fabio, thank you very much for your prompt response, fully appreciated. I do not think this is related to the software environment. For the example I posted I am using the Coiled runtime environment, but I have tried other software environments too. It is probably confusing because I used the term "package". I can import Python packages (using coiled.create_software_environment and starting the cluster with the parameter software="my_software_env"). The problem occurs when I want to import a function from another Python file. For example, the function return_one() that I initially posted executes on all workers, but it lives in the same script where I define the cluster. If this function were located in another directory and I imported it with from something import return_one, then running the cluster locally works fine, but a remote cluster cannot import the Python module something (it raises an ImportError). Thank you very much
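To make the layout Christos describes concrete, here is a minimal sketch (the file names follow his description; the file contents and driver script are assumed):

# something.py -- a local module next to the driver script,
# present on the client machine but never shipped to the workers
def return_one():
    return 128

# main.py -- hypothetical driver that defines the cluster
from dask.distributed import Client
from something import return_one  # resolves locally, ImportError on the workers
import coiled

if __name__ == '__main__':
    cluster = coiled.Cluster(name="test", n_workers=2)
    client = Client(cluster)
    print(client.submit(return_one).result())  # fails: workers lack something.py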


I've pointed Christos to the UploadDirectory plugin with update_path=True to see if it fixes the issue while I open this one.
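A minimal sketch of that suggestion (the directory path is hypothetical; UploadDirectory is a nanny plugin, so on the distributed versions current at the time it is registered with nanny=True):

from distributed.diagnostics.plugin import UploadDirectory

# Copy the local source directory to every worker and add it to
# sys.path so `from something import return_one` resolves remotely.
client.register_worker_plugin(
    UploadDirectory("path/to/project", update_path=True),
    nanny=True,
)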

cc @phobson

ncclementi commented 2 years ago

I think if this is in a separate file, they'll want to use client.upload_file() instead. See https://docs.dask.org/en/stable/how-to/manage-environments.html#send-source, and I think this SO question will be very helpful: https://stackoverflow.com/questions/39295200/can-i-use-functions-imported-from-py-files-in-dask-distributed
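A minimal sketch of that approach (assuming the module from the example lives in a single file, something.py, next to the driver script):

# Ship the single local module to every worker in the cluster
client.upload_file("something.py")

# The import inside the submitted function now resolves on the workers
future = client.submit(return_one)
print(future.result())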

I don't think UploadDirectory is what they need in this case.

phobson commented 2 years ago

Closing this because 1) the documentation that Naty posted is correct, and 2) package_sync is on its way and will solve this.
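For later readers, a sketch of what package_sync enabled once it landed (using the flag as it was introduced; it scans the local Python environment and replicates it on the cluster, so no hand-built software environment is needed):

import coiled

# package_sync inspects the local environment and builds a matching
# one on the workers, including locally importable source files.
cluster = coiled.Cluster(name="test", n_workers=2, package_sync=True)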