coiled / feedback

A place to provide Coiled feedback

unclear how to download data from web into a remote client #120

Closed murphyk closed 1 year ago

murphyk commented 3 years ago

I am trying to modify the dask pytorch finetuning example so that it runs on a coiled client. My modified code is here. The script downloads the data locally using

import urllib.request
import zipfile
filename, _ = urllib.request.urlretrieve("https://download.pytorch.org/tutorial/hymenoptera_data.zip", "data.zip")
zipfile.ZipFile(filename).extractall()

Not surprisingly, when I run predictions = dask.compute(*predictions), it fails to find the data, saying

[Errno 2] No such file or directory: 'hymenoptera_data/val/ants/181942028_961261ef48.jpg'

It is very unclear how a remote task can access this kind of image data - does it need to be downloaded into some kind of dask format? How does that work for a set of images stored in a zip file? What if the data is stored at https://www.tensorflow.org/datasets or https://pytorch.org/vision/0.8/datasets.html? How can we work with such data?

FabioRosado commented 3 years ago

Hello @murphyk, thank you for your question. The example notebooks are a good starting place to see how you can use Coiled.

If you plan on working on a variation of an example notebook (or starting from scratch), we would recommend creating a notebook with the coiled.create_notebook command. That command lets you specify any file(s) you might need on your Coiled notebook.

In this snippet I am using the dependencies that the hyperband-optimization notebook uses. Feel free to modify it to suit your needs.

import coiled

coiled.create_notebook(
    name="pytorch-finetuning", 
    conda={
        "channels": ["conda-forge", "pytorch", "defaults"], 
        "dependencies": ["coiled=0.0.36", "dask-ml", "dask>=2.29.0", "matplotlib", "numpy", "pandas>=1.1.0", "python=3.8", "pytorch>1.1.0", "s3fs", "scipy", "skorch"]
    },
    files=["..."]
)

When dealing with a zip file like hymenoptera_data.zip, there are two ways:

Regarding your question of how to read data from different sources, Dask can read from various remote data locations including HTTP. If it's impractical to upload those files to the notebook, you could open a local notebook, import coiled and run the computations on coiled instead of using a local cluster.
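
For instance, something along these lines (an untested sketch) should let Dask fetch that archive over HTTP, with the download expressed as delayed tasks that run on the workers:

import dask
import dask.bytes

# read_bytes builds delayed blocks, so the actual HTTP fetch happens on
# whichever worker computes them rather than on the local machine.
sample, blocks = dask.bytes.read_bytes(
    "https://download.pytorch.org/tutorial/hymenoptera_data.zip",
    blocksize=None,  # keep the archive as a single block instead of splitting it
)
(archive,) = dask.compute(blocks)  # nested list holding the raw zip bytes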

I hope this helps you.

murphyk commented 3 years ago

Thanks, I'll take a look. Meanwhile I tried to repeat the above experiment using this colab. This downloads the data into the colab VM and starts a coiled cluster for dask. Interestingly, when I run the code, it seems to find the files, but gives a different error:

distributed.protocol.pickle - INFO - Failed to deserialize b"\x80\x05\x95O\x03\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support\x94\x8c\x12unpickle_exception\x94\x93\x94(\x8c\x08builtins\x94\x8c\tTypeError\x94\x93\x94\x8c'an integer is required (got type bytes)\x94\x85\x94Nh\x00\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c\x12distributed.worker\x94\x8c\x08__file__\x94\x8cH/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py\x94u\x8c\x06f_code\x94h\n\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h\x14\x8c\x07co_name\x94\x8c\x10ensure_computing\x94ububM\xfe\th\n\x8c\tTraceback\x94\x93\x94)\x81\x94}\x94(\x8c\x08tb_frame\x94h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11h\x12h\x13h\x14uh\x15h\x17)\x81\x94}\x94(h\x1ah\x14h\x1b\x8c\x17_maybe_deserialize_task\x94ubub\x8c\ttb_lineno\x94M\xcc\t\x8c\x07tb_next\x94h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11h\x12h\x13h\x14uh\x15h\x17)\x81\x94}\x94(h\x1ah\x14h\x1b\x8c\x0c_deserialize\x94ububh(M\x1d\rh)h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11h\x12h\x13h\x14uh\x15h\x17)\x81\x94}\x94(h\x1ah\x14h\x1b\x8c\x0eloads_function\x94ububh(M\x14\rh)h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11\x8c\x1bdistributed.protocol.pickle\x94h\x13\x8cQ/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/protocol/pickle.py\x94uh\x15h\x17)\x81\x94}\x94(h\x1ah@h\x1b\x8c\x05loads\x94ububh(KKubububub\x87\x94R\x94t\x94R\x94."
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/distributed/protocol/pickle.py", line 75, in loads
    return pickle.loads(x)
ValueError: unsupported pickle protocol: 5
distributed.protocol.core - CRITICAL - Failed to deserialize
...
FabioRosado commented 3 years ago

I just had a look and it seems that the exception you are seeing, ValueError: unsupported pickle protocol: 5, refers to a mismatch of Python versions. That notebook is using Python 3.7 whilst the example notebook is running Python 3.8; pickle protocol 5 was only added in Python 3.8, so the 3.7 side cannot deserialize it, and that could be the reason why.
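
A quick way to check which protocol each side supports (rough, untested snippet):

import pickle
import sys

# Python 3.7 tops out at pickle protocol 4; protocol 5 was added in Python 3.8,
# which is what the traceback above is complaining about.
print(sys.version, pickle.HIGHEST_PROTOCOL)

# The same check can be run on the workers, e.g.
# client.run(lambda: (sys.version, pickle.HIGHEST_PROTOCOL))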

Perhaps if it's not much of a hassle, you could try to recreate that notebook locally or on a coiled notebook - that way the versions will match and things should run more smoothly.

murphyk commented 3 years ago

I would rather launch stuff from colab than from my laptop (a single colab often suffices, but sometimes I want to run things in parallel). So I made a software environment containing Python 3.7 to match colab:

env = coiled.create_software_environment(
    name="pytorch-finetuning-37", 
    conda={
        "channels": ["conda-forge", "pytorch", "defaults"], 
        "dependencies": ["coiled=0.0.36", "dask-ml", "dask>=2.29.0", "matplotlib", "numpy", "pandas>=1.1.0", 
                         "python=3.7", # match colab
                         "pytorch>1.1.0", "s3fs", "scipy", "skorch"]
    }
)

I then create the cluster

cluster = coiled.Cluster(
    n_workers=2, #10 # use 2 to make startup time faster
    name = "pytorch-finetuning-37",
    software = "pytorch-finetuning-37"
)

and run some code

dmodel = ...   # delayed model
batches = ...  # delayed version of local file loading
predictions = [predict(batch, dmodel) for batch in batches]
predictions = dask.compute(*predictions)

Now the error I get is the same as when I run your notebook on coiled, namely it cannot find the files, since they are local (in this case, to colab):


FileNotFoundError: [Errno 2] No such file or directory: 'hymenoptera_data/val/bees/2104135106_a65eede1de.jpg'
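
Presumably a check along these lines would confirm the mismatch (untested):

import os
from distributed import Client

client = Client(cluster)

# The extracted dataset exists on the colab VM...
print(os.listdir("."))
# ...but each coiled worker has its own filesystem, where the
# hymenoptera_data directory was never downloaded.
print(client.run(os.listdir, "."))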

Source code: https://github.com/probml/pyprobml/blob/master/notebooks/coiled_pytorch_finetune.ipynb

FabioRosado commented 3 years ago

Are you able to send the file using the distributed client?

from distributed import Client

client = Client(cluster)

client.upload_file(<zip file>)

Reference: distributed.Client.upload_file
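
If that works, the archive would still need to be unpacked on each worker before the image paths resolve. Something along these lines might do it (untested, and assuming the uploaded file lands in each worker's local directory):

import os
import zipfile

def extract_uploaded_zip(dask_worker, filename="data.zip"):
    # client.run passes the Worker instance in via the special dask_worker
    # argument; uploaded files should be sitting in its local directory.
    local_dir = dask_worker.local_directory
    zipfile.ZipFile(os.path.join(local_dir, filename)).extractall(local_dir)
    return os.listdir(local_dir)

# Run the extraction on every worker in the cluster.
client.run(extract_uploaded_zip)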

shughes-uk commented 1 year ago

Package sync should resolve these issues. Closing as stale, as we have not heard from the user in quite some time.