coiled / examples

Examples using Dask and Coiled

Add PyTorch example returning trained model #22

Closed · scharlottej13 closed this 1 year ago

scharlottej13 commented 1 year ago

Adding an example that trains and returns a model (see https://github.com/coiled/examples/pull/20#issuecomment-1629434795)
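Roughly, the pattern the example demonstrates looks like this (a minimal sketch, not the actual code in this PR; the instance type, model, and training loop are placeholders):

```python
# Illustrative sketch only: train on a cloud GPU with Coiled and return the
# fitted model to the local client. The instance type, model, and training
# loop here are placeholders, not the code added in this PR.
import coiled
import torch
from torch import nn


@coiled.function(vm_type="g5.xlarge")  # assumed GPU instance type
def train_all_epochs():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Stand-in model/data; the real example trains on FashionMNIST loaded via torchvision.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(5):  # epochs
        X = torch.randn(64, 1, 28, 28, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()
    return model.to("cpu")  # shipped back to the client when the call returns


model = train_all_epochs()  # runs on the cloud VM, trained model returned locally
```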

This is close, but I'm having some deserialization issues with the two new files:

ModuleNotFoundError                       Traceback (most recent call last)
File /opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py:4297, in update_graph()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/serialize.py:432, in deserialize()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/serialize.py:98, in pickle_loads()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/pickle.py:96, in loads()

File /opt/coiled/env/lib/python3.11/site-packages/cloudpickle/cloudpickle.py:649, in subimport()

ModuleNotFoundError: No module named 'torchvision'

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[26], line 1
----> 1 model = train_all_epochs()

File ~/mambaforge/envs/pytorch/lib/python3.11/site-packages/coiled/run.py:62, in Function.__call__(self, *args, **kwargs)
     61 def __call__(self, *args, **kwargs):
---> 62     return self.client.submit(self.function, *args, **kwargs).result()

File ~/mambaforge/envs/pytorch/lib/python3.11/site-packages/distributed/client.py:319, in Future.result(self, timeout)
    317 if self.status == "error":
    318     typ, exc, tb = result
--> 319     raise exc.with_traceback(tb)
    320 elif self.status == "cancelled":
    321     raise result

RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments

Any debugging tips would be much appreciated! cc @mrocklin @ntabris

ntabris commented 1 year ago

Maybe you just need to rebuild the Coiled software environment? (If it's not rebuilding you could try using --force-rebuild)

I see torchvision in https://github.com/coiled/examples/blob/main/pytorch.yml#L13

but not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson
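Something like this should kick off a fresh build (a sketch; the environment name and spec path are assumed from this repo):

```python
import coiled

# Recreate the Coiled software environment from the conda spec and force a
# rebuild so the newly added torchvision dependency gets picked up.
coiled.create_software_environment(
    name="pytorch",
    conda="pytorch.yml",
    force_rebuild=True,
)
```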

scharlottej13 commented 1 year ago

> Maybe you just need to rebuild the Coiled software environment? (If it's not rebuilding you could try using --force-rebuild)
>
> I see torchvision in https://github.com/coiled/examples/blob/main/pytorch.yml#L13
>
> but not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson

Ah yeah that's it, thank you Nat! (I even looked at the pytorch env and just didn't see it there 🤦‍♀️)

scharlottej13 commented 1 year ago

@mrocklin @ntabris this is ready for review, thanks for your help!

scharlottej13 commented 1 year ago

One thing I'm kind of curious about with this example: given the time it takes to move the data to the GPU, is it still faster to use a GPU vs. a CPU for training? Here's the same example but with c6i.xlarge: https://cloud.coiled.io/clusters/245459/information?account=sarah-johnson&tab=Metrics

mrocklin commented 1 year ago

It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)

ntabris commented 1 year ago

> It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)

I'm +1 on merging as is. It's a toy case (Fashion-MNIST model), but I think the point we're trying to make with this example is that you can easily offload to a GPU in the cloud and also easily return your model back to a local machine that doesn't have a GPU.
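
In other words, something like this (sketch only, continuing the illustrative snippet earlier in the thread):

```python
import torch

# The decorated call runs on a cloud GPU VM; the trained model comes back
# over the wire and can be used on this CPU-only machine.
model = train_all_epochs()
model.eval()

# Local inference on a dummy FashionMNIST-shaped batch (CPU tensor).
batch = torch.randn(8, 1, 28, 28)
with torch.no_grad():
    predictions = model(batch).argmax(dim=1)
print(predictions)
```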

scharlottej13 commented 1 year ago

> It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)
>
> I'm +1 on merging as is. It's a toy case (Fashion-MNIST model), but I think the point we're trying to make with this example is that you can easily offload to a GPU in the cloud and also easily return your model back to a local machine that doesn't have a GPU.

Sounds good, I'm going to merge this then!