Closed scharlottej13 closed 1 year ago
Maybe you just need to rebuild the Coiled software environment? (If it's not rebuilding you could try using --force-rebuild.)
I see torchvision in https://github.com/coiled/examples/blob/main/pytorch.yml#L13 but not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson
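A quick way to confirm a mismatch like this is to check, inside the environment in question, whether the packages can actually be imported. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, and it assumes you can run it in the same environment the cluster uses.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be imported here."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Package names taken from the discussion above; adjust for your environment
print(missing_packages(["torch", "torchvision"]))
```

An empty list means every named package is importable; any name that comes back is missing from that environment.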
Ah yeah that's it, thank you Nat! (I even looked at the pytorch env and just didn't see it there 🤦♀️)
@mrocklin @ntabris this is ready for review, thanks for your help!
One thing I'm kind of curious about w/ this example: given the time it takes to move the data to the GPU, is it still faster to train on a GPU than on a CPU? Here's the same example but w/ c6i.xlarge: https://cloud.coiled.io/clusters/245459/information?account=sarah-johnson&tab=Metrics
It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)
I'm +1 on merging as is. It's a toy case (Fashion-MNIST model), but I think the point we're trying to make with this example is that you can easily offload to a GPU in the cloud, and also easily return your model to a local machine that doesn't have a GPU.
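For readers following along, the "train on a GPU, return the model to a machine without one" pattern can be sketched in plain PyTorch. This is a minimal illustration, not the code from this PR: `make_model` and `train_one_step` are hypothetical names, random tensors stand in for the real dataset, and the step that would run the function remotely is left out so the sketch runs anywhere.

```python
import os
import tempfile

import torch
from torch import nn


def make_model():
    # Tiny stand-in network for 28x28 grayscale images such as Fashion-MNIST
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))


def train_one_step():
    """Train on whatever accelerator is available, then hand back a CPU model."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = make_model().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Random tensors stand in for a real batch of images and labels
    images = torch.randn(8, 1, 28, 28, device=device)
    labels = torch.randint(0, 10, (8,), device=device)

    loss = nn.functional.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Move the weights to CPU before returning so the caller needs no GPU
    return model.to("cpu")


model = train_one_step()

# Save only the weights, then reload them onto CPU on the local machine
path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(model.state_dict(), path)

restored = make_model()
restored.load_state_dict(torch.load(path, map_location="cpu"))
```

Moving the model to CPU inside the function means the returned object holds no CUDA tensors, and `map_location="cpu"` is an extra guard when loading weights that may have been saved from a GPU.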
Sounds good, I'm going to merge this then!
Adding an example that trains and returns a model (see https://github.com/coiled/examples/pull/20#issuecomment-1629434795)
This is close, but I'm having some deserialization issues. Explaining this in terms of the two new files:
run/pytorch-test.py
This works! This is a good minimal example of how to return a model from a function running on a remote GPU, save it locally, and then load the CPU version.
run/pytorch-train.py
This is the real example. I'm getting a deserialization error, and I think it's related to loading the mnist dataset, since the traceback includes a ModuleNotFoundError for torchvision (cluster here), full traceback:
Any debugging tips would be much appreciated! cc @mrocklin @ntabris
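A ModuleNotFoundError during deserialization is what pickle-style serialization produces when an object's defining module is importable on the sending side but not on the receiving side. The sketch below reproduces that with the standard library only. It is an illustration of the general mechanism, not a reproduction of the traceback above: `fake_vision` is a made-up module standing in for a package installed in one environment and missing from the other.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Create a throwaway module on disk to play the role of a package that is
# installed on one machine only ("fake_vision" is a made-up name).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "fake_vision.py"), "w") as f:
    f.write("class Dataset:\n    def __init__(self, n):\n        self.n = n\n")

# "Sender" side: the module is importable, so pickling succeeds.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
fake_vision = importlib.import_module("fake_vision")
payload = pickle.dumps(fake_vision.Dataset(3))

# "Receiver" side: simulate an environment without the module.
sys.path.remove(module_dir)
del sys.modules["fake_vision"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    error = None
except ModuleNotFoundError as exc:
    # Unpickling has to import the defining module, which is not available
    error = exc

print(type(error).__name__, error)
```

The pickled bytes only record where the class lives, so the receiving environment must be able to import that module itself; installing the missing package there resolves this kind of error.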