Add Basic PyTorch example

mrocklin commented 1 year ago

I ran into a bunch of issues with environments and config, but what's here works. It isn't faster than CPUs, though, mostly because data loading is more expensive than training in this example.
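One way to check the "data loading costs more than training" claim is to time the two phases separately. The helper below is a minimal sketch, not code from this PR: `time_load_vs_compute` is a hypothetical name, `batches` stands in for something like a DataLoader, and `step` stands in for one training step. On a GPU, `step` would need to wait for the device to finish (for example by calling `torch.cuda.synchronize()`) for the compute number to mean anything.

```python
import time


def time_load_vs_compute(batches, step):
    """Return (load_seconds, compute_seconds) for one pass over `batches`.

    `batches` is any iterable (hypothetically a DataLoader) and `step` is a
    callable that processes one batch (hypothetically one training step).
    """
    load = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)  # time spent waiting on the input pipeline
        except StopIteration:
            break
        mid = time.perf_counter()
        step(batch)  # time spent in the training step itself
        end = time.perf_counter()
        load += mid - start
        compute += end - mid
    return load, compute
```

If `load` dominates `compute`, faster accelerators will not help much until the input pipeline is sped up.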

Loom video with some thoughts: https://www.loom.com/share/0c38fdb3bd334756b49df80d301102ea

Some issues:

dask 2023.03 doesn't work with optuna
dask-cuda doesn't work with dask 2023.03 or pandas 2. Ran into other issues. I eventually abandoned this and just specified nthreads=1
Needed pytorch channel to get recent torchvision (older versions don't work with pytorch 2.0)

cc @jrbourbeau @ntabris @jacobtomlinson

mrocklin commented 1 year ago

Just adding evidence of "we're not really using the GPU here"

mrocklin commented 1 year ago

Also pinging @mccarty in case he knows of nicer examples than this or of anyone who might be able to answer that question well. It would be really nice to have something that really shows off the cost benefits of GPUs here.

ntabris commented 1 year ago

Addressing the "I only got 6 of 10 workers" issue: I've found that GPU availability is better in us-west-2. Just tried this and got all the requested workers...

import coiled

cluster = coiled.Cluster(
    worker_vm_types="g4dn.xlarge",  # GPU instance type
    n_workers=10,
    account="dask-engineering",
    backend_options={"region_name": "us-west-2"},  # region where GPU availability was better
)

(I tried to get 30 g4dn.xlarge in us-west-2 and got 16, so still pretty constrained, but I think less so than us-east regions)
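To catch a shortfall like "6 of 10 workers" early, a script can wait for the requested count and report if it never arrives. This is a rough sketch under assumptions: `got_all_workers` is a hypothetical helper, `client` is assumed to be a `distributed.Client` connected to the cluster, and the exact exception type raised on timeout may differ between versions, which is why both timeout classes are caught.

```python
import asyncio


def got_all_workers(client, requested, timeout=300):
    """Return True if `requested` workers connect within `timeout` seconds.

    Assumes `client` exposes wait_for_workers(n_workers, timeout=...),
    as distributed.Client does.
    """
    try:
        client.wait_for_workers(requested, timeout=timeout)
    except (TimeoutError, asyncio.TimeoutError):
        # Not enough capacity arrived in time, e.g. limited GPU availability.
        return False
    return True
```

A caller could then decide whether to proceed with fewer workers or retry in another region.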

mrocklin commented 1 year ago

Sounds good. Thanks Nat.

charlesbluca commented 1 year ago

> dask-cuda doesn't work with dask 2023.03 or pandas 2. Ran into other issues. I eventually abandoned this and just specified nthreads=1

Interested in if things broke here at the environment solve level or later on? I was able to solve the RAPIDS environment you showed in the Loom locally (I switched over to the 23.04 nightlies so we could use unpinned Dask), though I imagine that pulling in cuDF would introduce a pandas<2 constraint.

> Also pinging @mccarty in case he knows of nicer examples than this

cc @mmccarty

mrocklin commented 1 year ago

> Interested in if things broke here at the environment solve level or later on?

Later on. For example it would bring in versions of libraries, like pandas 2.0, that didn't work when I went to run things. This happened a few times and so I eventually just moved on.
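Since the environment solved fine and only failed at run time, a fail-fast version check at the top of the notebook might surface this sooner. The sketch below is an assumption-laden illustration: `find_too_new` and `major_version` are hypothetical helpers, the cap of pandas major version 1 mirrors the pandas<2 constraint mentioned above, and the parsing only handles plain versions such as `2.0.1`.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a plain version string such as '2.0.1'."""
    return int(version_string.split(".")[0])


def find_too_new(max_major, installed=None):
    """Return {package: version} for packages whose major version exceeds a cap.

    max_major: mapping such as {"pandas": 1} (illustrative cap, i.e. pandas<2).
    installed: optional {package: version} mapping, mainly for testing; when
    omitted, versions are looked up with importlib.metadata.
    """
    problems = {}
    for package, cap in max_major.items():
        if installed is not None:
            version = installed.get(package)
        else:
            try:
                version = metadata.version(package)
            except metadata.PackageNotFoundError:
                version = None  # package not installed, nothing to flag
        if version is not None and major_version(version) > cap:
            problems[package] = version
    return problems
```

Calling `find_too_new({"pandas": 1})` before any real work would report an unexpectedly new pandas without waiting for a failure deep in a task.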

dchudz commented 1 year ago

It makes me sad that our product didn't make it easier to know that AWS availability was the issue. But I'm guessing that a user who cared how many instances they got might have gotten further.

Matt, this is an example where a user might care about that infrastructure details page that you always say only platform engineers care about. I think the info isn't really exposed elsewhere, though maybe it should be.

mmccarty commented 1 year ago

Hey @mrocklin, sorry I'm late to the party here. I'll play around with this example and see what can be done to show off GPUs more effectively. Also, note that pandas 2 support is in the works.

mrocklin commented 1 year ago

Thanks Mike!

coiled / examples

Add Basic PyTorch example #5