Cubed example using Coiled on AWS

rsignell commented 4 months ago

I tried following the instructions to run a Cubed demo using Coiled on AWS following these instructions, and the environment creation went smoothly:

mamba create -n cubed-coiled-examples python=3.9 pip -y
conda activate cubed-coiled-examples
pip install 'cubed[coiled]'

but when I tried running the example:

git clone git@github.com:tomwhite/cubed.git
cd cubed/examples/coiled/aws
export AWS_PROFILE=esip-qhub
python coiled-add-asarray.py "s3://esip-qhub/rsignell/testing/cubed"

I got:

Traceback (most recent call last):
  File "/home/rsignell/repos/cubed/examples/coiled/aws/coiled-add-asarray.py", line 22, in <module>
    res = c.compute(
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/site-packages/cubed/core/array.py", line 136, in compute
    result = compute(
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/site-packages/cubed/core/array.py", line 225, in compute
    plan.execute(
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/site-packages/cubed/core/plan.py", line 171, in execute
    executor.execute_dag(
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/site-packages/cubed/runtime/executors/coiled.py", line 31, in execute_dag
    coiled_function = make_coiled_function(pipeline.function, coiled_kwargs)
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/site-packages/cubed/runtime/executors/coiled.py", line 13, in make_coiled_function
    return coiled.function(**coiled_kwargs)(execution_stats(func))
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/site-packages/coiled/function.py", line 404, in decorator
    Function(func, cluster_kwargs, keepalive=keepalive, environ=environ, local=local, name=name)
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/site-packages/coiled/function.py", line 49, in __init__
    self._code = inspect.getsource(function)
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/inspect.py", line 1024, in getsource
    lines, lnum = getsourcelines(object)
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/inspect.py", line 1006, in getsourcelines
    lines, lnum = findsource(object)
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/inspect.py", line 817, in findsource
    file = getsourcefile(object)
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/inspect.py", line 697, in getsourcefile
    filename = getfile(object)
  File "/home/rsignell/miniforge3/envs/cubed-coiled-examples/lib/python3.9/inspect.py", line 677, in getfile
    raise TypeError('module, class, method, function, traceback, frame, or '
TypeError: module, class, method, function, traceback, frame, or code object was expected, got partial

I'm raising the issue here as I'm guessing something on the Coiled side changed?
But of course I don't really know so I'm prepared to be told to post it over at cubed... :)

jrbourbeau commented 4 months ago

Thanks @rsignell. Looking at the traceback, I think this is probably on our end. I'll take closer look

jrbourbeau commented 4 months ago

We have some code collection logic that doesn't handle functools.partial objects well. I'll push a fix up for that

In the meantime, I'm not sure where the partial is coming from inside cubed. Unpacking the original function by getting the .func attribute on the partial-wrapped function could be a temporary workaround

rsignell commented 4 months ago

Unpacking the original function by getting the .func attribute on the partial-wrapped function could be a temporary workaround

Is this a workaround I could implement? I'm afraid I don't understand this! 😕

jrbourbeau commented 4 months ago

Ah, sorry, I meant if you're playing around with a development version of cubed, then you could make some changes to fix things. We should have a fix out in coiled soon, so it may be faster to just wait for that

jrbourbeau commented 4 months ago

We'll likely have a new release out sometime later today or tomorrow, but if you want to try a dev coiled release with the fix you can pip install coiled==1.3.12.dev10

rsignell commented 4 months ago

Thanks @jrbourbeau!

Both tests ran successfully with pip install coiled==1.3.12.dev12.

Here's the results of the larger coiled-add-random.py test:

python coiled-add-random.py "s3://esip-qhub/rsignell/testing/cubed"

Screenshot 2024-02-01 070442 Screenshot 2024-02-01 070525

rsignell commented 4 months ago

@tomwhite or @TomNicholas does the Coiled performance here appear (from the screenshots above) to be comparable to other executors? If not, I'm hoping to work with Coiled engineers to see if they have ideas about how to improve performance.

tomwhite commented 4 months ago

Thanks for looking at this @rsignell, and for the bugfix @jrbourbeau.

I tried the Coiled executor a while back and had a similar experience with the number of instances not scaling as much as it could have done (see https://github.com/tomwhite/cubed/pull/309). It would be great if Coiled could ramp up the number of instances faster (is it possible to tell it to do that?). Other executors like Lithops or Modal can scale up to 100s or 1000s of instances in seconds.

TomNicholas commented 4 months ago

We don't yet have good benchmarks so it's hard to know what the total time should be (and a lot has changed since I last ran anything at scale), but here's a way to get an idea of what the performance should be.

If I attempt to run this example locally[^1] (just for demonstration purposes) then you see a progress bar for each step in the plan (i.e. each intermediate array that needs to be written to zarr).

Ignoring the "create arrays", there are 3 steps, each of which consists of 100 embarrassingly parallel tasks (each is a chunk to be saved to an intermediate zarr store). You can see them laid out in the plan:

If you don't have 100 chunks being computed in parallel on Coiled between each step then it could go faster.[^2] If you were to run this on Lambda/GCF then that's what you would see.

Assuming your coiled cluster is computing one chunk per worker, it looks like you're only computing 10 chunks at once, so it could be at least 10x faster (before any other graph optimizations). The bottleneck here is the dynamic scaling up of the Coiled Cluster to match the very wide task graphs that Cubed is passing it.

[^1]: There's probably a neater way to get to this information that I'm blanking on right now. EDIT: I could simply have hovered over the graph nodes in jupyter to see the number of tasks for that node.

[^2]: In fact I think since the first two arrays are independent, and since https://github.com/tomwhite/cubed/issues/80 was fixed, the first part of the calculation should be able to compute 200 chunks at once.

mrocklin commented 4 months ago

Coiled uses raw VMs. This means that things are super-cheap, but also that there's a minute-long cold startup time one has to deal with (systems like Modal get around this by having VMs up all the time, but this is expensive and assumes that you're in their region/account. This gets rolled into the 10x cost difference).

Because things have minute-long startup times, the adaptive settings are configured to scale up to something large enough so that it completes in ~5 minutes. If we're going to spend one minute spinning up a VM, we want that VM to be active for at least a couple more minutes before spinning it down.

You can tweak those configurations if you like, but you'll never get sub-minute with Coiled. There's a cost-speed tradeoff here.

shughes-uk commented 4 months ago

https://docs.coiled.io/user_guide/clusters/scale.html#adaptive-scaling adaptive scaling docs if you want to tweak some settings.

I suspect Cubed can probably make better decisions about when to scale in/out though, and just achieve that through manual scale calls.

tomwhite commented 4 months ago

I suspect Cubed can probably make better decisions about when to scale in/out though, and just achieve that through manual scale calls.

Yes, we could pre-scale to the maximum width of the graph.

rsignell commented 4 months ago

I was planning on using this for rechunking a large Zarr dataset (reading 1TB Zarr from S3, writing a new rechunked 1 TB Zarr to S3). For this use case I don't care about super fast autoscaling, do I?

mrocklin commented 4 months ago

Yes, we could pre-scale to the maximum width of the graph

Cost-wise this might be bad, especially if the tasks run quickly. If you spend a minute spinning up VMs that then just run for a second, then you're spending the vast majority of VM-rental-time doing nothing other than setting up.

I was planning on using this for rechunking a large Zarr dataset (reading 1TB Zarr from S3, writing a new rechunked 1 TB Zarr to S3). For this use case I don't care about super fast autoscaling, do I?

Not at all. This is a case where Coiled's approach of trading slow spin-up time for dirt cheap spot VMs is ideal. You're spending $0.02 per CPU hour vs $0.20 per CPU hour, as long as you're fine waiting a few minutes for everything to complete.

coiled / feedback

Cubed example using Coiled on AWS #271