Open rsignell opened 4 months ago
Thanks @rsignell. Looking at the traceback, I think this is probably on our end. I'll take closer look
We have some code collection logic that doesn't handle functools.partial
objects well. I'll push a fix up for that
In the meantime, I'm not sure where the partial is coming from inside cubed
. Unpacking the original function by getting the .func
attribute on the partial
-wrapped function could be a temporary workaround
Unpacking the original function by getting the .func attribute on the partial-wrapped function could be a temporary workaround
Is this a workaround I could implement? I'm afraid I don't understand this! 😕
Ah, sorry, I meant if you're playing around with a development version of cubed
, then you could make some changes to fix things. We should have a fix out in coiled
soon, so it may be faster to just wait for that
We'll likely have a new release out sometime later today or tomorrow, but if you want to try a dev coiled
release with the fix you can pip install coiled==1.3.12.dev10
Thanks @jrbourbeau!
Both tests ran successfully with pip install coiled==1.3.12.dev12
.
Here's the results of the larger coiled-add-random.py
test:
python coiled-add-random.py "s3://esip-qhub/rsignell/testing/cubed"
@tomwhite or @TomNicholas does the Coiled performance here appear (from the screenshots above) to be comparable to other executors? If not, I'm hoping to work with Coiled engineers to see if they have ideas about how to improve performance.
Thanks for looking at this @rsignell, and for the bugfix @jrbourbeau.
I tried the Coiled executor a while back and had a similar experience with the number of instances not scaling as much as it could have done (see https://github.com/tomwhite/cubed/pull/309). It would be great if Coiled could ramp up the number of instances faster (is it possible to tell it to do that?). Other executors like Lithops or Modal can scale up to 100s or 1000s of instances in seconds.
We don't yet have good benchmarks so it's hard to know what the total time should be (and a lot has changed since I last ran anything at scale), but here's a way to get an idea of what the performance should be.
If I attempt to run this example locally[^1] (just for demonstration purposes) then you see a progress bar for each step in the plan (i.e. each intermediate array that needs to be written to zarr).
Ignoring the "create arrays", there are 3 steps, each of which consists of 100 embarrassingly parallel tasks (each is a chunk to be saved to an intermediate zarr store). You can see them laid out in the plan:
If you don't have 100 chunks being computed in parallel on Coiled between each step then it could go faster.[^2] If you were to run this on Lambda/GCF then that's what you would see.
Assuming your coiled cluster is computing one chunk per worker, it looks like you're only computing 10 chunks at once, so it could be at least 10x faster (before any other graph optimizations). The bottleneck here is the dynamic scaling up of the Coiled Cluster to match the very wide task graphs that Cubed is passing it.
[^1]: There's probably a neater way to get to this information that I'm blanking on right now. EDIT: I could simply have hovered over the graph nodes in jupyter to see the number of tasks for that node.
[^2]: In fact I think since the first two arrays are independent, and since https://github.com/tomwhite/cubed/issues/80 was fixed, the first part of the calculation should be able to compute 200 chunks at once.
Coiled uses raw VMs. This means that things are super-cheap, but also that there's a minute-long cold startup time one has to deal with (systems like Modal get around this by having VMs up all the time, but this is expensive and assumes that you're in their region/account. This gets rolled into the 10x cost difference).
Because things have minute-long startup times, the adaptive settings are configured to scale up to something large enough so that it completes in ~5 minutes. If we're going to spend one minute spinning up a VM, we want that VM to be active for at least a couple more minutes before spinning it down.
You can tweak those configurations if you like, but you'll never get sub-minute with Coiled. There's a cost-speed tradeoff here.
https://docs.coiled.io/user_guide/clusters/scale.html#adaptive-scaling adaptive scaling docs if you want to tweak some settings.
I suspect Cubed can probably make better decisions about when to scale in/out though, and just achieve that through manual scale
calls.
I suspect Cubed can probably make better decisions about when to scale in/out though, and just achieve that through manual
scale
calls.
Yes, we could pre-scale to the maximum width of the graph.
I was planning on using this for rechunking a large Zarr dataset (reading 1TB Zarr from S3, writing a new rechunked 1 TB Zarr to S3). For this use case I don't care about super fast autoscaling, do I?
Yes, we could pre-scale to the maximum width of the graph
Cost-wise this might be bad, especially if the tasks run quickly. If you spend a minute spinning up VMs that then just run for a second, then you're spending the vast majority of VM-rental-time doing nothing other than setting up.
I was planning on using this for rechunking a large Zarr dataset (reading 1TB Zarr from S3, writing a new rechunked 1 TB Zarr to S3). For this use case I don't care about super fast autoscaling, do I?
Not at all. This is a case where Coiled's approach of trading slow spin-up time for dirt cheap spot VMs is ideal. You're spending $0.02 per CPU hour vs $0.20 per CPU hour, as long as you're fine waiting a few minutes for everything to complete.
I tried following the instructions to run a Cubed demo using Coiled on AWS following these instructions, and the environment creation went smoothly:
but when I tried running the example:
I got:
I'm raising the issue here as I'm guessing something on the Coiled side changed?
But of course I don't really know so I'm prepared to be told to post it over at cubed... :)