dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure and more...
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

Training/Billable seconds #120

Open rpanai opened 4 years ago

rpanai commented 4 years ago

With SageMaker, the following two lines are printed at the end of training:
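For example (illustrative values, not from an actual job):

Training seconds: 120
Billable seconds: 120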

I think it would be nice to have something similar after we run a computation with Dask. What I mean is:
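Something in this spirit (purely hypothetical output; nothing in Dask prints this today):

Compute wall time: 45 s
Worker seconds: 225
Estimated cost: $0.02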

jacobtomlinson commented 4 years ago

Thanks for raising this @rpanai.

This project grew from this notebook that I created a few years ago. The original notebook included a cost estimate section based on Fargate metrics.

Fargate costs are easier to estimate because the service is billed per second, the API gives exact billing times and costs, and there is no multitenancy.
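For instance, billable time per Fargate task can be derived from the ECS DescribeTasks timestamps. A minimal sketch, assuming you already know the cluster name and task ARNs (the helper name is mine; stoppedAt is only present on stopped tasks):

import boto3

ecs = boto3.client("ecs")


def fargate_billable_seconds(cluster, task_arns):
    """Sum billable seconds for stopped Fargate tasks.

    Fargate bills per second from image pull start until the task stops.
    """
    tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]
    return sum(
        (t["stoppedAt"] - t.get("pullStartedAt", t["startedAt"])).total_seconds()
        for t in tasks
    )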

When using a service like ECS, where you are billed for container instances independently, it is much harder to provide cost information. When packing work onto instances there will be wastage; do we include that in our estimates or not?

I decided not to include this feature initially because it felt complex to implement consistently and I didn't want to give users false information. However, perhaps for services like Fargate we could introduce it.

rpanai commented 4 years ago

Hi @jacobtomlinson, I understand your point. I'd say that where the estimate is easy/reliable, it would be a nice thing to have. Let me know if I can help somehow.

jacobtomlinson commented 4 years ago

If you want to raise a PR where you take the cost estimate logic from the notebook I linked and add it to the FargateCluster class as a method with a name like estimate_cost(), I think that would be great.
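A rough sketch of what such a method could look like. The _worker_cpu / _worker_mem / _start_time attribute names and the pricing constants are placeholders for illustration, not the actual FargateCluster internals (scheduler_info is the real distributed cluster attribute):

import time

# Illustrative us-east-1 Fargate rates; real prices vary by region and change.
VCPU_PER_HOUR = 0.04048  # USD per vCPU-hour
GB_PER_HOUR = 0.004445   # USD per GB-hour


def estimate_cost(self):
    """Rough cost of the current workers since the cluster started."""
    hours = (time.time() - self._start_time) / 3600
    n_workers = len(self.scheduler_info["workers"])
    vcpus = n_workers * self._worker_cpu / 1024   # CPU units: 1024 = 1 vCPU
    mem_gb = n_workers * self._worker_mem / 1024  # assumed to be in MiB
    return hours * (vcpus * VCPU_PER_HOUR + mem_gb * GB_PER_HOUR)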

rpanai commented 1 year ago

Hi Jacob, I finally tried to work on this and found it doesn't work as I wished/expected, in particular when I'm using an adaptive cluster. I think this is something that should be moved to dask.

My final goal is to have some cost estimate for every single run. For example, if I have an adaptive cluster and I connect to it from 2 different scripts, I'd like to know how much each given operation costs.

Let's say I have:

import time

import dask.bag as db


def fun(x):
    # Simulate one second of work per element
    time.sleep(1)
    return x ** 2


npartitions = 5
b = db.from_sequence(
    list(range(200)),
    npartitions=npartitions,
).map(fun)
out = b.compute()

I'd like to know how much this costs. But here I'm not sure whether db somehow knows how many workers I am using and for how long.

Do you think it's possible to achieve something in this direction?
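One crude client-side approximation, assuming a static cluster (adaptive scaling is exactly what breaks it); client.scheduler_info() is the real distributed API, the rest is a sketch:

import time

from dask.distributed import Client

client = Client(cluster)  # cluster: e.g. an existing FargateCluster

t0 = time.time()
out = b.compute()
elapsed = time.time() - t0

# Assumes every worker was alive for the whole call -- only true
# when the cluster is not scaling during the computation.
n_workers = len(client.scheduler_info()["workers"])
worker_seconds = elapsed * n_workers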

jacobtomlinson commented 1 year ago

The scheduler should know this information, rather than db. I wonder if we could capture that and estimate the costs from it?

rpanai commented 1 year ago

Who do you think is the best person to ask? I made a decorator to get the duration and max RAM usage of a function but, as you said, the scheduler has all this information.
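Something along these lines (a simplified sketch; the names are illustrative, and tracemalloc tracks Python allocations rather than true RSS):

import functools
import time
import tracemalloc


def profiled(func):
    """Report wall time and peak (Python-level) memory of a single call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        t0 = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration = time.perf_counter() - t0
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print(f"{func.__name__}: {duration:.2f}s, peak {peak / 1e6:.1f} MB")
    return wrapper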

jacobtomlinson commented 1 year ago

I recommend you explore the performance_report code in distributed because that records a lot of what is going on in the cluster. That could be a good place to get the value for $T$.
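For example, the task stream that performance_report is built on can be captured directly with distributed's get_task_stream; summing the compute spans gives a worker-busy-seconds figure (a sketch):

from distributed import get_task_stream

with get_task_stream() as ts:
    out = b.compute()

# Each record's "startstops" lists per-action {action, start, stop} spans.
busy_seconds = sum(
    ss["stop"] - ss["start"]
    for rec in ts.data
    for ss in rec["startstops"]
    if ss["action"] == "compute"
)
print(f"Total worker-busy compute time: {busy_seconds:.1f}s")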