Open TomNicholas opened 9 months ago
Duplicate of #219?
I had forgotten about #219, but actually I don't think this is a duplicate - I'm suggesting warning users of estimated costs before execution, whereas #219 seems to be about actual cost after execution. Though I imagine much of the code could be reused when calculating both numbers.
Basically I think it would be useful for users to be able to see "hang on, this isn't supposed to cost that much, maybe I've not expressed the analysis I meant to..." before they actually waste that money.
Sounds good - let's keep both open.
Cubed arguably has enough information to give a rough estimate of the monetary cost of executing the plan before starting execution.
I'm imagining a new method `.estimate_cost(executor)` that is similar to `.compute(executor)`. Calling this we would know the:
- `Plan` object,
- `Executor` passed,
- `Spec` object.

It would just print an estimate of the cost back to the user without running anything, and maybe raise warnings if they are planning to do something that seems obviously expensive (e.g. having their temporary bucket for intermediate data be in AWS but their executor be GCF).
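To make the "obviously expensive" warning concrete, here is a minimal sketch of the cross-cloud check. Everything in it is hypothetical: `cross_cloud_mismatch`, `SCHEME_TO_CLOUD`, and the `"aws"`/`"gcp"` labels are names invented for illustration, not part of Cubed, and it assumes the intermediate storage location is given as a URL such as `s3://...` or `gs://...`.

```python
import warnings
from urllib.parse import urlparse

# Hypothetical mapping from storage URL scheme to cloud provider.
SCHEME_TO_CLOUD = {"s3": "aws", "gs": "gcp"}


def cross_cloud_mismatch(work_dir: str, executor_cloud: str) -> bool:
    """Warn and return True if intermediate storage and executor are in different clouds."""
    storage_cloud = SCHEME_TO_CLOUD.get(urlparse(work_dir).scheme)
    if storage_cloud is None:
        # Local or unrecognised storage: nothing to compare against.
        return False
    if storage_cloud != executor_cloud:
        warnings.warn(
            f"Intermediate data is stored in {storage_cloud!r} but the executor "
            f"runs in {executor_cloud!r}; cross-cloud transfer may be expensive."
        )
        return True
    return False


# Temporary bucket in AWS, executor on Google Cloud Functions: triggers the warning.
cross_cloud_mismatch("s3://my-temp-bucket/intermediate", "gcp")
```

A real version would presumably read the storage location from the `Spec` and the cloud from the `Executor` rather than taking strings.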
This means if we had a little table somewhere of e.g. AWS Lambda and S3 prices, Cubed could consult those numbers and sum them. It would require an idea of e.g. how long it takes to run `np.mean()` on a chunk of a certain size on a certain container, but this seems like something that can be discovered fairly straightforwardly.

Obviously there is a long tail of cases where this wouldn't work, but often you might still be able to provide a lower-bound cost estimate. For example, if your plan had a step that applied some arbitrary function with `apply_gufunc`, Cubed would not know if that was some super expensive function that would run forever, but it would still be possible to estimate the minimum cost assuming that the function was very light.
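As a rough sketch of how the price table and the lower-bound idea could fit together: the `Prices` and `Step` classes and the `estimate_cost` function below are hypothetical and do not reflect Cubed's actual plan structure. The prices are made-up placeholders, not real AWS figures. A step whose per-task runtime is unknown (e.g. an arbitrary function passed to `apply_gufunc`) is charged only the per-request fee, and the result is flagged as a lower bound.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prices:
    """Hypothetical price table for a serverless executor."""
    per_gb_second: float  # compute price per GB of memory per second
    per_request: float    # flat price per task invocation


@dataclass
class Step:
    """Hypothetical summary of one step in a plan."""
    name: str
    num_tasks: int
    memory_gb: float
    seconds_per_task: Optional[float]  # None when the runtime is unknown


def estimate_cost(steps: list[Step], prices: Prices) -> tuple[float, bool]:
    """Return (estimated cost, is_lower_bound) without running anything."""
    total = 0.0
    is_lower_bound = False
    for step in steps:
        # Every task costs at least one invocation.
        total += step.num_tasks * prices.per_request
        if step.seconds_per_task is None:
            # Unknown runtime: assume the function is very light, so the
            # overall figure is only a minimum.
            is_lower_bound = True
            continue
        total += (
            step.num_tasks
            * step.seconds_per_task
            * step.memory_gb
            * prices.per_gb_second
        )
    return total, is_lower_bound


# Placeholder prices and timings, chosen only to keep the arithmetic simple.
prices = Prices(per_gb_second=0.5, per_request=0.25)
steps = [
    Step("mean", num_tasks=4, memory_gb=2.0, seconds_per_task=1.0),
    Step("apply_gufunc", num_tasks=4, memory_gb=2.0, seconds_per_task=None),
]
cost, lower_bound = estimate_cost(steps, prices)
print(cost, lower_bound)  # 6.0 True
```

Storage and data-transfer charges are left out here; they could be added as further terms in the same sum.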