Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

mrjob audit-emr-usage should show bootstrap savings due to cluster sharing #1814

Open coyotemarin opened 6 years ago

coyotemarin commented 6 years ago

Now that Amazon bills by the second rather than the full hour, cluster pooling is not usually a good way to save money. However, it does save you from having to run your bootstrap script (which you have to pay for) again.

When a cluster runs multiple jobs, mrjob audit-emr-usage should track how much time was saved by not having to re-run the bootstrap script, and subtract that from idle time to determine waste (this may be negative, in which case pooling is saving the user money).

This also applies to persistent clusters that people run multiple jobs on manually.
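A rough sketch of the accounting proposed above, not what audit-emr-usage currently does. The function name `pooling_waste` is hypothetical, and it assumes that without sharing, each job would have paid for one full bootstrap on its own cluster, so a cluster that runs N jobs saves N - 1 bootstraps:

```python
from datetime import timedelta


def pooling_waste(bootstrap_time, idle_time, num_jobs):
    """Return idle time minus the bootstrap time saved by sharing a cluster.

    Assumption (not from mrjob): each job would otherwise have bootstrapped
    its own cluster once, so sharing saves (num_jobs - 1) bootstraps.
    A negative result means sharing saved more time than the cluster idled.
    """
    if num_jobs < 1:
        raise ValueError('a cluster must have run at least one job')
    bootstrap_saved = bootstrap_time * (num_jobs - 1)
    return idle_time - bootstrap_saved


if __name__ == '__main__':
    # Three jobs on one cluster, 5-minute bootstrap, 8 minutes idle in total.
    waste = pooling_waste(timedelta(minutes=5), timedelta(minutes=8), 3)
    print(waste.total_seconds())
```

With these numbers the result is negative (10 minutes of bootstrapping avoided versus 8 minutes idle), which is the "pooling is saving the user money" case.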

coyotemarin commented 6 years ago

Shoot, the script currently doesn't distinguish time spent provisioning the cluster (STARTING state) from time spent bootstrapping it. That isn't available from DescribeCluster; maybe there's some other way to get that information?

coyotemarin commented 6 years ago

ListInstances shows the same ReadyDateTime as the cluster.

coyotemarin commented 6 years ago

Okay, it looks like you can call ListInstances, then the EC2 API's DescribeInstances, and look at the LaunchTime of each instance in the cluster. It's probably close enough to consider billing to start at either the last LaunchTime before the cluster's ReadyDateTime or 10 minutes after the cluster's CreationDateTime, whichever comes first.
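Here's a minimal sketch of that heuristic, assuming boto3-style clients. The function names are hypothetical. `instance_launch_times` ignores pagination of ListInstances, and DescribeInstances may fail for instances EC2 no longer knows about, so treat it as illustrative only:

```python
from datetime import datetime, timedelta


def instance_launch_times(emr_client, ec2_client, cluster_id):
    """Collect the EC2 LaunchTime of each instance in an EMR cluster.

    Assumes boto3-style clients; only reads the first page of ListInstances.
    """
    resp = emr_client.list_instances(ClusterId=cluster_id)
    instance_ids = [i['Ec2InstanceId'] for i in resp['Instances']]
    if not instance_ids:
        # An empty InstanceIds list would make EC2 describe every instance.
        return []
    resp = ec2_client.describe_instances(InstanceIds=instance_ids)
    return [inst['LaunchTime']
            for reservation in resp['Reservations']
            for inst in reservation['Instances']]


def estimate_billing_start(creation, ready, launch_times,
                           max_delay=timedelta(minutes=10)):
    """Estimate when billing started for a cluster.

    Takes the last LaunchTime at or before the cluster's ReadyDateTime, or
    CreationDateTime + max_delay, whichever comes first.
    """
    cutoff = creation + max_delay
    before_ready = [t for t in launch_times if t <= ready]
    if not before_ready:
        return cutoff
    return min(max(before_ready), cutoff)


if __name__ == '__main__':
    creation = datetime(2018, 1, 1, 12, 0, 0)
    ready = creation + timedelta(minutes=6)
    launches = [creation + timedelta(minutes=m) for m in (1, 2, 7)]
    start = estimate_billing_start(creation, ready, launches)
    # Approximate bootstrap time: from estimated billing start to ready.
    print(ready - start)
```

The difference between ReadyDateTime and the estimated billing start then serves as an approximation of the time spent bootstrapping.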