dask / dask-ec2

Start a cluster in EC2 for dask.distributed

Consider cheaper EC2 instance type default value #36

Open deeplook opened 7 years ago

deeplook commented 7 years ago

Since it is quite easy to forget to destroy an existing cluster (see #35), I think it is important to default to an EC2 instance type that is much cheaper than m3.2xlarge. Otherwise, experimenting with dask-ec2 can lead to an unexpectedly large AWS invoice, as happened to me. In my case this was ca. 96 CPU hours and $60 per day! :-(

This (and other values) should also be listed in the generated cluster.yml file, so that there is a record of them without having to go to the AWS online console.

mrocklin commented 7 years ago

Oof, sorry to hear about the unexpected bill.

There is a trade-off here: the default nodes should still be of a relevant size for computing. An m3.2xlarge is about the size of a modern laptop, so it's nice to have a few of them to get a sense of scale. I can't think of any full solutions, but there are probably some things that could help:

  1. Visibly publish an expected hourly or daily cost. This requires keeping a copy of Amazon's costs-per-node within the project, which will inevitably go stale, but is still probably handy.
  2. Is it possible to launch an EC2 node with a time-to-live?
  3. Is it possible to add an ability to query if the cluster in the cluster.yaml file is still active?

Other thoughts?

deeplook commented 7 years ago

@mrocklin Ad 1: Sadly, it seems that the AWS APIs provide pricing info only for spot price history. I've found this online page for on-demand pricing (which probably applies here): https://aws.amazon.com/ec2/pricing/on-demand/ Other pricing pages can be found here: https://aws.amazon.com/ec2/pricing/

I tried scraping it and eventually got it mostly working, but only with Selenium (not with Pandas, Lxml or BeautifulSoup; maybe there's too much JS in the page). Selenium is likely not a great dependency to have for dask-ec2, especially in combination with PhantomJS as a headless browser. Then again, one would not expect this pricing info to change a lot, so hardcoding a current pricing table (ignoring data transfer pricing) with every build of dask-ec2 might be an idea. Since pricing differs between AWS regions, one would need one table per region... I get something like this for Linux instances in us-east-1 right now:

| type | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | Linux/UNIX Usage |
|---|---|---|---|---|---|
| t2.nano | 1 | Variable | 0.5 | EBS Only | $0.0065 per Hour |
| t2.micro | 1 | Variable | 1 | EBS Only | $0.013 per Hour |
| t2.small | 1 | Variable | 2 | EBS Only | $0.026 per Hour |
| t2.medium | 2 | Variable | 4 | EBS Only | $0.052 per Hour |
| t2.large | 2 | Variable | 8 | EBS Only | $0.104 per Hour |
| m4.large | 2 | 6.5 | 8 | EBS Only | $0.12 per Hour |
| m4.xlarge | 4 | 13 | 16 | EBS Only | $0.239 per Hour |
| m4.2xlarge | 8 | 26 | 32 | EBS Only | $0.479 per Hour |
| m4.4xlarge | 16 | 53.5 | 64 | EBS Only | $0.958 per Hour |
| m4.10xlarge | 40 | 124.5 | 160 | EBS Only | $2.394 per Hour |
| m4.16xlarge | 64 | 188 | 256 | EBS Only | $3.83 per Hour |
| m3.medium | 1 | 3 | 3.75 | 1 x 4 SSD | $0.067 per Hour |
| m3.large | 2 | 6.5 | 7.5 | 1 x 32 SSD | $0.133 per Hour |
| m3.xlarge | 4 | 13 | 15 | 2 x 40 SSD | $0.266 per Hour |
| m3.2xlarge | 8 | 26 | 30 | 2 x 80 SSD | $0.532 per Hour |
| c4.large | 2 | 8 | 3.75 | EBS Only | $0.105 per Hour |
| c4.xlarge | 4 | 16 | 7.5 | EBS Only | $0.209 per Hour |
| c4.2xlarge | 8 | 31 | 15 | EBS Only | $0.419 per Hour |
| c4.4xlarge | 16 | 62 | 30 | EBS Only | $0.838 per Hour |
| c4.8xlarge | 36 | 132 | 60 | EBS Only | $1.675 per Hour |
| c3.large | 2 | 7 | 3.75 | 2 x 16 SSD | $0.105 per Hour |
| c3.xlarge | 4 | 14 | 7.5 | 2 x 40 SSD | $0.21 per Hour |
| c3.2xlarge | 8 | 28 | 15 | 2 x 80 SSD | $0.42 per Hour |
| c3.4xlarge | 16 | 55 | 30 | 2 x 160 SSD | $0.84 per Hour |
| c3.8xlarge | 32 | 108 | 60 | 2 x 320 SSD | $1.68 per Hour |
| p2.xlarge | 4 | 12 | 61 | EBS Only | $0.9 per Hour |
| p2.8xlarge | 32 | 94 | 488 | EBS Only | $7.2 per Hour |
| p2.16xlarge | 64 | 188 | 732 | EBS Only | $14.4 per Hour |
| g2.2xlarge | 8 | 26 | 15 | 60 SSD | $0.65 per Hour |
| g2.8xlarge | 32 | 104 | 60 | 2 x 120 SSD | $2.6 per Hour |
| x1.16xlarge | 64 | 174.5 | 976 | 1 x 1920 SSD | $6.669 per Hour |
| x1.32xlarge | 128 | 349 | 1952 | 2 x 1920 SSD | $13.338 per Hour |
| r3.large | 2 | 6.5 | 15 | 1 x 32 SSD | $0.166 per Hour |
| r3.xlarge | 4 | 13 | 30.5 | 1 x 80 SSD | $0.333 per Hour |
| r3.2xlarge | 8 | 26 | 61 | 1 x 160 SSD | $0.665 per Hour |
| r3.4xlarge | 16 | 52 | 122 | 1 x 320 SSD | $1.33 per Hour |
| r3.8xlarge | 32 | 104 | 244 | 2 x 320 SSD | $2.66 per Hour |
| i2.xlarge | 4 | 14 | 30.5 | 1 x 800 SSD | $0.853 per Hour |
| i2.2xlarge | 8 | 27 | 61 | 2 x 800 SSD | $1.705 per Hour |
| i2.4xlarge | 16 | 53 | 122 | 4 x 800 SSD | $3.41 per Hour |
| i2.8xlarge | 32 | 104 | 244 | 8 x 800 SSD | $6.82 per Hour |
| d2.xlarge | 4 | 14 | 30.5 | 3 x 2000 HDD | $0.69 per Hour |
| d2.2xlarge | 8 | 28 | 61 | 6 x 2000 HDD | $1.38 per Hour |
| d2.4xlarge | 16 | 56 | 122 | 12 x 2000 HDD | $2.76 per Hour |
| d2.8xlarge | 36 | 116 | 244 | 24 x 2000 HDD | $5.52 per Hour |

I could contribute a code snippet (after finalising it) which you could include in your build process, if that is the route you decide to take...

mrocklin commented 7 years ago

I suspect we could also do a decent job with just a static (stale) copy of this information.
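
A minimal sketch of what such a static copy might look like, assuming a hardcoded dictionary of on-demand prices copied from the us-east-1 table earlier in this thread. The names `STATIC_HOURLY_PRICE_USD` and `estimate_cost` are hypothetical and not part of dask-ec2, and the prices will go stale over time:

```python
# Hypothetical static (and eventually stale) copy of a few on-demand
# Linux prices for us-east-1, in USD per hour, taken from the table
# posted in this thread. Not an official or current price list.
STATIC_HOURLY_PRICE_USD = {
    "t2.medium": 0.052,
    "m4.large": 0.12,
    "m3.xlarge": 0.266,
    "m3.2xlarge": 0.532,
    "c4.2xlarge": 0.419,
}


def estimate_cost(instance_type, count, hours=24.0):
    """Estimate the on-demand cost in USD of running `count` nodes
    of `instance_type` for `hours` hours (data transfer is ignored)."""
    try:
        hourly = STATIC_HOURLY_PRICE_USD[instance_type]
    except KeyError:
        raise ValueError("no static price known for %r" % instance_type)
    return hourly * count * hours


if __name__ == "__main__":
    # Example: what a small cluster of the current default type costs per day.
    cost = estimate_cost("m3.2xlarge", count=4, hours=24)
    print("4 x m3.2xlarge for 24 h: about $%.2f" % cost)
```

A figure like this could be printed when a cluster is launched, so the expected hourly or daily cost is visible up front.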

quasiben commented 7 years ago

I'd suggest looking at, scraping, and storing the data from: http://www.ec2instances.info/

deeplook commented 7 years ago

@quasiben Ah, much easier! Especially since it works OK with pandas.read_html() ;-)

BTW, Ad 2 (Is it possible to launch an EC2 node with a time-to-live?): a friend of mine suggested, as an option, setting up a cron script right after the instance is built that shuts it down after a given TTL.
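
A rough sketch of the TTL idea, under some assumptions: instead of a cron script it uses the delay argument of the standard Linux `shutdown` command as a simpler stand-in, and the helper name `ttl_shutdown_command` is hypothetical, not something dask-ec2 provides. The resulting command would have to be run on each node right after provisioning. Whether an OS-level shutdown stops or terminates the instance depends on its instance-initiated shutdown behavior setting in EC2.

```python
def ttl_shutdown_command(ttl_hours):
    """Build a shell command that halts the machine after `ttl_hours`.

    Hypothetical helper: it only builds the command string; running it
    on the node (e.g. over SSH after provisioning) is left to the caller.
    """
    minutes = int(round(ttl_hours * 60))
    if minutes <= 0:
        raise ValueError("ttl_hours must correspond to at least one minute")
    # `shutdown -h +N` schedules a halt N minutes from now on Linux.
    return "sudo shutdown -h +%d" % minutes


if __name__ == "__main__":
    # Example: schedule a shutdown eight hours after the node comes up.
    print(ttl_shutdown_command(8))
```

A scheduled shutdown can still be cancelled on the node with `sudo shutdown -c` if the cluster turns out to be needed for longer.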