MeanPug / folding-together

Democratizing folding@home (and potentially other networks like rosetta@home)
MIT License

Set up basic supporting infrastructure #3

Closed · steinbachr closed 4 years ago

steinbachr commented 4 years ago

This will be a requirement for #2.

We have two options here, and I'm not yet sure which is the right one:

Option 1

In this option, we back the compute with EC2 resources and an ASG, so we'll need:

Option 2

We back the compute with Fargate resources. A prerequisite here is that we can get folding@home working properly in a container, and that the container can be given enough access to the host's compute resources:

atkinsonm commented 4 years ago

There are Docker images out on DockerHub we could use if we choose to go that route. Here's one: https://hub.docker.com/r/johnktims/folding-at-home/

We need to compare EC2 versus Fargate in terms of billing predictability and the scaling/scheduling engine each provides.
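A minimal sketch of running that image with the host GPUs passed through, using the Docker SDK for Python (the container name and the assumption that nvidia-container-toolkit is installed on the host are mine, not from the image's docs):

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# Start the Folding@home client image linked above, exposing all host GPUs.
# Requires nvidia-container-toolkit on the host; "fah-client" is a hypothetical name.
container = client.containers.run(
    "johnktims/folding-at-home",
    detach=True,
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    name="fah-client",
)
print(container.id, container.status)
```

If a container can be started this way with full access to the host's GPU and CPU, that would satisfy the prerequisite described in Option 2.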

steinbachr commented 4 years ago

Agreed. However, due to time constraints I think we should forge ahead with EC2. If we end up having a day to spare, we can circle back to the infrastructure and run a proper EC2 vs. Fargate comparison.

jkataja commented 4 years ago

This template creates a launch template and an autoscaling group: https://github.com/jkataja/cfn-foldingathome

atkinsonm commented 4 years ago

@jkataja thanks, I referenced your project on #1. What you've done is great and I don't really want to change a thing about it; I just want to design a scaling algorithm around it.
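One hedged sketch of what that scaling algorithm could look like, sizing the auto scaling group from a work-queue backlog (the queue URL, ASG name, and messages-per-instance ratio are hypothetical placeholders, not taken from the template):

```python
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/folding-work"  # hypothetical
ASG_NAME = "folding-at-home-asg"                                             # hypothetical
MESSAGES_PER_INSTANCE = 5
MAX_INSTANCES = 10

def desired_capacity() -> int:
    # Use the approximate queue depth as a proxy for outstanding work.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    # Scale roughly linearly with backlog (ceiling division), capped to keep spend predictable.
    return min(MAX_INSTANCES, -(-backlog // MESSAGES_PER_INSTANCE))

# Resize the group created by the CloudFormation template without touching the template itself.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME,
    DesiredCapacity=desired_capacity(),
    HonorCooldown=False,
)
```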

atkinsonm commented 4 years ago

Here are my ideas and assumptions for MVP:

Other technical features:

steinbachr commented 4 years ago

I am 100% on board. Only one point of clarification: you mean SQS messages, not SNS, yeah?

atkinsonm commented 4 years ago

Yep, typo

atkinsonm commented 4 years ago

Splitting off the last monitoring point to #6

jkataja commented 4 years ago

@atkinsonm glad you find it useful! The easiest way of scaling would be to change the auto scaling group size. It is controlled by a stack parameter, so the simplest approach would be a CloudFormation stack update that changes that parameter.

The template installs everything during instance initialization from the user data script. I did not create an AMI, both to keep the template easy to use in different accounts and to avoid licensing issues with the commercial NVIDIA CUDA drivers and the Folding@home client. If everything is contained within a single account, scaling up would be much faster with a pre-baked AMI that already has all the software installed.
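A sketch of that stack-update approach with boto3 (the stack name and the "InstanceCount" parameter name are hypothetical; the real parameter name comes from the cfn-foldingathome template):

```python
import boto3

cloudformation = boto3.client("cloudformation")

cloudformation.update_stack(
    StackName="folding-at-home",   # hypothetical stack name
    UsePreviousTemplate=True,      # keep the existing template, only change parameters
    Parameters=[
        # Hypothetical parameter name; any other stack parameters would need
        # {"ParameterKey": ..., "UsePreviousValue": True} entries alongside it.
        {"ParameterKey": "InstanceCount", "ParameterValue": "4"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # only needed if the template defines IAM resources
)
```

A stack update goes through CloudFormation's change tracking, which is slower than calling the Auto Scaling API directly but keeps the stack parameters as the single source of truth.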

jkataja commented 4 years ago

Also, the template uses the smallest G4 instance size, g4dn.xlarge, to keep costs low. For eu-north-1, on-demand is $0.5580/hour, with the cheapest spot price at $0.1674/hour. I also assume the Folding@home client is best tested with a single GPU present; I didn't want to deal with multi-GPU issues.
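For reference, a quick way to check the current per-AZ spot price that the figure above would be compared against (assuming Linux/UNIX pricing):

```python
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2", region_name="eu-north-1")

# With StartTime set to "now", only the most recent price per availability zone is returned.
response = ec2.describe_spot_price_history(
    InstanceTypes=["g4dn.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc),
)
for entry in response["SpotPriceHistory"]:
    print(entry["AvailabilityZone"], entry["SpotPrice"])
```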

jkataja commented 4 years ago

Just checked: the GPU task took 2 hours 12 minutes (from 11:43 to 13:55) to run. Edit: the CPU task took 5 hours 28 minutes (from 04:24 to 09:52) to run.

jkataja commented 4 years ago

The software can run anywhere in the world. An optimal solution would take pricing differences into account and run in the cheapest availability zones, possibly even the cheapest regions.
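Extending the same spot-price check across regions gives a rough way to find the cheapest placement (the candidate region list below is just an example subset):

```python
from datetime import datetime, timezone

import boto3

CANDIDATE_REGIONS = ["eu-north-1", "us-east-1", "us-west-2"]  # example subset, not a recommendation

def cheapest_spot(instance_type: str = "g4dn.xlarge"):
    """Return (price, region, availability zone) for the cheapest current spot price."""
    best = None
    for region in CANDIDATE_REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        history = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=datetime.now(timezone.utc),
        )["SpotPriceHistory"]
        for entry in history:
            price = float(entry["SpotPrice"])
            if best is None or price < best[0]:
                best = (price, region, entry["AvailabilityZone"])
    return best

print(cheapest_spot())
```

Whether work should actually move between regions would also depend on factors this sketch ignores, like data transfer costs.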

atkinsonm commented 4 years ago

Marking as a duplicate of #1