MeanPug / folding-together

Democratizing folding@home (and potentially other networks like rosetta@home)
MIT License

Set up basic supporting infrastructure #3

Closed · steinbachr closed 4 years ago

steinbachr commented 4 years ago

This will be a requirement for #2.

We have two options here, and I'm not yet sure which is the right one:

Option 1

In this option, we back the compute with EC2 resources and an ASG, so we'll need:

Option 2

We back the compute with Fargate resources. A prerequisite here is that we can get folding@home working properly in a container, and that the container can be given enough access to the host's compute resources:

atkinsonm commented 4 years ago

There are Docker images out on DockerHub we could use if we choose to go that route. Here's one: https://hub.docker.com/r/johnktims/folding-at-home/

We need to compare EC2 versus Fargate in terms of billing predictability and the scaling/scheduling engine each provides.
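A minimal sketch of running that image with the host GPUs passed through, using the Docker SDK for Python (the container name and the assumption that nvidia-container-toolkit is installed on the host are mine, not from the image's docs):

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# Start the Folding@home client image linked above, exposing all host GPUs.
# Requires nvidia-container-toolkit on the host; "fah-client" is a hypothetical name.
container = client.containers.run(
    "johnktims/folding-at-home",
    detach=True,
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    name="fah-client",
)
print(container.id, container.status)
```

If a container can be started this way with full access to the host's GPU and CPU, that would satisfy the prerequisite described in Option 2.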

steinbachr commented 4 years ago

Agreed. However, due to time constraints I think we should forge ahead with EC2. If we end up having a day to spare, we can circle back to the infrastructure and run a proper EC2 vs. Fargate comparison.

jkataja commented 4 years ago

This template creates a launch template and an autoscaling group: https://github.com/jkataja/cfn-foldingathome

atkinsonm commented 4 years ago

@jkataja thanks, I referenced your project on #1. What you've done is great and I don't really want to change a thing about it; I just want to design a scaling algorithm around it.
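One hedged sketch of what that scaling algorithm could look like, sizing the auto scaling group from a work-queue backlog (the queue URL, ASG name, and messages-per-instance ratio are hypothetical placeholders, not taken from the template):

```python
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/folding-work"  # hypothetical
ASG_NAME = "folding-at-home-asg"                                             # hypothetical
MESSAGES_PER_INSTANCE = 5
MAX_INSTANCES = 10

def desired_capacity() -> int:
    # Use the approximate queue depth as a proxy for outstanding work.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    # Scale roughly linearly with backlog (ceiling division), capped to keep spend predictable.
    return min(MAX_INSTANCES, -(-backlog // MESSAGES_PER_INSTANCE))

# Resize the group created by the CloudFormation template without touching the template itself.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME,
    DesiredCapacity=desired_capacity(),
    HonorCooldown=False,
)
```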

atkinsonm commented 4 years ago

Here are my ideas and assumptions for MVP:

Other technical features:

steinbachr commented 4 years ago

I am 100% on board. Only one point of clarification: you mean SQS messages, not SNS, yeah?

atkinsonm commented 4 years ago

Yep, typo

atkinsonm commented 4 years ago

Splitting off the last monitoring point to #6

jkataja commented 4 years ago

@atkinsonm glad you find it useful! The easiest way of scaling would be to change the auto scaling group size. It is controlled by a stack parameter, so the simplest approach would be a CloudFormation stack update that changes that parameter.

The template installs everything during instance initialization from the user data script. I did not create an AMI, both to keep the template easy to use in different accounts and to avoid licensing issues with the commercial NVIDIA CUDA drivers and the Folding@home client. If everything is contained within a single account, scaling up would be much faster with a pre-baked AMI that already has all the software installed.
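A sketch of that stack-update approach with boto3 (the stack name and the "InstanceCount" parameter name are hypothetical; the real parameter name comes from the cfn-foldingathome template):

```python
import boto3

cloudformation = boto3.client("cloudformation")

cloudformation.update_stack(
    StackName="folding-at-home",   # hypothetical stack name
    UsePreviousTemplate=True,      # keep the existing template, only change parameters
    Parameters=[
        # Hypothetical parameter name; any other stack parameters would need
        # {"ParameterKey": ..., "UsePreviousValue": True} entries alongside it.
        {"ParameterKey": "InstanceCount", "ParameterValue": "4"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # only needed if the template defines IAM resources
)
```

A stack update goes through CloudFormation's change tracking, which is slower than calling the Auto Scaling API directly but keeps the stack parameters as the single source of truth.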

jkataja commented 4 years ago

Also, the template uses the smallest G4 instance size, g4dn.xlarge, to keep costs low. For eu-north-1, on-demand is $0.5580/hour, with the cheapest spot price at $0.1674/hour. I also assume the Folding@home client is best tested with a single GPU present; I didn't want to deal with multi-GPU issues.
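For reference, a quick way to check the current per-AZ spot price that the figure above would be compared against (assuming Linux/UNIX pricing):

```python
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2", region_name="eu-north-1")

# With StartTime set to "now", only the most recent price per availability zone is returned.
response = ec2.describe_spot_price_history(
    InstanceTypes=["g4dn.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc),
)
for entry in response["SpotPriceHistory"]:
    print(entry["AvailabilityZone"], entry["SpotPrice"])
```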

jkataja commented 4 years ago

Just checked: the GPU task took 2 hours 12 minutes (from 11:43 to 13:55) to run. Edit: the CPU task took 5 hours 28 minutes (from 04:24 to 09:52) to run.

jkataja commented 4 years ago

The software can run anywhere in the world. An optimal solution would take pricing differences into account and run in the cheapest availability zones, possibly even the cheapest regions.
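Extending the same spot-price check across regions gives a rough way to find the cheapest placement (the candidate region list below is just an example subset):

```python
from datetime import datetime, timezone

import boto3

CANDIDATE_REGIONS = ["eu-north-1", "us-east-1", "us-west-2"]  # example subset, not a recommendation

def cheapest_spot(instance_type: str = "g4dn.xlarge"):
    """Return (price, region, availability zone) for the cheapest current spot price."""
    best = None
    for region in CANDIDATE_REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        history = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=datetime.now(timezone.utc),
        )["SpotPriceHistory"]
        for entry in history:
            price = float(entry["SpotPrice"])
            if best is None or price < best[0]:
                best = (price, region, entry["AvailabilityZone"])
    return best

print(cheapest_spot())
```

Whether work should actually move between regions would also depend on factors this sketch ignores, like data transfer costs.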

atkinsonm commented 4 years ago

Marking as a duplicate of #1