DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Provisioner doesn't know about (enough) memory overhead #3063

Open adamnovak opened 4 years ago

adamnovak commented 4 years ago

I had a test Cactus workflow get stuck on AWS autoscaling:

cactus --provisioner aws --nodeTypes t2.medium --maxNodes 2 --batchSystem mesos --binariesMode singularity --clean always aws:us-west-2:anovak-toil-cactus-test examples/evolverMammals.txt examples/evolverMammals.hal --root mr

When a t2.medium instance with 4 GB of RAM comes up and connects to Mesos, it can only offer 2.8 GB or so of memory to the Mesos master. The rest must be used by the Mesos agent, the OS, etc.

But Toil's autoscaler thinks a node with 4 GB of physical memory will be sufficient to run a 3.3 GB Cactus job, so it sits there waiting for the node it provisioned to pick up the job, which never happens.

We need to make the autoscaler account for (more?) memory overhead.
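To make the mismatch concrete, here is a minimal sketch of the fit check with and without an overhead allowance. It is not Toil's actual autoscaler code: `job_fits` and `overhead_fraction` are hypothetical names, and the 0.3 value is only derived from the roughly 2.8 GB of 4 GB offered above.

```python
GIB = 1024 ** 3


def job_fits(job_memory, node_memory, overhead_fraction=0.0):
    """Return True if the job fits in what is left after reserving overhead.

    overhead_fraction is a hypothetical knob: the share of physical memory
    assumed to be taken by the Mesos agent, the OS, etc.
    """
    usable = node_memory * (1.0 - overhead_fraction)
    return job_memory <= usable


node = 4 * GIB           # t2.medium physical memory
job = int(3.3 * GIB)     # the Cactus job's memory requirement

# Going off physical memory alone, the job appears to fit.
print(job_fits(job, node))

# Reserving about 30% (2.8 GB offered out of 4 GB), it does not.
print(job_fits(job, node, overhead_fraction=0.3))
```

With a check like the second one, the autoscaler could notice up front that no t2.medium will ever take the job, instead of waiting on a node that cannot run it.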

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-554

adamnovak commented 4 years ago

This actually might be a consequence of Flatcar adoption; maybe it has more overhead than the super old CoreOS images we were using.

adamnovak commented 4 years ago

Nope, I've tested it without my Flatcar change and we still have the same problem: not enough memory gets offered, and we don't notice and bail.

DailyDreaming commented 4 years ago

@adamnovak Is there a more discerning way of detecting if an instance has "free" memory to run jobs?

Maybe we could have some sort of reporting system? Like, when a worker first spins up, its first service job sends back its free resources. Toil stows this away and uses it to determine resource allocation going forward.

adamnovak commented 4 years ago

So like learning how much free memory a fresh instance of whatever type has, when it spins up, instead of just going off the total memory of the instance type? That might be able to work. If we have some way to tie the Mesos offers back to the instances and instance types they came from, maybe we could just read the memory out of the machine's first offer, on the leader.
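A rough sketch of that bookkeeping on the leader, assuming we can already map an offer back to an instance type. `ObservedMemory` and its methods are hypothetical, not existing Toil or Mesos APIs, and the byte values in the usage are illustrative.

```python
class ObservedMemory:
    """Remember the memory first offered by each instance type (hypothetical helper)."""

    def __init__(self, nominal):
        # nominal: instance type name -> advertised physical memory in bytes
        self.nominal = dict(nominal)
        self.observed = {}

    def record_first_offer(self, instance_type, offered_bytes):
        # Keep only the first offer seen for a type; later offers from a busy
        # node would be smaller because running jobs hold some memory.
        self.observed.setdefault(instance_type, offered_bytes)

    def usable(self, instance_type):
        # Fall back to the advertised memory until an offer has been seen.
        return self.observed.get(instance_type, self.nominal[instance_type])


GIB = 1024 ** 3
tracker = ObservedMemory({"t2.medium": 4 * GIB})
before = tracker.usable("t2.medium")    # advertised 4 GB, nothing learned yet

tracker.record_first_offer("t2.medium", int(2.8 * GIB))
tracker.record_first_offer("t2.medium", int(1.0 * GIB))  # ignored: not the first offer
after = tracker.usable("t2.medium")     # now the roughly 2.8 GB actually offered
```

The catch is that the first node of each type still gets provisioned based on the advertised number, so this only helps from the second scaling decision onward.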
