Open adamnovak opened 4 years ago
This actually might be a consequence of Flatcar adoption; maybe it has more overhead than the super old CoreOS images we were using.
Nope, I've tested it without my Flatcar change and we still have the same problem: not enough memory gets offered, and we don't notice and bail.
@adamnovak Is there a more discerning way of detecting if an instance has "free" memory to run jobs?
Maybe we could have some sort of reporting system? Like, when a worker first spins up, its first service job sends back its free resources. Toil stows this away and uses it to determine resource allocation going forward.
So like learning how much free memory a fresh instance of whatever type has, when it spins up, instead of just going off the total memory of the instance type? That might be able to work. If we have some way to tie the Mesos offers back to the instances and instance types they came from, maybe we could just read the memory out of the machine's first offer, on the leader.
(In reply to Lon Blauvelt's comment of 5/27/20: https://github.com/DataBiosphere/toil/issues/3063#issuecomment-634875358)
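The "learn it from the first offer" idea could look roughly like the sketch below. This is not Toil's or Mesos's actual API: `OfferedMemoryTracker`, `record_offer`, and `usable_memory` are hypothetical names, and the sketch assumes the leader already has some way to map an offer back to the instance type it came from.

```python
GIB = 1024 ** 3


class OfferedMemoryTracker:
    """Hypothetical helper: remembers how much memory a fresh node of each
    instance type actually offered, falling back to the nominal figure."""

    def __init__(self, nominal_memory):
        # nominal_memory maps instance type name -> advertised memory in bytes.
        self._nominal = dict(nominal_memory)
        self._observed = {}

    def record_offer(self, instance_type, offered_bytes):
        # Keep only the first offer seen for a type. Later offers may be
        # smaller simply because jobs are already running on the node.
        self._observed.setdefault(instance_type, offered_bytes)

    def usable_memory(self, instance_type):
        # Prefer what was actually offered; otherwise trust the nominal size.
        return self._observed.get(instance_type, self._nominal[instance_type])


tracker = OfferedMemoryTracker({"t2.medium": 4 * GIB})
before = tracker.usable_memory("t2.medium")          # nominal 4 GiB
tracker.record_offer("t2.medium", int(2.8 * GIB))    # first offer from a fresh node
tracker.record_offer("t2.medium", int(1.0 * GIB))    # later, busier offer is ignored
after = tracker.usable_memory("t2.medium")
```

The autoscaler would still have to guess for an instance type it has never launched, so something like this would complement a default overhead estimate rather than replace it.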
I had a test Cactus workflow get stuck on AWS autoscaling:
When a t2.medium instance with 4 GB of RAM comes up and connects to Mesos, it can only offer 2.8 GB or so of memory to the Mesos master. The rest must be used by the Mesos agent, the OS, etc.

But Toil's autoscaler thinks a node with 4 GB of physical memory will be sufficient to run a 3.3 GB Cactus job, so it sits there waiting for the node it provisioned to pick up the job.
We need to make the autoscaler account for (more?) memory overhead.
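To make the mismatch concrete, here is a minimal sketch of the check using the numbers above. The function names are hypothetical, not Toil's real autoscaler code, and the 30% overhead is only what this single t2.medium observation implies (2.8 GB offered out of 4 GB); it is an assumption, not a measured constant.

```python
GIB = 1024 ** 3


def usable_memory(total_bytes, overhead_fraction=0.3):
    # Assumed overhead: the fraction of physical memory taken by the OS,
    # the Mesos agent, etc., and so never offered to the scheduler.
    return int(total_bytes * (1 - overhead_fraction))


def node_fits_job(total_bytes, job_bytes, overhead_fraction=0.3):
    # A job fits only if it fits in what the node can actually offer.
    return job_bytes <= usable_memory(total_bytes, overhead_fraction)


node = 4 * GIB
job = int(3.3 * GIB)

naive_fit = node_fits_job(node, job, overhead_fraction=0.0)  # ignores overhead
real_fit = node_fits_job(node, job)                          # accounts for it
```

With no overhead accounted for, the job appears to fit, which is the stuck state described above. With the assumed overhead it does not, so the autoscaler would need to pick a larger instance type or report that no node type can run the job.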
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-554