WIPACrepo / iceprod

IceCube dataset management system
MIT License

task scheduling and the packing problem #238

Open dsschult opened 7 years ago

dsschult commented 7 years ago

We're leaving a lot of resources idle right now because of packing inefficiency. In the GPU case, where it matters most, we have too many high memory jobs.

Example:

pilot announces: 10 cpu, 8 gpu, 40GB memory
new task: 1 cpu, 1 gpu, 25GB memory
pilot announces: 9 cpu, 7 gpu, 15GB memory
new task: 1 cpu, 1 gpu, 12GB memory
pilot announces: 8 cpu, 6 gpu, 3GB memory
match failure - no new task

We just wasted 6 gpus.

A better packing might be:

pilot announces: 10 cpu, 8 gpu, 40GB memory
new task: 1 cpu, 1 gpu, 12GB memory
new task: 1 cpu, 1 gpu, 4GB memory
new task: 1 cpu, 1 gpu, 4GB memory
new task: 1 cpu, 1 gpu, 4GB memory
new task: 1 cpu, 1 gpu, 4GB memory
new task: 1 cpu, 1 gpu, 4GB memory
new task: 1 cpu, 1 gpu, 4GB memory
new task: 1 cpu, 1 gpu, 4GB memory
pilot announces: 2 cpu, 0 gpu, 0GB memory
match failure - no new task

This packing uses all the GPUs and all the memory, so it is much more efficient. However, the 25GB task never gets to run, so this approach has problems of its own.

One thing that may help significantly is allocating multiple tasks at once, instead of one by one: #237. That would allow proper packing of a single pilot, though packing across multiple pilots and over time would still need work.
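To illustrate, here is a minimal sketch (not iceprod code; the dict-based resource shapes and the guard condition are assumptions) of allocating multiple tasks to one pilot in a single pass, refusing any slot that would leave the remaining GPUs without enough memory:

```python
# Hypothetical sketch: pack one pilot with several tasks in a single pass.
# Resource shapes and the guard condition are illustrative, not iceprod's API.

def pack_pilot(pilot, tasks):
    """Greedily assign tasks to a pilot, refusing any task that would
    starve the remaining GPUs of memory."""
    if not tasks:
        return []
    cpu, gpu, memory = pilot['cpu'], pilot['gpu'], pilot['memory']
    min_memory = min(t['memory'] for t in tasks)  # smallest queued task
    assigned = []
    # Consider memory-hungry tasks first so big slots are carved early (or not at all).
    for task in sorted(tasks, key=lambda t: t['memory'], reverse=True):
        if task['cpu'] > cpu or task['gpu'] > gpu:
            continue
        gpus_left = gpu - task['gpu']
        memory_left = memory - task['memory']
        # Guard: the leftover memory must still be able to feed every remaining GPU.
        if memory_left < gpus_left * min_memory:
            continue
        assigned.append(task)
        cpu, gpu, memory = cpu - task['cpu'], gpus_left, memory_left
    return assigned


pilot = {'cpu': 10, 'gpu': 8, 'memory': 40}
queue = ([{'cpu': 1, 'gpu': 1, 'memory': 25},
          {'cpu': 1, 'gpu': 1, 'memory': 12}] +
         [{'cpu': 1, 'gpu': 1, 'memory': 4} for _ in range(7)])
print(pack_pilot(pilot, queue))
```

On the example above this skips the 25GB task (carving it would leave only 15GB for 7 GPUs) and packs the 12GB task plus seven 4GB tasks, using all 8 GPUs, matching the "better packing" shown.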

gonzalomerino commented 7 years ago

I believe there are many possibilities for how to do this matching. One option I could see: start by assuming that our highest-value resource is GPUs; then, for the example above, the pilot would know that the slot it is managing has 8 GPUs with an average of 5GB of memory per GPU.

In the matching algorithm, I think the pilot should be reluctant to carve out slots with much more memory than that average. The first carved slot could be allowed to go above average, but not the rest. That is pretty much what you have shown in your example above.

The tricky part I see here is that we probably want some capability for "configuring" this partitioning policy... for instance, if we decide high-memory jobs are important, we want to ensure they do get executed... or the opposite...
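A rough sketch of that carving policy (the function and the `over_average_factor` knob are assumed names for illustration, not actual pilot code): compute the average memory per GPU, and after the first carved slot, refuse any slot that asks for more than the average times a configurable factor.

```python
# Hypothetical sketch of the "reluctant to go above average" carving policy.
# The over_average_factor knob is the kind of configuration discussed above.

def carve_slots(pilot, tasks, over_average_factor=1.0):
    """Carve per-task slots out of a pilot; only the first slot may exceed
    the average memory per GPU (scaled by over_average_factor)."""
    avg_mem_per_gpu = pilot['memory'] / pilot['gpu']   # e.g. 40GB / 8 = 5GB
    gpu, memory = pilot['gpu'], pilot['memory']
    carved = []
    for task in tasks:
        if task['gpu'] > gpu or task['memory'] > memory:
            continue
        above_average = task['memory'] > avg_mem_per_gpu * over_average_factor
        if above_average and carved:        # only the first slot may go big
            continue
        carved.append(task)
        gpu -= task['gpu']
        memory -= task['memory']
    return carved
```

With the 40GB / 8 GPU pilot above, the average is 5GB per GPU, so only the first carved slot could request more than that (or more than 5GB times the factor, if configured higher).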

With this "partitionable iceprod pilot" route you have chosen to take on the matchmaking job of the negotiator... tough job! :-)

dsschult commented 7 years ago

Yeah. I didn't really want to do the matchmaking job in iceprod, but I needed to get around the limitations of condor, and that is what led to partitionable pilots.

I'm currently thinking of implementing a variation on fair queueing, like the Linux scheduler uses. This is probably the easiest general way to solve the problem.

dsschult commented 7 years ago

Whiteboard solution:

multi-level queues:

scheduling priority within each queue (lowest priority wins):

priority = max(0, req_cpu + req_gpu + req_memory + req_disk + req_time - wait_sec//600)

where wait_sec is the time in seconds since a task was queued

So, when a new resource becomes available, tasks are chosen from the queues in priority order.
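A minimal sketch of that priority calculation and queue walk (the field names, the resource-offer shape, and the queue levels are assumptions for illustration; units mirror the formula above):

```python
import time

# Hypothetical sketch of the whiteboard scheme: tasks live in multi-level
# queues, and within a queue the lowest computed priority wins.

def priority(task, now=None):
    """Lower is better; every 10 minutes of waiting knocks one point off."""
    now = now if now is not None else time.time()
    wait_sec = int(now - task['queued_at'])
    return max(0, task['req_cpu'] + task['req_gpu'] + task['req_memory']
                  + task['req_disk'] + task['req_time'] - wait_sec // 600)

def fits(task, offer):
    # disk/time checks would look the same; omitted for brevity
    return (task['req_cpu'] <= offer['cpu'] and task['req_gpu'] <= offer['gpu']
            and task['req_memory'] <= offer['memory'])

def next_task(queues, offer):
    """Walk the queue levels in order; within a level pick the
    lowest-priority task that fits the offered resources."""
    for level in queues:                       # e.g. [high, normal, low]
        candidates = [t for t in level if fits(t, offer)]
        if candidates:
            return min(candidates, key=priority)
    return None
```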

dsschult commented 7 years ago

This seems to perform "better", though the problem is still not completely solved. Memory is still a limiting factor, especially memory fragmentation.

#239 - memory oversubscription - may be what is really needed, since we apparently suck at predicting requirements (or real usage spikes).
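For a sense of what oversubscription could mean in the matching step, a hedged sketch (the factor and the function are made up for illustration, not the design in #239): let a task match even if the memory already committed plus its request exceeds the pilot's nominal memory, up to a configurable factor.

```python
# Hypothetical illustration of memory oversubscription in matching: bet that
# requests are over-estimates and real usage rarely peaks on all tasks at once.

def memory_ok(task, pilot_total_memory, committed_memory, factor=1.5):
    """committed_memory is the memory already promised to running tasks;
    factor > 1.0 allows the pilot to be oversubscribed."""
    return committed_memory + task['req_memory'] <= pilot_total_memory * factor
```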