Open abouteiller opened 1 day ago
I'm warming up to that idea. It could allow for some rebalancing, potentially choosing a different device. But since most worker threads are idle we don't win a lot of time really. The task will come back quickly.
It doesn't have to come back quickly. We have the epoch when the task was evicted, so we can define a quiet period for the task during which it is not pushed back onto the same device.
We need to have a discussion about the right approach here, especially with regard to the task that triggered the need for the memory cleanup. Right now this task remains in the stream queue and will therefore be rescheduled for execution relatively soon (as the stream pending queue is relatively small). This will lead to a lot of noise when the GPU is trying to release memory.
We could instead also evict the task and let the upper-level scheduler handle the case. This will give us a little bit more time to find memory for a task that we know was missing some, but could also change the way tasks are ordered for execution (in the case where we hit memory constraints).
Originally posted by @bosilca in https://github.com/ICLDisco/parsec/issues/679#issuecomment-2414842983