ICLDisco / parsec

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, GPU-accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to the cores and GPU accelerators, overlaps communications and computations, and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.

Eviction tasks may be rescheduled for execution relatively soon (as the stream pending queue is relatively small). This will lead to a lot of noise when the GPU is trying to release memory. #689

Open abouteiller opened 1 day ago

abouteiller commented 1 day ago

We need to have a discussion about the right approach here, especially with regard to the task that triggered the need for the memory cleanup. Right now this task remains in the stream queue and will therefore be rescheduled for execution relatively soon (as the stream pending queue is relatively small). This will lead to a lot of noise when the GPU is trying to release memory.

We could instead also evict the task and let the upper-level scheduler handle the case. This will give us a little bit more time to find memory for a task that we know was missing some, but could also change the way tasks are ordered for execution (in the case where we hit memory constraints).
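
To make the two options concrete, here is a minimal sketch of the choice described above. It is not PaRSEC code: `task_t`, `stream_queue_t`, `sched_queue_t`, `oom_policy_t`, and `handle_out_of_memory` are hypothetical names invented for illustration, and the queues are plain fixed-size arrays.

```c
#include <stddef.h>

/* All types below are hypothetical stand-ins, not PaRSEC structures. */
typedef struct { int id; } task_t;

#define STREAM_QUEUE_CAP 4   /* small on purpose: the device stream queue is short */
#define SCHED_QUEUE_CAP  64  /* upper-level scheduler queue, assumed larger */

typedef struct { task_t *slots[STREAM_QUEUE_CAP]; size_t count; } stream_queue_t;
typedef struct { task_t *slots[SCHED_QUEUE_CAP];  size_t count; } sched_queue_t;

typedef enum { KEEP_ON_STREAM, RETURN_TO_SCHEDULER } oom_policy_t;
typedef enum { PLACED_NOWHERE = -1, PLACED_ON_STREAM = 0, PLACED_ON_SCHEDULER = 1 } placement_t;

/* Called for the task whose allocation failed and triggered memory cleanup.
 * KEEP_ON_STREAM models the current behavior (the task is retried soon);
 * RETURN_TO_SCHEDULER models evicting the task to the upper-level scheduler. */
static placement_t handle_out_of_memory(stream_queue_t *stream, sched_queue_t *sched,
                                        task_t *task, oom_policy_t policy)
{
    if (policy == KEEP_ON_STREAM && stream->count < STREAM_QUEUE_CAP) {
        stream->slots[stream->count++] = task;
        return PLACED_ON_STREAM;
    }
    if (sched->count < SCHED_QUEUE_CAP) {
        sched->slots[sched->count++] = task;
        return PLACED_ON_SCHEDULER;
    }
    return PLACED_NOWHERE; /* caller must keep ownership of the task */
}
```

With `RETURN_TO_SCHEDULER`, the task leaves the device entirely, so its position relative to other ready tasks is decided again by the upper-level scheduler, which is where the ordering change mentioned above would come from.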

Originally posted by @bosilca in https://github.com/ICLDisco/parsec/issues/679#issuecomment-2414842983

devreal commented 1 day ago

I'm warming up to that idea. It could allow for some rebalancing, potentially choosing a different device. But since most worker threads are idle we don't win a lot of time really. The task will come back quickly.

bosilca commented 1 day ago

It doesn't have to come back quickly. We have the epoch when the task was evicted, we can define a quiet period for the task during which it is not pushed back onto the same device.
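
A rough sketch of what such a quiet period could look like, assuming the eviction epoch and the device are recorded with the task. The names `task_evict_info_t`, `task_mark_evicted`, `task_may_run_on`, and the `QUIET_EPOCHS` value are hypothetical and are not existing PaRSEC symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task bookkeeping, not PaRSEC's actual task structure. */
typedef struct {
    int      evicted_from;   /* device index the task was evicted from, -1 if never evicted */
    uint64_t evicted_epoch;  /* device epoch recorded at eviction time */
} task_evict_info_t;

/* Assumed tunable: number of epochs the task stays away from the device. */
#define QUIET_EPOCHS 8u

/* Record that the task was evicted from `device` at `epoch`. */
static void task_mark_evicted(task_evict_info_t *info, int device, uint64_t epoch)
{
    info->evicted_from  = device;
    info->evicted_epoch = epoch;
}

/* Return true if the task may be pushed onto `device` at `current_epoch`.
 * Other devices are always allowed; the evicting device is allowed again
 * only once QUIET_EPOCHS epochs have passed (epochs assumed monotonic). */
static bool task_may_run_on(const task_evict_info_t *info, int device,
                            uint64_t current_epoch)
{
    if (info->evicted_from != device) {
        return true;
    }
    return (current_epoch - info->evicted_epoch) >= QUIET_EPOCHS;
}
```

During the quiet period the task remains eligible for any other device, which leaves room for the rebalancing mentioned earlier in the thread.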