ICLDisco / parsec

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, GPU-accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to the cores and GPU accelerators, overlaps communications and computations, and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.

Support for oversubscription broken #601

Open devreal opened 10 months ago

devreal commented 10 months ago

Describe the bug

We see the following warnings (followed by asserts in debug mode) when memory on the device is tight:

W@00000 GPU[hip(0)]:    Write access to data copy 0x7fbe35bdbb10 [ref_count 1] with existing readers [1024] (possible anti-dependency,
or concurrent accesses), please prevent that with CTL dependencies

The 1024 is suspicious and points us to #575. @therault and I found that the rollback of the CAS is wrong. The CAS is done on an element that we are about to abandon, and it is only there to block someone else from taking the element. There is no need to roll back the CAS.

Once we have released the LRU element, we go back to malloc_data. Now there is a pretty good chance that the zone_alloc succeeds. We still have PARSEC_CUDA_DATA_COPY_ATOMIC_SENTINEL as copy_readers_update, which will then be applied to the gpu_elem at the end.

I think it's safe to remove everything to do with copy_readers_update (i.e., the fetch-and-op and all places where we set it) as the readers field in the final gpu_elem does not need to be adjusted.
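To make the suspected sequence concrete, here is a minimal C11 sketch of it. This is not the PaRSEC code: the type, the function names, and the sentinel value of 1024 (picked only because it matches the reader count in the warning) are all assumptions for illustration, and the two passes through malloc_data are flattened into straight-line code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed value, chosen only to match the reader count in the warning. */
#define SKETCH_SENTINEL 1024

/* Hypothetical stand-in for a GPU data copy; only the readers count matters here. */
typedef struct {
    atomic_int readers;
} sketch_copy_t;

/* Block an idle element by swapping its readers count from 0 to the sentinel. */
static bool sketch_block_copy(sketch_copy_t *copy)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&copy->readers, &expected, SKETCH_SENTINEL);
}

/* Replays the sequence described above and returns the readers count of the
 * element that is finally handed out. With apply_pending_update == true the
 * stale update is applied at the end; with false it is dropped, which is the
 * removal proposed in this comment. */
static int sketch_reserve(bool apply_pending_update)
{
    sketch_copy_t lru_elem;   /* element evicted on the first pass */
    sketch_copy_t gpu_elem;   /* element allocated on the second pass */
    int readers_update = 0;

    atomic_init(&lru_elem.readers, 0);
    atomic_init(&gpu_elem.readers, 0);

    /* First pass: allocation fails, an LRU element is blocked and released. */
    if (sketch_block_copy(&lru_elem)) {
        readers_update = SKETCH_SENTINEL;
    }
    assert(atomic_load(&lru_elem.readers) == SKETCH_SENTINEL);

    /* Second pass: allocation succeeds, but readers_update is still set. */
    if (apply_pending_update) {
        atomic_fetch_add(&gpu_elem.readers, readers_update);
    }
    return atomic_load(&gpu_elem.readers);
}
```

In this sketch, sketch_reserve(true) leaves the new element with 1024 readers, while sketch_reserve(false) leaves it at 0. Whether the real code path can reach the second pass with the update still set is exactly what needs to be confirmed.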

bosilca commented 8 months ago

I don't think this analysis is correct.

  1. Nobody can take that element. This entire function runs in the context of the thread handling the current device (where the copy is located), so it is protected. What that CAS protects against is another thread trying to use the copy as the source for a device-to-device transfer (this is not ownership).
  2. We do not abandon the copy: we detach it from the old master and then repurpose it for another data. Once this is done, the readers shall be 0 again.
  3. When we go back to malloc_data, the first thing we do is reset copy_readers_update to zero.
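The three points above can be sketched in C11 as follows. This is an illustration under stated assumptions, not the PaRSEC implementation: every name is hypothetical, the sentinel value of 1024 is assumed, and the way another thread registers as a reader is a guess at one possible protocol.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_SENTINEL 1024   /* assumed value */

/* Hypothetical stand-in for a GPU data copy. */
typedef struct {
    atomic_int readers;
} sketch_copy_t;

/* Device-manager thread: block the copy only if it has no readers. */
static bool sketch_block_for_repurpose(sketch_copy_t *copy)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&copy->readers, &expected, SKETCH_SENTINEL);
}

/* Other thread (hypothetical protocol): register as a reader in order to use
 * the copy as a device-to-device source, unless the copy is blocked. */
static bool sketch_try_use_as_source(sketch_copy_t *copy)
{
    int seen = atomic_load(&copy->readers);
    while (seen < SKETCH_SENTINEL) {
        /* On failure, seen is refreshed and the loop condition is rechecked. */
        if (atomic_compare_exchange_weak(&copy->readers, &seen, seen + 1)) {
            return true;
        }
    }
    return false;
}

/* Point 2: after detaching and repurposing, the readers count is 0 again. */
static void sketch_finish_repurpose(sketch_copy_t *copy)
{
    assert(atomic_load(&copy->readers) == SKETCH_SENTINEL);
    atomic_store(&copy->readers, 0);
}

/* Point 3: the pending update is reset on every re-entry, so nothing set on a
 * failed pass survives into the pass where the allocation succeeds. */
static int sketch_retry_with_reset(int failed_attempts)
{
    int readers_update = 0;
    for (int attempt = 0; ; ++attempt) {
        readers_update = 0;                 /* reset at the top of each pass */
        if (attempt >= failed_attempts) {
            break;                          /* allocation succeeds on this pass */
        }
        readers_update = SKETCH_SENTINEL;   /* an LRU element was blocked */
    }
    return readers_update;
}
```

With a reset at the top of each pass, sketch_retry_with_reset returns 0 no matter how many passes failed first, so a sentinel-sized update would not reach the final element through this path.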