JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia
Other
281 stars 47 forks source link

Consolidate thread lock/data race handling #277

Open torrance opened 2 years ago

torrance commented 2 years ago

I've noticed a number of competing techniques used to manage data races throughout the codebase.

There is the RT_LOCK (I assume RT = runtime) used to manage global state access and modifications. However, it's not immediately clear which objects require this RT_LOCK, and its use is enforced only by code review.

In a few other places I've also noticed use of atomics to handle these cases, for example in the Queue objects during deletion.

I would like to suggest standardising on how we handle this to ensure correctness and simplicity of design. It makes code brittle if their are competing paradigms for handling data race conditions.

I'm not an expert in how to handle this kind of thing, and I'm sure there are various options but I'd like to propose to couple any shared state object with its lock, e.g. something like:

struct LockedObject{T}
    lock Threads.ReentrantLock
    payload T
end

and accesses to this object can happen via a pattern such as:

lock(QUEUES::LockedObject) do QUEUES
    # push or pop or whatever
end

where lock() is simply something like:

function lock(f, x)
    lock(x.lock)
    try
        return f(x.payload)
    finally
        unlock(x.lock)
    end
end

The benefits here mean its obvious which objects require locking before use, and any collection of objects which require synchronisation can share a lock between them.

My 2 cents. But I think it's a discussion that needs to happen.

jpsamaroo commented 2 years ago

Yeah, this is a great point. My intention for RT_LOCK is for it to be a simple way to make access to various global objects safe (especially anything that's an AbstractDict). But we don't really want a big global lock for a lot of things, so having a LockedObject concept sounds like a great approach.

The queue atomics (on .queue and .active) were put in in an attempt to fix the various queue issues that we had been seeing in CI (most of which I think had other solutions). While I do think one of them should remain atomic (so that we can, without lock contention, check queue state just before kernel launch or queue queries), the other can and should probably become not-atomic or be removed. I would personally want to see .active disappear and have us check if .queue == C_NULL to check if a queue is live (and we'd then atomically store C_NULL to it on queue error).