mariodirenzo opened this issue 1 year ago
I was looking at this code tonight and there is a related bug: the member methods of Realm::MultiAffineAccessor (e.g. read and write) are marked as REALM_CUDA_HD, but the data structure used for the cache is not (as far as I can tell) placed in zero-copy memory or copied down to the device when the accessor is used on the device.
An accessor can be copied by value (e.g. by passing it as an argument to a CUDA kernel launch) and the copy will have a cache that is now distinct from the original accessor's cache. There is no global storage for the cache and there should be no problems with device accessors unless somebody puts one in shared memory.
@mariodirenzo we can make the cpu-side one thread safe, but it will add locking overhead on every use, which seems bad. I can add another template parameter to MultiAffineAccessor, but we'd need @lightsighter to plumb it through the Legion interfaces, I think? I'm not wild about this approach though, because anybody who knows enough about the problem to opt for the slower-but-thread-safe version already knows what they need to know to avoid the problem (e.g. firstprivate for OpenMP).
I can add another template parameter to MultiAffineAccessor, but we'd need @lightsighter to plumb it through the legion interfaces I think?
I don't think the Legion interfaces will need to change. We already template all Legion accessors on the underlying Realm accessor so that will abstract nicely. Unfortunately, Legion does do a bunch of template partial specialization behind the scenes, and this would effectively create a new kind of Realm accessor type that we'd need to specialize on, so I'm not a huge fan of that.
If it is complicated, I am ok with keeping things as they are. However, it would be nice to have some kind of guard that checks whether someone is using the MultiAffineAccessor in OpenMP loops in an unsafe way. Another option is to add a template parameter that deactivates the cache system by declaring the Realm accessor as constant. If you only need to access a few points in the sparse compact instance, the performance penalty might be small.
Another option is to give a template parameter that deactivates the cache system by declaring the Realm accessor as constant.
I think that is exactly what @streichler already proposed here:
I can add another template parameter to MultiAffineAccessor... I'm not wild about this approach though, because anybody who knows enough about the problem to opt for the slower-but-thread-safe version already knows what they need to know to avoid the problem (e.g. firstprivate for openmp).
I'm not a big fan of that either for the reasons above.
Let me float an alternative solution: @streichler can we just remove the cache from the MultiAffineAccessor? It seems like the common case where it would be valuable is when users access all the elements in one piece before moving on to the next piece. If they're doing that, they really shouldn't be using the MultiAffineAccessor in the first place: they should be using a PieceIterator to iterate over the pieces and making normal AffineAccessors for each piece. In that sense, the cache just feels superfluous to me. The MultiAffineAccessor should really be for true random access, where users are unlikely to hit in the cache anyway.
I also think the lack of thread safety is problematic for the GPU in addition to the OpenMP case, as many threads in the same thread block could be using the same MultiAffineAccessor in the parameter space of the kernel at the same time and racing on the cache there as well. Removing the cache would fix all of those problems.
I also think just making accessors "stateless" and thread-safe feels better to me from a design principle standpoint.
I think that is exactly what @streichler already proposed here
I guess that declaring it as constant does not require any change on the Realm side, as the constant accessor automatically deactivates the cache system.
I suppose you could have const and non-const versions of all the methods, with the non-const methods using the cache and the const ones ignoring it. The entire object itself would need to be marked const in order to use it, though. Legion would almost certainly need another template parameter to control that.
I believe we have thread safety on the device because each GPU thread has its own copy of the accessor.
The change I was proposing for the cpu side was to add a template-parameter-controlled mutex so that the cache would be thread safe when shared by openmp threads. The Realm MultiAffineAccessor already knows to disable the cache if used via a const reference/pointer.
Looking up a point in the accessor requires log(N) sequential fetches from global device memory (or, worse, pinned sysmem). Even a fairly low hit rate in the cache (which tests a rectangle that lives in thread registers) seems very likely to come out ahead.
@mariodirenzo did you have performance data for your specific gpu kernels with/without the cache?
did you have performance data for your specific gpu kernels with/without the cache?
I do not have performance data right now but I can get them in the next few days
I believe we have thread safety on the device because each GPU thread has its own copy of the accessor.
That's definitely not true. Here's a counter example:
```cpp
#include <cstdio>
#include "cuda_runtime.h"

class Accessor {
public:
  __device__ __forceinline__
  void access(void)
  {
    printf("Thread %d of block %d reading accessor %p with value %d\n",
           threadIdx.x, blockIdx.x, this, val);
  }
public:
  int val;
};

__global__
void kernel(Accessor accessor)
{
  accessor.access();
}

int main(void)
{
  Accessor a;
  a.val = 4;
  kernel<<<10,1>>>(a);
  cudaDeviceSynchronize();
  return 0;
}
```
And the output:
```
Thread 0 of block 3 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 9 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 0 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 6 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 4 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 2 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 8 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 1 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 7 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 5 reading accessor 0x7f3405fffd1c with value 4
...
```
All the threads see the same accessor in the parameter space of the kernel and they can all dispatch non-const methods on it. To be clear, I think it's ridiculous that CUDA lets anything non-const go in the parameter space of the kernel, but that is the way it seems to be at the moment.
We also have a related issue: the Legion FieldAccessor instantiations with MultiAffineAccessor are currently marking those accessors as mutable (I don't remember why this is the case, but I remember it was for something that @mariodirenzo wanted). So you can pass a const FieldAccessor<..., MultiAffineAccessor<...>> as an argument to a kernel and the compiler will happily put it in the global parameter space of the kernel shared by all the threads. When the threads use the accessor, though, they'll be invoking the non-const member methods and racing on the cache. At a minimum I will probably get rid of all those mutable annotations on the Legion side unless @mariodirenzo gives me a reason not to, which will disable the cache for all uses of the MultiAffineAccessor in Legion.
Curious to see what the performance data shows as well.
You can't write to parameter space. If you do, the compiler makes a per-thread copy in local memory and/or registers. Can you dump the SASS for that kernel please?
One thread modifies the accessor and then there's a syncthreads, but you can see in the ptx that the modified value is never written anywhere the other threads can see it before the syncthreads.
We also have a related issue: the Legion FieldAccessor instantiations with MultiAffineAccessor are currently marking those accessors as mutable (I don't remember why this is the case, but I remember it was for something that @mariodirenzo wanted).
If I recall correctly, we switched to using mutable after a lengthy discussion on Slack where we discovered that we were not using the caching system of the MultiAffineAccessor.
At that time I was not using padded space and the caching inside the MultiAffineAccessor was helping performance (I can't find the file with the actual numbers). Now I have fewer accesses to the sparse compact instances and I suspect that the benefit of caching is not that large. I'll try to get quantitative measurements though.
One thread modifies the accessor and then there's a syncthreads, but you can see in the ptx that the modified value is never written anywhere the other threads can see it before the syncthreads.
That's pretty insane. The compiler suddenly making copy-on-write copies of arguments is just bonkers.
At that time I was not using padded space and the caching inside the MultiAffineAccessor was helping performance (I can't find the file with the actual numbers). Now I have fewer accesses to the sparse compact instances and I suspect that the benefit of caching is not that large. I'll try to get quantitative measurements though.
Ok, I'll wait to see the performance numbers before switching anything.
HTR performance - 20 iterations using the test @mariodirenzo supplied + HTR (Develop branch, commit bfcbd9af):
- Legion without 'mutable' (commented out in legion.inl) -> 34.176 seconds
- Legion with 'mutable' -> 22.015 seconds
Right, so we already have marked all the MultiAffineAccessor uses in the templates as mutable for performance (which I'm still not wild about, since it's different from the other cases). However, that solution is not actually safe if you're going to use the MultiAffineAccessor in a task running on an OpenMP processor. The question is whether we are willing to require users to use firstprivate when using these kinds of accessors (but not other kinds of accessors) with OpenMP processors.
The same HTR tests @mariodirenzo supplied for this show no performance improvement on CPUs with/without mutable for MultiAffineAccessor. On GPUs there is a performance improvement, as described earlier.
If the trend that @seemamirch has seen is confirmed for other use cases, one solution could be to avoid the mutable keyword for the CPU version of the accessor (or add a bool template parameter to control it), making it thread-safe. The GPU version could keep the mutable Realm accessor.
The same HTR tests @mariodirenzo supplied for this shows no performance improvement on CPUs with/without mutable for MultiAffineAccessor
Was that with LOC_PROC or OMP_PROC processors? If the latter, how many threads per processor? How did you guarantee thread safety?
avoid the mutable keyword for the CPU version of the accessor, making them threadsafe. The GPU version could keep having the mutable Realm accessor
That's not actually a safe thing to do for CUDA. The host-side representation of the structure needs to be identical to the device-side representation; if it isn't, you can get undefined behavior. Now, theoretically mutable shouldn't change the layout, so it might be ok. However, I've seen issues before with differences from the mutable keyword between the host and device (e.g. with the Domain class in Legion).
or add a bool template parameter
We would be forcing that template parameter onto all FieldAccessor users for all Realm accessor kinds. That seems like overkill to me and overly burdensome on the common case.
The same HTR tests @mariodirenzo supplied for this shows no performance improvement on CPUs with/without mutable for MultiAffineAccessor
Was that with LOC_PROC or OMP_PROC processors? If the latter, how many threads per processor? How did you guarantee thread safety?
OMP_PROC processors, 8 threads per processor. It uses firstprivate for the multi affine accessors.
OMP_PROC processors, 8 threads per processor. It uses firstprivate for multi affine accessors
Do the tasks linearly walk the memory for the compact-sparse instances or is it random-access look up in the compact sparse instances?
Do the tasks linearly walk the memory for the compact-sparse instances or is it random-access look up in the compact sparse instances?
They linearly walk through the compact-sparse instances
If MultiAffineAccessors are not marked as a constant variable, they implement a caching algorithm that remembers the last accessed affine piece. This mechanism is not thread-safe, and using the accessors in OpenMP-parallelized loops may lead to undefined behavior. The current known solution is to mark the accessors as firstprivate in the OpenMP loop header. Probably a more robust solution is required.