StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: MultiAffineAccessors are not thread safe #1364

Open mariodirenzo opened 1 year ago

mariodirenzo commented 1 year ago

If MultiAffineAccessors are not marked as const, they implement a caching algorithm that remembers the last accessed affine piece. This mechanism is not thread-safe, and using the accessors in OpenMP-parallelized loops may lead to undefined behavior. The currently known workaround is to mark the accessors as firstprivate in the OpenMP loop header.

Probably a more robust solution is required.
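
For reference, a minimal sketch of the firstprivate workaround (the accessor type, the operator[] access, and the Rect3/Point3 stand-ins below are illustrative assumptions, not code from a real application):

#include <omp.h>

// Toy task body: 'acc' stands in for a read-write Legion FieldAccessor backed by a
// Realm MultiAffineAccessor; Rect3/Point3 stand in for Legion's Rect<3>/Point<3>.
template <typename Accessor, typename Rect3, typename Point3>
void scale_field(Accessor acc, const Rect3 &bounds, double factor)
{
  // firstprivate gives every OpenMP thread its own copy of the accessor, so each
  // thread only ever reads and updates its private last-piece cache.
  #pragma omp parallel for collapse(3) firstprivate(acc)
  for (long long k = bounds.lo[2]; k <= bounds.hi[2]; k++)
    for (long long j = bounds.lo[1]; j <= bounds.hi[1]; j++)
      for (long long i = bounds.lo[0]; i <= bounds.hi[0]; i++)
        acc[Point3(i, j, k)] *= factor;
}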

lightsighter commented 1 year ago

I was looking at this code tonight and there is a related bug: the member methods of Realm::MultiAffineAccessor (e.g. read and write) are marked as REALM_CUDA_HD, but the data structure that is used for the cache is not (to the best of my ability to tell) placed in zero-copy memory or copied down to the device when using the accessor on the device.

streichler commented 1 year ago

An accessor can be copied by value (e.g. by passing it as an argument to a CUDA kernel launch) and the copy will have a cache that is now distinct from the original accessor's cache. There is no global storage for the cache and there should be no problems with device accessors unless somebody puts one in shared memory.

@mariodirenzo we can make the CPU-side one thread-safe, but it will add locking overhead on every use, which seems bad. I can add another template parameter to MultiAffineAccessor, but we'd need @lightsighter to plumb it through the Legion interfaces I think? I'm not wild about this approach though, because anybody who knows enough about the problem to opt for the slower-but-thread-safe version already knows what they need to know to avoid the problem (e.g. firstprivate for OpenMP).
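
For illustration, here is a toy of what that opt-in template parameter could look like (the names and the cache/lookup logic are invented, not Realm's actual internals):

#include <mutex>
#include <type_traits>

struct NoLock   { void lock() {} void unlock() {} };                              // default: zero overhead
struct RealLock { std::mutex m; void lock() { m.lock(); } void unlock() { m.unlock(); } };

// THREAD_SAFE would default to false so existing users pay nothing; opting in wraps
// every cache update in a lock/unlock, which is the per-use overhead mentioned above.
template <bool THREAD_SAFE = false>
struct ToyMultiPieceAccessor {
  using Lock = typename std::conditional<THREAD_SAFE, RealLock, NoLock>::type;
  mutable Lock lock_;
  mutable int cached_piece = -1;                 // "last piece hit" cache

  double read(long p) const {
    lock_.lock();
    cached_piece = static_cast<int>(p / 16);     // stand-in for the real piece lookup
    lock_.unlock();
    return static_cast<double>(p);               // stand-in for the actual load
  }
};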

lightsighter commented 1 year ago

I can add another template parameter to MultiAffineAccessor, but we'd need @lightsighter to plumb it through the Legion interfaces I think?

I don't think the Legion interfaces will need to change. We already template all Legion accessors on the underlying Realm accessor so that will abstract nicely. Unfortunately, Legion does do a bunch of template partial specialization behind the scenes, and this would effectively create a new kind of Realm accessor type that we'd need to specialize on, so I'm not a huge fan of that.

mariodirenzo commented 1 year ago

If it is complicated, I am OK with keeping things as they are. However, it would be nice to have some kind of guard that checks whether someone is using the MultiAffineAccessor in OpenMP loops in an unsafe way. Another option is to give a template parameter that deactivates the cache system by declaring the Realm accessor as constant. If you only need to access a few points in the sparse compact instance, the performance penalty might be small.

lightsighter commented 1 year ago

Another option is to give a template parameter that deactivates the cache system by declaring the Realm accessor as constant.

I think that is exactly what @streichler already proposed here:

I can add another template parameter to MultiAffineAccessor... I'm not wild about this approach though, because anybody who knows enough about the problem to opt for the slower-but-thread-safe version already knows what they need to know to avoid the problem (e.g. firstprivate for OpenMP).

I'm not a big fan of that either for the reasons above.

Let me float an alternative solution: @streichler can we just remove the cache from the MultiAffineAccessor? It seems like the common case where it would be valuable is when users are accessing all the elements in one piece before moving on to the next piece. If they're doing that, they shouldn't be using the MultiAffineAccessor in the first place; they should be using a PieceIterator to iterate over the pieces and making normal AffineAccessors for each piece. In that sense, the cache just feels superfluous to me. The MultiAffineAccessor should really be for true random access, where users are unlikely to hit in the cache anyway.

I also think the lack of thread safety is problematic for the GPU in addition to the OpenMP case as many threads in the same thread block could be using the same MultiAffineAccessor in the parameter space of the kernel at the same time and racing on the cache there as well. Removing the cache would just fix all those problems.

I also think just making accessors "stateless" and thread-safe feels better to me from a design principle standpoint.

mariodirenzo commented 1 year ago

I think that is exactly what @streichler already proposed here

I guess that declaring it as constant does not require any change on the Realm side, as the constant accessor automatically deactivates the cache system.

lightsighter commented 1 year ago

I suppose you could have const and non-const versions of all the methods, with the non-const methods using the cache and the const ones ignoring it. The entire object itself, though, would need to be marked const in order to use it. Legion would almost certainly need another template parameter to control that.
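
A compilable toy of that const/non-const split (the class and the lookup are invented; Realm's real accessor is far more involved):

#include <cstdio>

struct ToyAccessor {
  int cached_piece = -1;            // "last piece" cache

  // non-const overload: allowed to update the cache
  int read(int p) {
    cached_piece = p / 16;          // remember which piece this point hit
    return p * 2;                   // stand-in for the real lookup
  }
  // const overload: never touches the cache, so a const accessor (or a const
  // reference to one) can safely be shared across threads
  int read(int p) const {
    return p * 2;
  }
};

int main() {
  ToyAccessor a;
  const ToyAccessor &c = a;         // a const view selects the cache-free overload
  std::printf("%d %d\n", a.read(3), c.read(3));
  return 0;
}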

streichler commented 1 year ago

I believe we have thread safety on the device because each GPU thread has its own copy of the accessor.

The change I was proposing for the CPU side was to add a template-parameter-controlled mutex so that the cache would be thread-safe when shared by OpenMP threads. The Realm MultiAffineAccessor already knows to disable the cache if used via a const reference/pointer.

Looking up a point in the accessor requires log(N) sequential fetches from global device memory (or, worse, pinned system memory). Even a fairly low hit rate in the cache (which tests a rectangle that lives in thread registers) seems very likely to come out ahead.
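
To make that comparison concrete, here is a toy 1-D device-side sketch (the piece list and cache layout are invented, not Realm's): the hit path is two compares against values the compiler keeps in registers, while the miss path does a binary search whose every probe is a dependent load from the piece list in global memory.

#include <cstdio>
#include "cuda_runtime.h"

struct Piece { long lo, hi; };                // bounds of one affine piece (1-D for brevity)
struct PieceCache { long lo, hi; int idx; };  // bounds and index of the last piece hit

__device__ int find_piece(const Piece *pieces, int n, long p, PieceCache &cache)
{
  // Hit: a bounds test on register-resident values, no memory traffic.
  if (cache.lo <= p && p <= cache.hi)
    return cache.idx;
  // Miss: ~log2(n) sequential, dependent loads from the piece list in global memory.
  int lo = 0, hi = n - 1;
  while (lo < hi) {
    int mid = (lo + hi) / 2;
    if (p > pieces[mid].hi) lo = mid + 1; else hi = mid;
  }
  cache.lo = pieces[lo].lo; cache.hi = pieces[lo].hi; cache.idx = lo;
  return lo;
}

__global__ void lookup_kernel(const Piece *pieces, int n)
{
  PieceCache cache = { 1, 0, -1 };            // empty cache (lo > hi never matches)
  long p = threadIdx.x;
  for (int step = 0; step < 4; step++)        // nearby points mostly hit the cache
    printf("thread %d: point %ld is in piece %d\n",
           threadIdx.x, p + step, find_piece(pieces, n, p + step, cache));
}

int main()
{
  const Piece host_pieces[3] = { {0, 9}, {10, 19}, {20, 29} };
  Piece *dev_pieces;
  cudaMalloc(&dev_pieces, sizeof(host_pieces));
  cudaMemcpy(dev_pieces, host_pieces, sizeof(host_pieces), cudaMemcpyHostToDevice);
  lookup_kernel<<<1, 4>>>(dev_pieces, 3);
  cudaDeviceSynchronize();
  cudaFree(dev_pieces);
  return 0;
}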

@mariodirenzo did you have performance data for your specific gpu kernels with/without the cache?

mariodirenzo commented 1 year ago

did you have performance data for your specific gpu kernels with/without the cache?

I do not have performance data right now, but I can get some in the next few days.

lightsighter commented 1 year ago

I believe we have thread safety on the device because each GPU thread has its own copy of the accessor.

That's definitely not true. Here's a counterexample:

#include <cstdio>
#include "cuda_runtime.h"

class Accessor {
public:
  __device__ __forceinline__
  void access(void)
  {
    printf("Thread %d of block %d reading accessor %p with value %d\n",
      threadIdx.x, blockIdx.x, this, val);  
  }
public:
  int val;
};

__global__
void kernel(Accessor accessor)
{
  accessor.access(); 
}

int main(void)
{
  Accessor a;
  a.val = 4;

  kernel<<<10,1>>>(a);
  cudaDeviceSynchronize();

  return 0;
}

And the output:

Thread 0 of block 3 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 9 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 0 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 6 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 4 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 2 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 8 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 1 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 7 reading accessor 0x7f3405fffd1c with value 4
Thread 0 of block 5 reading accessor 0x7f3405fffd1c with value 4
...

All the threads see the same accessor in the parameter space of the kernel and they can all dispatch non-const methods on it. To be clear, I think it's ridiculous that CUDA lets anything non-const go in the parameter space of the kernel, but that is the way it seems to be at the moment.

We also have a related issue: the Legion FieldAccessor instantiations with MultiAffineAccessor are currently marking those accessors as mutable (I don't remember why this is the case, but I remember it was for something that @mariodirenzo wanted). So you can pass in a const FieldAccessor<...., MultiAffineAccessor<...>> as an argument to a kernel and the compiler will happily put that in the global parameter space of the kernel shared by all the threads in the kernel. When they use the accessor though, they'll be invoking the non-const member methods and racing on the cache. At a minimum I think I will probably get rid of all those mutable annotations on the Legion side unless @mariodirenzo gives me a reason not to, which will disable the cache for all uses of the MultiAffineAccessor in Legion.
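
To make that concrete, here is a toy version of the mutable situation (the class names are invented; the real annotations live in the FieldAccessor specializations in legion.inl):

// Even though the wrapper object is const, the mutable inner accessor lets the
// const read() call the non-const inner read() and update the shared cache, which
// is exactly what allows the race when many threads share one const accessor.
struct InnerAccessor {
  int cached_piece = -1;
  int read(int p) { cached_piece = p / 16; return p * 2; }   // non-const: writes the cache
};

struct Wrapper {                      // stand-in for FieldAccessor<..., MultiAffineAccessor<...>>
  mutable InnerAccessor inner;        // 'mutable' lets const methods mutate it
  int read(int p) const { return inner.read(p); }
};

int main() {
  const Wrapper w{};                  // const object, yet read() still updates the inner cache
  return (w.read(3) == 6) ? 0 : 1;
}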

Curious to see what the performance data shows as well.

streichler commented 1 year ago

You can't write to parameter space. If you do, the compiler makes a per-thread copy in local memory and/or registers. Can you dump the SASS for that kernel please?

streichler commented 1 year ago

Here's an example: https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIAruiakl9ZATwDKjdAGFUtEywYgA7KUcAZPAZMADl3ACNMYgkuUgAHVAVCOwYXNw9vOISkgQCg0JYIqK4Yq0wbZKECJmICVPdPH1LygUrqglyQ8MjoyyqauvTGvvbAzoLu4oBKS1QTYmR2DmQDBQUAagBBZAXVkjWAUi8AIX2NDbW12JMw2jxkEFPzi7XAgjXVfYBmE7PntYB9f5YABud0wgLWwNQeHQaxYJiqBEwEFeawAnpMDsd3gcAEwnT4AEXRXyOWMJj0OFO%2BlLOgOAtFQYTEEKhMLWAGtIkFaBAtjsFHsLC9BGsAFSxSaU46PP54KgQAgIYiYJjoACS6FUADpVAciV9iRopb8/msmNr4YjkVwpTTTRdAQo0QwbMrVegFBA7T8nhcJfriRaPvaNlSONNaJwAKy8TwcLSkVCcJxyQnnJz7fFZsmCuYLPGfHikAiaCPTDkgT6fbXVuv1hsANn0nEkcbLSc4vAUIA0JbL0zgsCQaBYsTokXIlFH4/oUQYwO2XFxMRBYIAanhMAB3ADysUYHZotCRxB7EDCHbCgWqaM4xevzGIaN3YW0ZVL3F4o7Ygl3DFoO8E14LAWEMBlFkTfAVXKYFMB7YDfFUMoEUWYtXkwKNENuMJiFvFwsA7AhiDwFh7wjPgDGABRNx3fdDy/GRBBEMR2CkJj5CUNQO10GIDCMFAzAsHCe0gaZUFiWwBAQ7tMI/ZIHAYZxXHqPR/FGfJCj0eJEiklIVPSGIdOyBgOk0iZLDk5oGFafoDM8EorL02yRjyLoil6NoBgczyajM9yJGmPN5jYyMY3bRDkw4NZgnXJwnDWZdtS4bUNDWCBcEIPYsyLSZeE/LRJgrKsawbMq62bLC21IeNEyi7te37YDBxgRAQDXBYpwgGcJ2IYJWEWWL4sS3FktSkDMFBBYQGUAAVAANXx8CIEj0D0fhmNEcR2I2ziVHURDeMs6w9MU5S0h89S3PGDzjL07ztKyPT/JuvQmmc4YHsck6KmGF6tJKT77Lev6NIC20SxVTAYV7MKOFjGqOyiwlJrBNZaO3SIYrihKkpStKMuW7LcSLUg1hcMdesLW18oHYrq1rcqypbDhqtq3h6ssRqCvLFncQiuquyawrpjgs8FMkIA%3D%3D

One thread modifies the accessor and then there's a syncthreads, but you can see in the ptx that the modified value is never written anywhere the other threads can see it before the syncthreads.
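
For the record, here is a rough reconstruction of that experiment (the code behind the godbolt link may differ in its details):

#include <cstdio>
#include "cuda_runtime.h"

struct Accessor { int val; };

__global__ void kernel(Accessor accessor)
{
  // Thread 0 writes to the by-value kernel parameter...
  if (threadIdx.x == 0)
    accessor.val = 42;
  __syncthreads();
  // ...but after the barrier the other threads still print the original value,
  // because the write only touched thread 0's private copy.
  printf("Thread %d sees value %d\n", threadIdx.x, accessor.val);
}

int main(void)
{
  Accessor a;
  a.val = 4;

  kernel<<<1,4>>>(a);
  cudaDeviceSynchronize();

  return 0;
}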

mariodirenzo commented 1 year ago

We also have a related issue: the Legion FieldAccessor instantiations with MultiAffineAccessor are currently marking those accessors as mutable (I don't remember why this is the case, but I remember it was for something that @mariodirenzo wanted).

If I recall correctly, we switched to using mutable after a lengthy discussion on Slack where we discovered that we were not using the caching system of the MultiAffineAccessor. At that time I was not using padded space and the caching inside the MultiAffineAccessor was helping performance (I can't find the file with the actual numbers). Now I have fewer accesses to the sparse compact instances and I suspect that the benefit of caching is not that large. I'll try to get quantitative measurements though.

lightsighter commented 1 year ago

One thread modifies the accessor and then there's a syncthreads, but you can see in the ptx that the modified value is never written anywhere the other threads can see it before the syncthreads.

That's pretty insane. The compiler suddenly making copy-on-write copies of arguments is just bonkers.

At that time I was not using padded space and the caching inside the MultiAffineAccessor was helping performance (I can't find the file with the actual numbers). Now I have fewer accesses to the sparse compact instances and I suspect that the benefit of caching is not that large. I'll try to get quantitative measurements though.

Ok, I'll wait to see the performance numbers before switching anything.

seemamirch commented 3 months ago

HTR performance, 20 iterations, using the test @mariodirenzo supplied plus HTR (Develop branch, commit bfcbd9af):

Legion without 'mutable' (commented out in legion.inl) -> 34.176 seconds
Legion with 'mutable' -> 22.015 seconds

lightsighter commented 3 months ago

Right, so we have already marked all the MultiAffineAccessor uses in the templates as mutable for performance (which I'm still not wild about, since it's different from the other cases). However, that solution is not actually safe if you're going to use the MultiAffineAccessor in a task running on an OpenMP processor. The question is whether we are willing to require users to mark these kinds of accessors (but not other kinds of accessors) as firstprivate when using them with OpenMP processors.

seemamirch commented 2 months ago

The same HTR tests @mariodirenzo supplied for this show no performance improvement on CPUs with/without mutable for MultiAffineAccessor. On GPUs there is a performance improvement, as described earlier.

mariodirenzo commented 2 months ago

If the trend that @seemamirch has seen is confirmed for other use cases, one solution could be to avoid the mutable keyword for the CPU version of the accessor (or control it with a bool template parameter), making it thread-safe. The GPU version could keep having the mutable Realm accessor.

lightsighter commented 2 months ago

The same HTR tests @mariodirenzo supplied for this show no performance improvement on CPUs with/without mutable for MultiAffineAccessor

Was that with LOC_PROC or OMP_PROC processors? If the latter, how many threads per processor? How did you guarantee thread safety?

avoid the mutable keyword for the CPU version of the accessor, making it thread-safe. The GPU version could keep having the mutable Realm accessor

That's not actually a safe thing to do for CUDA. The host-side representation of the structure needs to be identical to the device-side representation; if it isn't, you can get undefined behavior. Now, theoretically mutable shouldn't change the layout, so it might be OK. However, I've seen issues before with differences in the mutable keyword between host and device (e.g. with the Domain class in Legion).

or add a bool template parameter

We would be forcing that template parameter onto all FieldAccessor users for all Realm accessor kinds. That seems overkill to me and overly burdensome on the common case.

seemamirch commented 2 months ago

The same HTR tests @mariodirenzo supplied for this show no performance improvement on CPUs with/without mutable for MultiAffineAccessor

Was that with LOC_PROC or OMP_PROC processors? If the latter, how many threads per processor? How did you guarantee thread safety?

OMP_PROC processors, 8 threads per processor. It uses firstprivate for the MultiAffineAccessors.

lightsighter commented 2 months ago

OMP_PROC processors, 8 threads per processor. It uses firstprivate for the MultiAffineAccessors.

Do the tasks linearly walk the memory for the compact-sparse instances, or is it random-access lookup in the compact-sparse instances?

mariodirenzo commented 2 months ago

Do the tasks linearly walk the memory for the compact-sparse instances, or is it random-access lookup in the compact-sparse instances?

They linearly walk through the compact-sparse instances.