StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

S3D: creating hundreds of instances #1678

Open syamajala opened 2 months ago

syamajala commented 2 months ago

I can see S3D creating hundreds of small instances in system memory and I have no idea why or where they're coming from.

I tried the logging wrapper and can see things like:

[0 - 1554f8036000]    8.970248 {2}{inst}: creating new local instance: 4000000000000022
[0 - 1554f8036000]    8.970253 {1}{inst}: instance layout: inst=4000000000000022 layout=Layout(bytes=3504, align=7008, fields={0=0+0}, lists=[[<0>..<0>->affine(<3504>+0)]])
[0 - 1554f8036000]    8.970255 {1}{inst}: allocation completed: inst=4000000000000022 offset=133152
[0 - 1554f8036000]    8.970257 {2}{inst}: instance created: inst=4000000000000022 external=memory(base=154c8e020820, size=3504) ready=0
...
[0 - 1554f8053000]   15.294676 {2}{inst}: instance destroyed: inst=4000000000000022 wait_on=0
[0 - 1554f8053000]   15.294679 {1}{inst}: deallocation completed: inst=4000000000000022
[0 - 1554f8053000]   15.294771 {2}{inst}: releasing local instance: 4000000000000022

but I never see the instances get used anywhere.

The output of the logging wrapper is here: http://sapling2.stanford.edu/~seshu/s3d_tdb/instances/run_0.log

A profile is here: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/

elliottslaughter commented 2 months ago

Could these be future instances?

elliottslaughter commented 2 months ago

Or else, could it be the deferred buffers we create when the kernel launch arguments overflow the limit?

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L227

lightsighter commented 2 months ago

Could be either. Both would show up as external instances. Future instances would occur if buffers were being returned from the application for Legion to take ownership of as the future result. Deferred buffers would look like external instances made on top of the eager allocation pool. Given the size quoted here of 3504 bytes, my guess is that the second case (deferred buffers) is the more likely one.
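
For readers unfamiliar with the two mechanisms being discussed, here is a minimal, hypothetical C++ sketch of both patterns; the task bodies, names, and sizes are illustrative and not taken from S3D or Regent:

```cpp
#include "legion.h"
using namespace Legion;

// (1) Future instance: a non-void task return value is handed to Legion to
//     own as the future result; small external instances can be made for it.
int make_future_task(const Task *task,
                     const std::vector<PhysicalRegion> &regions,
                     Context ctx, Runtime *runtime)
{
  return 42;
}

// (2) Deferred buffer: a small scratch allocation carved out of a memory's
//     eager allocation pool and exposed as an external instance.
void use_deferred_buffer_task(const Task *task,
                              const std::vector<PhysicalRegion> &regions,
                              Context ctx, Runtime *runtime)
{
  Rect<1> bounds(0, 437);  // 438 doubles = 3504 bytes, the size quoted in the log above
  DeferredBuffer<double, 1> scratch(bounds, Memory::Z_COPY_MEM);
  scratch.write(Point<1>(0), 3.14);
}
```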

lightsighter commented 2 months ago

Although if they are in system memory, that means they are not being used for the GPU and might therefore be more likely to be futures.

syamajala commented 2 months ago

Is there some way we could get provenance for these instances?

lightsighter commented 2 months ago

Legion already logs the creator operation for each instance, so the profiler can look up the provenance string for that operation. The profiler doesn't do that today, but it could without any changes to the logging interface. Every instance has the name of the operation that created it: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.h?ref_type=heads#L367 and every operation has a provenance string: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.h?ref_type=heads#L185
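
As a sketch of the join the profiler could perform (the container names and values below are illustrative and not the actual legion_prof schema):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main()
{
  // Logged per instance: instance id -> creator operation id (illustrative values).
  std::map<uint64_t, uint64_t> inst_creator = {{0x4000000000000022ULL, 1081}};
  // Logged per operation: operation id -> provenance string (illustrative values).
  std::map<uint64_t, std::string> op_provenance = {{1081, "example.rg:123"}};

  // The profiler could join the two relations to attribute each instance.
  for (const auto &ic : inst_creator) {
    auto it = op_provenance.find(ic.second);
    if (it != op_provenance.end())
      std::cout << std::hex << ic.first << std::dec << " created by op "
                << ic.second << " (" << it->second << ")\n";
  }
  return 0;
}
```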

I'm hesitant to add provenance strings to the task postamble and deferred buffer classes. In the case of the postamble, the instances don't get made right away, so we'd have to copy the provenance string and store it on the heap for an indeterminate amount of time. We might also need to copy it between nodes if the future data moves without ever creating an instance. The case is better for deferred buffers, since their instances get made right away, but their interface is already a mess and I don't want to make it messier than it already is.

elliottslaughter commented 2 months ago

If the profiling change is sufficient, let's do that?

lightsighter commented 2 months ago

It should at least tell you which operation is making the instances. It won't tell you exactly which line of code is responsible, but maybe that is close enough.

Are we sure these instances are actually in the system memory and not in the zero-copy memory?

elliottslaughter commented 2 months ago

I think we confirmed via logging in the compiler that these instances are the result of spilling arguments for CUDA kernels, and there is a fairly straightforward path to splitting the kernels up so we don't need to spill so much.

lightsighter commented 2 months ago

Ok, so they were going into the zero-copy memory instead of the system memory then, right? That way they were visible on the host for scribbling and on the device for reading.

elliottslaughter commented 2 months ago

Yes, Regent places spill arguments into zero-copy memory. I'm not sure why Seshu would have seen them in system memory; the code to put them in zero-copy is right here:

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L1103
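
The linked code is Terra, but the idea it implements can be sketched roughly in C++ as follows (the struct layout, names, and commented-out kernel launch are hypothetical):

```cpp
#include "legion.h"
using namespace Legion;

// Hypothetical argument pack that exceeds the kernel launch-parameter limit.
struct KernelArgs {
  double *fields[64];
  double  scalars[128];
};

// Inside a Legion task: spill the oversized arguments into a zero-copy
// deferred buffer so the device kernel can read them through one pointer.
void launch_with_spilled_args(const KernelArgs &args)
{
  Rect<1> bounds(0, 0);
  DeferredBuffer<KernelArgs, 1> spill(bounds, Memory::Z_COPY_MEM);
  *spill.ptr(Point<1>(0)) = args;
  // my_kernel<<<blocks, threads>>>(spill.ptr(Point<1>(0)));  // hypothetical launch
}
```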

lightsighter commented 2 months ago

If @syamajala can confirm that he was actually seeing them in the sysmem and not the zero-copy memory, I suspect there might actually be a bug in Realm. These instances would have been allocated out of the eager pool for the zero-copy memory, which hands back a pointer into the zero-copy memory. To make an instance for the deferred buffer object, Legion then asks Realm to do an external instance creation based on that pointer, and asks Realm to pick the "suggested memory" for that instance to go into. My guess is that Realm is failing to recognize that the pointer is actually contained within the zero-copy memory when it does the pointer-based look-up for the suggested memory.
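
A rough sketch of the Realm lookup being described, assuming the ExternalMemoryResource / suggested_memory() interface is the path Legion takes here (treat the exact signatures as approximate):

```cpp
#include <cstddef>
#include "realm.h"
using namespace Realm;

// Given a pointer that actually lives inside the zero-copy eager pool, ask
// Realm which memory it believes the pointer belongs to.  The suspicion above
// is that this lookup returns the system memory instead of the zero-copy
// memory for such pointers.
Memory find_memory_for_pointer(void *ptr, size_t size)
{
  ExternalMemoryResource resource(reinterpret_cast<uintptr_t>(ptr), size,
                                  false /*read_only*/);
  return resource.suggested_memory();
}
```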

syamajala commented 2 months ago

Yes, they are in system memory; you can see them in the profile I linked above: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/

I guess there are a lot in zero-copy as well, but those all have provenance.

lightsighter commented 2 months ago

I decided I'm not actually going to ask the Realm team to fix this particular issue. It's a pretty obscure case, and it's not clear we should be aliasing Realm instances this way. Once we have instance redistricting and I can redo the memory management, we won't encounter this problem.