StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent: Leaking index and field spaces #606

Open magnatelee opened 5 years ago

magnatelee commented 5 years ago

Regent tasks are currently leaking index and field spaces when they exit. The compiler should emit code to clean them up.
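
For illustration, a minimal hand-written sketch of the kind of cleanup being asked for, using the same C API the generated code already calls (the task name, variables, and exact calls here are illustrative only; whether the compiler emits precisely these calls is an implementation detail):

import "regent"

local c = regentlib.c

task example()
  var is = ispace(int1d, 10)  -- index space created by the task
  var r = region(is, int)     -- creating a region also creates a field space
  -- ... task body ...
  -- the kind of cleanup the compiler should emit before the task returns,
  -- so that the index space and field space are not leaked:
  c.legion_field_space_destroy(__runtime(), __context(), __raw(r).field_space)
  c.legion_index_space_destroy(__runtime(), __context(), __raw(is))
end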

syamajala commented 5 years ago

Is there a work around for this issue in the short term? The runtime warnings are really annoying.

lightsighter commented 5 years ago

The runtime warnings should all be off by default now unless you explicitly enable them.

magnatelee commented 4 years ago

I want to make a note that many of the Regent tests are also leaking layout constraints.

elliottslaughter commented 4 years ago

Aren't our layout constraints all static? How exactly are we supposed to clean them up? Do we need a cleanup thunk to clean up the results of our registration thunk?

magnatelee commented 4 years ago

It's a consequence of field spaces not being destroyed; field spaces are holding references to layout constraints, so until they are destroyed, those references are still alive.

elliottslaughter commented 7 months ago

I think we fixed this a long time ago?

lightsighter commented 7 months ago

Do you want to try turning on the runtime checks and confirming that? 😉

elliottslaughter commented 7 months ago

We've been through several rounds of that and have fixed various issues along the way. As far as I know, we don't run legion_gc.py in CI, but I believe we've cleared several large applications of leaks and I'm not aware of any outstanding ones...

lightsighter commented 7 months ago

I didn't mean legion_gc.py. I meant running with the -lg:leaks flag which will give you a warning if you leak any resources that you created. If you convert warnings to errors you'll also get hard crashes in the CI if you are leaking anything.
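
For example, the flag goes on the normal command line of whatever you run, so for one of the Regent tests it would look something like:

$ ./regent.py tests/regent/run_pass/call_task_polymorphic1.rg -lg:leaks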

elliottslaughter commented 6 months ago

If you convert warnings to errors

How do you do this?

lightsighter commented 6 months ago

Build with -DLEGION_WARNINGS_FATAL
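
For anyone following along: with a Make-based build this can be passed through CC_FLAGS when rebuilding, along the lines of the command below (adjust to whatever build entry point you actually use; the exact invocation here is an assumption):

$ CC_FLAGS=-DLEGION_WARNINGS_FATAL ./install.py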

elliottslaughter commented 6 months ago

I can't actually run the Regent test suite in this mode because it causes warnings like the following to turn into errors as well:

$ ./regent.py tests/regent/run_pass/call_task_polymorphic1.rg
[0 - 7ff84f868100]    0.000151 {4}{threads}: reservation ('utility proc 1d00000000000000') cannot be satisfied
[0 - 70000b5c6000]    0.103747 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 0 of operation t1 (UID 3, provenance: tests/regent/run_pass/call_task_polymorphic1.rg:91) in parent task main (UID 2) is using uninitialized data for field(s) 102,103 of logical region (1,1,1) (from file /Users/elliott/Programming/Legion/legion/runtime/legion/legion_ops.cc:1838)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

What I want is to make only the warnings from -lg:leaks into errors.

Anyway, there is a change in https://gitlab.com/StanfordLegion/legion/-/merge_requests/1257 which cleans up the leaks I am currently aware of (and have had the patience to manually check, since I cannot rely on an automated test suite to confirm).

elliottslaughter commented 6 months ago

Actually it's worse than that, because I have to build with -DLEGION_WARNINGS_FATAL to get the list of tests that might be hitting leak errors, then recompile without -DLEGION_WARNINGS_FATAL so that I can actually run them to completion to see whether they actually leak. (If a test hits some other warning-turned-error first, execution is terminated prematurely, so I can't know whether the code would have leaked had it run to completion.)

So I think I cannot validate the current situation at all except through extensive manual effort, which I am not willing to do right now.

lightsighter commented 6 months ago

You can manually turn the warnings in this function into errors to get the behavior you want: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_context.cc?ref_type=heads#L6334-6523

I don't think it's a common enough use case to do this generally for everyone.

elliottslaughter commented 6 months ago

Just for posterity, the issue I am seeing now (even without leak checks or hard errors) is that index spaces are being deleted too eagerly. A sample reproducer and some explanatory comments are available here:

https://gitlab.com/StanfordLegion/legion/-/merge_requests/1257#note_1901183719

lightsighter commented 5 months ago

I've removed the index space requirement checks in the inordercommit branch since they don't do anything anymore there. That branch should make it into the master branch later today.

elliottslaughter commented 5 months ago

I think that https://gitlab.com/StanfordLegion/legion/-/merge_requests/1257 is ready to go now, just waiting for users to confirm that it doesn't break anything.

lightsighter commented 5 months ago

Good to go from my side if you are happy with it. The inordercommit branch has merged.

elliottslaughter commented 5 months ago

There are two major caveats currently remaining:

Caveat 1: Some Programs Rely on Leaks

For example, S3D has code that looks like:

task make_partition(...)
  var part = ...
  var part_color = c.legion_index_partition_get_color(..., __raw(part))
  return part_color
end

The compiler has a very simple escape analysis that would detect if part were returned directly. But because we're getting the __raw value and passing that into a C API call, and then returning the color, I see basically no reasonable way for the compiler to determine that there's a dataflow here.

I see two possible solutions:

  1. Assume that __raw always escapes. If you ever use __raw for any purpose, we'll leak the object.
  2. Add a separate __leak operator to explicitly leak objects so that we do not collect the partition at the end of the task. This means that some user programs (like S3D) will need to be modified to account for this change; a sketch follows below.
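
For example, under option 2 the S3D-style task above might become something like this (__leak is hypothetical and does not exist today; the rest mirrors the snippet above):

task make_partition(...)
  var part = ...
  var part_color = c.legion_index_partition_get_color(..., __raw(part))
  __leak(part)  -- hypothetical: tell the compiler not to collect part at the end of the task
  return part_color
end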

Caveat 2: Tracing Does Not Support Deletion

This can get really subtle because Regent has to create implicit index spaces in some cases. Here are some examples of code that creates index spaces implicitly:

for i = 0, N do
  c.legion_begin_trace(...)
  __demand(__index_launch)
  for j = 0, M do
    ...
  end
  c.legion_end_trace(...)
end

Regent has to create an index space [0, M) for the j loop in order to do the index launch. Previously, we simply leaked all such index spaces. Now we collect them once the launch is issued, but this means that we cannot trace the loop. This has to be fixed by creating an index space explicitly:

var M_space = ispace(int1d, M)
for i = 0, N do
  c.legion_begin_trace(...)
  __demand(__index_launch)
  for j in M_space do
    ...
  end
  c.legion_end_trace(...)
end

This is obnoxious but at least reasonably straightforward. But here's a case where the reason why the index space gets created is pretty non-obvious:

for i = 0, N do
  c.legion_begin_trace(...)
  some_task_launch(...)
  c.legion_end_trace(...)
end

Do you see the index space? No?

It's on the very first line, for i = 0, N. The reason is that when Regent launches some_task_launch, it tries to be helpful and assumes that [0, N) is the sharding space for the launch. This is maybe not the best choice in this case, but it makes more sense in cases where we have inner loops that cannot be turned into index launches: at least Regent can associate a sharding space and point with each individual task and more or less get the right distribution.

I don't see a way to fix this without some heavy-duty LICM in the compiler. And worse, because the user can call C API methods directly for the tracing, we don't even necessarily know that the code is going to be traced.

This could also be fixed by the runtime supporting index space deletion inside traces, but I don't know if that is anywhere on the horizon.

lightsighter commented 5 months ago

This could also be fixed by the runtime supporting index space deletion inside traces, but I don't know if that is anywhere on the horizon.

I'm not going to support deletion inside traces, because such traces immediately become non-replayable forever: you can't delete the same index space (or any resource, really) twice.

I don't see a way to fix this without some heavy-duty LICM in the compiler.

Why don't you just move all deletions of implicit index spaces to the very end of a task body? I realize that might create a sub-optimal lifetime for some resources (especially in long-running tasks), but it would be sound. It's like when C++ implicitly destructs objects on the stack at the end of a scope.
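
Concretely, for the first example above, that scheme would behave something like this (a conceptual sketch only; main is an illustrative name and this is not literal compiler output):

task main()
  for i = 0, N do
    c.legion_begin_trace(...)
    __demand(__index_launch)
    for j = 0, M do   -- an implicit [0, M) index space is still created here
      ...
    end
    c.legion_end_trace(...)
  end
  -- instead of deleting each implicit index space right after its launch,
  -- the compiler would delete all of them here, at the end of the task body,
  -- outside any trace
end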

elliottslaughter commented 5 months ago

the very end of a task body? [...] It's like when C++ implicitly destructs objects on the stack at the end of a scope.

You're right about it being sound (in the part of this quote I elided), but there is a direct contradiction here: (a) end of task body vs (b) end of a scope.

Regent is currently attempting to do (b), just like C++ with RAII. A temporary object has no scope so it gets destroyed immediately. Thus the issue.

I could do (a), but I'm not sure if that's really better. It does allow tracing but you're accumulating a potentially unbounded number of index spaces, which is still a de facto leak (even if they'll all be collected by the end of the task).

lightsighter commented 5 months ago

I guess if you're doing it via scopes, then in the case of this code:

for i = 0, N do
  c.legion_begin_trace(...)
  __demand(__index_launch)
  for j = 0, M do
    ...
  end
  c.legion_end_trace(...)
end

why isn't the "end of the scope" after the last statement, e.g. the end trace call? So the deletion is done outside of the trace.

elliottslaughter commented 5 months ago

The thing to remember is that these are just demonstrative cases and actual code can get arbitrarily more complicated. For example, here's one variation that I had forgotten about in my original comment:

var M : int
while ... do
  c.legion_begin_trace(...)
  M = 0
  for i = 0, N do
    M += 1 -- or any other arbitrary computation
    __demand(__index_launch)
    for j = 0, M do
      ...
    end
  end
  c.legion_end_trace(...)
end

M is now a loop-carried dependency. (Or we can model this as any other arbitrary computation.) This means there is no opportunity to do LICM; we are simply stuck creating it inline.

Additionally, there can be any level of nesting between the traced loop and our index launch, or even function calls. Not every Regent application has the entirety of its main loop exposed in this way; there can be additional levels of indirection through __demand(__local) task calls.

Why is it not possible to defer the index space deletion to the end of the trace? That also seems like a sound implementation. If you support index space creation it seems like the inverse operation should also be supported.

lightsighter commented 5 months ago

LICM is probably what we really want here for optimization purposes, but absent that, this feels like the right semantics.

If you're worried about the performance of back-to-back trace replays, we can actually safely add deletion operations to the list of special operations we allow between traces without invalidating preconditions: by definition, deletions can't mess with the preconditions for a trace, since that would be a use-after-free. I can put that optimization in so that even if you have those implicit deletions between traces you'll still get the fast back-to-back trace replays.

lightsighter commented 5 months ago

Why is it not possible to defer the index space deletion to the end of the trace? That also seems like a sound implementation. If you support index space creation it seems like the inverse operation should also be supported.

We don't support index space creation inside of traces either. It happens to work in the case of immediate index space creations because they don't produce operations that are captured by the trace. If you tried to do a deferred index space creation (e.g. from a future), it would result in an error.

As for why you can't delete an index space inside a trace: the region requirements for that deletion operation would have to be different on the next replay of the trace, which violates the requirements of tracing.

elliottslaughter commented 5 months ago

So the only way I see to implement this at the moment is to modify the C API to buffer index space deletions and apply them all after the legion_runtime_end_trace call. Everything else is insufficient because of (a) local task launches and (b) code split across compilation units.

The problem with doing it at the C API layer is that if the user uses a C++ library that does tracing, I can't capture that. But I suppose this is enough of a corner case that maybe it's fine.

lightsighter commented 5 months ago

So the only way I see to implement this at the moment is to modify the C API to buffer index space deletions and apply them all after the legion_runtime_end_trace call. Everything else is insufficient because of (a) local task launches and (b) code split across compilation units.

Can you explain why that is? Why can't Regent just look at the syntax of the body of the loop and say I'm going to put the deletion statements after all the statements in the scope? It shouldn't even need to care whether the statements in the scope are Regent statements or some black-box external C statements.

elliottslaughter commented 5 months ago

Fundamentally, tracing uses dynamic scope, while loop bodies are lexical scope. That means you can end up in situations like:

__demand(__local)
task rk_stage(N : int)
  __demand(__index_launch)
  for i = 0, N do -- implicit index space created here
    ...
  end
  ...
end

And then use it later like:

for t = 0, T do
  c.legion_runtime_begin_trace()
  for rk = 0, 4 do
    rk_stage(N)
  end
  ... -- other stuff
  c.legion_runtime_end_trace()
end

In theory these tasks could even be in different compilation units and thus not visible to the compiler at the same time. And again, this is just a demonstrative example, and real codes can apply arbitrary additional levels of loops, scopes, task calls (as long as they are local), etc.

There is in general no meaningful way for Regent to lift the implicit index spaces out far enough such that they would be outside of the enclosing trace calls.

lightsighter commented 5 months ago

Yet another reason for automatic tracing.

lightsighter commented 5 months ago

I guess the other thing I'm confused about is why you need to make an implicit index space. We still support "immediate" domains for index space launches: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h?ref_type=heads#L1675
You shouldn't even need to make an implicit index space in these cases.

lightsighter commented 5 months ago

The only reason to make an implicit index space is if you're doing LICM and you're going to reuse the index space over and over.

elliottslaughter commented 5 months ago

The sharding space for single task launches must always be a full index space:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h?ref_type=heads#L1574

I need to look at the index task launch case further; I do not immediately see where in the compiler we're actually making index spaces. The code that I do see appears to make only domains, as you suggest. (And the sharding space in that case is unset, so it inherits from the launch domain.)

elliottslaughter commented 5 months ago

Ok, I see it now. Regent was lifting += into a task and issuing a single task launch for that task, to which it then assigned a sharding space. The sharding space goes through the above API, which forces a conversion into a full index space.

elliottslaughter commented 4 months ago

  1. I implemented 1 level of LICM for the sharding spaces associated with loops. They are now generated before the loop, preserved through the entire loop body, and cleanup is deferred until after the loop finishes (see the sketch after this list). This has been effective, at least in the cases I've tested here, at getting the subsequent deletions out of traces. In principle, a sufficiently deeply nested loop (especially with functions) would defeat this, but it's been good enough for all our applications so far.
  2. Looking at my use case 1 from https://github.com/StanfordLegion/legion/issues/606#issuecomment-2115922544, it appears that this is a non-issue because I simply don't delete index partitions 🤣. So that will need to be dealt with at some point, but I think not yet. If we recursively delete from the root index space then it will not be an issue for leaks at program termination, though it could potentially contribute to memory usage growth during application execution. However, as the title of this issue indicates, this is mainly about index and field spaces, so I think I will defer this.
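
Roughly, for the single-task-launch example from earlier in this thread, the transformation in item 1 behaves like this (a conceptual sketch only, not literal compiler output; sharding_space is a made-up name for the compiler temporary):

var sharding_space = ispace(int1d, N)
for i = 0, N do
  c.legion_begin_trace(...)
  some_task_launch(...)
  c.legion_end_trace(...)
end
-- the deletion now happens after the loop, outside every trace
c.legion_index_space_destroy(__runtime(), __context(), __raw(sharding_space))
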
lightsighter commented 4 months ago

because I simply don't delete index partitions 🤣. So that will need to be dealt with at some point, but I think not yet. If we recursively delete from the root index space then it will not be an issue for leaks at program termination, though it could potentially contribute to memory usage growth during application execution.

Right, this will only apply to use cases where you are making lots of index partitions over the lifetime of the application. If you're making just a few and reusing them then it won't be an issue, and when you delete the root index space Legion will just clean everything up.

elliottslaughter commented 4 months ago

Re: my point about LICM here https://github.com/StanfordLegion/legion/issues/606#issuecomment-2197683702. Unfortunately this is not sufficient. S3D has deeper loop nesting which (unsurprisingly) causes my simple hack to fail. So now we're back to these options:

  1. Disable sharding of single tasks entirely (perhaps via a flag). It'll work well enough if the application uses 100% index launches, but it will fall over horribly if the user accidentally misses an index launch. I'm exploring this with S3D to confirm it works, but it's strictly speaking a breaking change as some applications will need this flag now to function correctly.
  2. Implement real LICM. However there is a danger even this will be defeated by local task launches, if we do not implement an interprocedural analysis. (I don't really like this option in either variant. Even the single-task optimization is more than I really want to bite off right now and it can still fail in obscure ways when it doesn't work.)
  3. Modify Legion to allow either:
    1. Assigning immediate index spaces as the sharding spaces for single task launches.
    2. Permitting deletion of index spaces during traces.

elliottslaughter commented 4 months ago

  4. Try to find a clever way to distinguish between loops where single task launches require sharding and loops where that's not required. For example, a loop that contains both index launches and single task launches is likely to be an outer loop, and not simply a loop where the user has forgotten to annotate __demand(__index_launch) and didn't realize it wasn't being applied. Therefore we could reasonably shut off sharding in that case, because it likely wouldn't have helped anyway and isn't likely to be catastrophic to performance. However, we are still making assumptions about the structure of user programs (e.g., that users do not mix index launches and single task launches in the same outer loop where the user would expect the single task launches to get sharded).