StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: seg fault with -lg:no_trace_optimization #1442

Open syamajala opened 1 year ago

syamajala commented 1 year ago

I had to lift the tracing in S3D to trace multiple timesteps at once. The trace optimization takes a very long time, but it does eventually finish. Running with -lg:no_trace_optimization segfaults.

When running in debug mode I'm hitting this assertion:

s3d.x: /scratch2/seshu/legion_s3d_viz2/legion/runtime/legion/legion_trace.cc:9594: virtual void Legion::Internal::GetTermEvent::execute(std::vector<Legion::Internal::ApEvent>&, std::map<unsigned int, Legion::Internal::ApUserEvent>&, std::map<Legion::Internal::ContextCoordinate, Legion::Internal::MemoizableOp*>&, bool): Assertion `operations.find(owner) != operations.end()' failed.

There are some hung processes on g0001. PIDs: 449568, 449567, and 449563
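The assertion above guards a map lookup, which would explain why a debug build aborts with a message while an optimized build crashes inside the same function. Below is a minimal sketch of that pattern, not the Legion source: `Key`, `OpTable`, `find_op`, and `get_op_checked` are hypothetical stand-ins for the real types and code in legion_trace.cc.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins; the real map is keyed by ContextCoordinate and
// stores MemoizableOp pointers.
using Key = unsigned;
using OpTable = std::map<Key, std::string>;

// Defensive lookup: returns nullptr when `owner` was never recorded,
// instead of dereferencing the end() iterator.
const std::string *find_op(const OpTable &operations, Key owner) {
  auto finder = operations.find(owner);
  if (finder == operations.end())
    return nullptr;
  return &finder->second;
}

// Assertion-style lookup, shaped like the failing check in the report.
// With assertions enabled a missing key aborts with a message; with
// NDEBUG defined the check disappears and a missing key means
// dereferencing end(), which is undefined behavior and may segfault.
const std::string &get_op_checked(const OpTable &operations, Key owner) {
  auto finder = operations.find(owner);
  assert(finder != operations.end());
  return finder->second;
}
```

Calling `find_op` with a key that is not in the table returns `nullptr`, so the caller can report the missing entry rather than crash.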

syamajala commented 1 year ago

I can't seem to scale with tracing because we start hitting OOM errors. I'm guessing it's due to the trace optimizations, so I will run without tracing for now.

syamajala commented 1 year ago

I'm not sure if we can scale without tracing either; performance doesn't look great: http://sapling.stanford.edu/~seshu/viz/average_21to30_iteration.html

lightsighter commented 1 year ago

The trace optimization takes a very long time, but it does eventually finish. Running with -lg:no_trace_optimization segfaults.

Unless this is really important, I'm going to ignore this segfault until after the control replication merge. I'm pretty sure I know what is causing it, and the answer is that the tracing framework needs to be refactored to make this entire instruction go away.

I can't seem to scale with tracing because we start hitting OOM errors.

What makes you say that the tracing is causing the OOM errors?

syamajala commented 1 year ago

I guess it's not really important to fix this right now. We have gotten the visualization library integrated with S3D. It produces some high quality images and based on the profile it does not seem to add any overhead to the simulation. It would be nice to scale on Frontier. I'm guessing that the tracing rewrite is something that is going to take a couple of months to fix? I think I have gotten things to look as good as I possibly can without tracing: http://sapling.stanford.edu/~seshu/viz/legion_prof.6/ I will try doing one more set of runs with my latest fixes and see what scaling looks like.

I think it is specifically the tracing optimization that is causing OOM. I can run without tracing and also with -lg:no_physical_tracing without hitting OOM, although based on that profile above it does seem like logical tracing may not be working? Is there a way to debug why?

lightsighter commented 1 year ago

I'm guessing that the tracing rewrite is something that is going to take a couple of months to fix?

I think it will be considerably less than that, but it is blocked on the control replication merge.

I think it is specifically the tracing optimization that is causing OOM.

Can you get backtraces when you OOM? I'd be pretty surprised if any of the tracing optimizations consumed that much memory; they should all be O(N) in memory in the size of the graph. The computational cost of some of them, like the transitive reduction, is quite a bit higher, but that's different from the memory cost.
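To illustrate the distinction between memory cost and computational cost, here is a minimal sketch of a naive transitive reduction over a DAG stored as adjacency lists. This is not Legion's implementation; `Graph`, `reachable_indirectly`, and `transitive_reduction` are names invented for this example. The working memory is one visited array and one stack per query, so it stays linear in the graph size, while the running time is roughly one graph search per edge.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjacency lists of a DAG: g[u] holds the direct successors of node u.
using Graph = std::vector<std::vector<std::size_t>>;

// True if `target` can be reached from `source` by a path that does not
// start with the direct edge source -> target.
static bool reachable_indirectly(const Graph &g, std::size_t source,
                                 std::size_t target) {
  std::vector<bool> visited(g.size(), false);  // O(V) extra memory
  std::vector<std::size_t> stack;
  for (std::size_t next : g[source]) {
    if (next != target && !visited[next]) {
      visited[next] = true;
      stack.push_back(next);
    }
  }
  while (!stack.empty()) {
    std::size_t node = stack.back();
    stack.pop_back();
    if (node == target)
      return true;
    for (std::size_t next : g[node]) {
      if (!visited[next]) {
        visited[next] = true;
        stack.push_back(next);
      }
    }
  }
  return false;
}

// Keeps an edge u -> v only when no longer path from u to v exists.
// Time is about O(E * (V + E)); memory stays O(V + E).
Graph transitive_reduction(const Graph &g) {
  Graph reduced(g.size());
  for (std::size_t u = 0; u < g.size(); ++u)
    for (std::size_t v : g[u])
      if (!reachable_indirectly(g, u, v))
        reduced[u].push_back(v);
  return reduced;
}
```

For a three-node graph with edges 0 -> 1, 1 -> 2, and 0 -> 2, the reduction drops 0 -> 2 because node 2 is already reachable through node 1.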

although based on that profile above it does seem like logical tracing may not be working? Is there a way to debug why?

It looks normal to me given this version of logical tracing that still requires doing the region tree traversals. There is a new one in shardrefine (to go with the new refinement algorithms) that doesn't require the traversals to be done.

syamajala commented 10 months ago

This issue still appears to be there with shardrefine. I either see the transitive reduction take a very long time: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/viz/legion_prof/

or a seg fault when I try running with -lg:no_trace_optimization:

[26] Thread 5 (Thread 0x7fff9adfeb80 (LWP 42519) "s3d.x"):
[26] #0  0x00007fffe4af274f in wait4 () from /lib64/libc.so.6
[26] #1  0x00007fffe4a69ba7 in do_system () from /lib64/libc.so.6
[26] #2  0x00007fffe1c6c556 in gasneti_system_redirected () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/language/build/lib/librealm.so.1
[26] #3  0x00007fffe1c6befb in gasneti_bt_gdb () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/language/build/lib/librealm.so.1
[26] #4  0x00007fffe1c6272f in gasneti_print_backtrace () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/language/build/lib/librealm.so.1
[26] #5  0x00007fffe1d7fb8a in gasneti_defaultSignalHandler () from /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/language/build/lib/librealm.so.1
[26] #6  <signal handler called>
[26] #7  0x00007fffe2ac3fd9 in Legion::Internal::GetTermEvent::execute (this=0x7ff5c4b72b00, events=..., user_events=..., operations=..., recurrent_replay=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/legion/legion_trace.cc:10247
[26] #8  0x00007fffe2ab7275 in Legion::Internal::PhysicalTemplate::execute_slice (this=<optimized out>, slice_idx=<optimized out>, recurrent_replay=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/legion/legion_trace.cc:5243
[26] #9  Legion::Internal::PhysicalTemplate::handle_replay_slice (args=<optimized out>) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/legion/legion_trace.cc:8237
[26] #10 0x00007fffe2c6d0dc in Legion::Internal::Runtime::legion_runtime_task (args=0x7fd5259ed980, arglen=<optimized out>, userdata=<optimized out>, userlen=<optimized out>, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/legion/runtime.cc:32670
[26] #11 0x00007fffe16f952d in Realm::LocalTaskProcessor::execute_task (this=0x49a7370, func_id=4, task_args=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/realm/proc_impl.cc:1175
[26] #12 0x00007fffe1739d9c in Realm::Task::execute_on_processor (this=0x7fd5259ed800, p=...) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/realm/tasks.cc:326
[26] #13 0x00007fffe17403b3 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=0x0) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/realm/tasks.cc:1687
[26] #14 0x00007fffe173d69f in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4918210) at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/realm/tasks.cc:1160
[26] #15 0x00007fffe17482dd in Realm::UserThread::uthread_entry () at /lustre/orion/cmb138/scratch/seshuy/legion_s3d_viz/legion/runtime/realm/threads.cc:1355
[26] #16 0x00007fffe4a72600 in ?? () from /lib64/libc.so.6
[26] #17 0x0000000000000000 in ?? ()

lightsighter commented 10 months ago

I either see the transitive reduction take a very long time:

The long transitive reduction is not preventing other things from running. I don't understand what you think is wrong with that profile?

or a seg fault when I try running with -lg:no_trace_optimization:

The same thing I said before still applies.

Unless this is really important, I'm going to ignore this segfault until after the control replication merge. I'm pretty sure I know what is causing it, and the answer is that the tracing framework needs to be refactored to make this entire instruction go away.

This doesn't have anything to do with shardrefine. The tracing code just needs to be ripped out and rewritten from scratch.

syamajala commented 10 months ago

I merged the vis branch with the subrank branch and am trying to scale on Frontier. We're never shutting down due to the transitive reduction. Even just on 1 node I let it run for 10 minutes after the timestep loop finished and it was still trying to do the transitive reduction.

lightsighter commented 10 months ago

There is a -lg:no_transitive_reduction flag now. Try it out.

syamajala commented 10 months ago

Will try to test this in the next day or two.

syamajala commented 10 months ago

This seems to be working. Given that you want to rewrite tracing, the original problem isn't going to be fixed, but we have a workaround now. Should I close this issue or leave it open?

lightsighter commented 10 months ago

Let's leave it open for now, but reference it from #407 so we don't lose track of it.