StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
686 stars 145 forks source link

Legion: S3D OOM on 1 node #1516

Open syamajala opened 1 year ago

syamajala commented 1 year ago

We are hitting OOM at 1 node for the production case of S3D that we want to run on Frontier. I ran the mem_trace tool and here is what I've found.

133gb worth of 4096 mallocs:

malloc(4096) = 0x7ff8c4256e50 hash: 15964307415695065575
count: 34946195 size: 133.30915451049805(GB)
stack trace: 12 frames
  [0] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::BroadcastCollective::send_messages() const+0x82) [0x7fffe2531402]
  [1] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::ReplIndexTask::trigger_prepipeline_stage()+0x463) [0x7fffe2504463]
  [2] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Operation::execute_prepipeline_stage(unsigned int, bool)+0x1dc) [0x7fffe244c40c]
  [3] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::process_prepipeline_stage()+0x40e) [0x7fffe231c43e]
  [4] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::handle_prepipeline_stage(void const*)+0xd) [0x7fffe232c44d]
  [5] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0xf8) [0\
x7fffe27c8b78]
  [6] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x649eed) [0x7fffe12adeed]
    Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&)
  [7] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68794d) [0x7fffe12eb94d]
    Realm::Thread::stop_operation(Realm::Operation*)
  [8] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68dea3) [0x7fffe12f1ea3]
    Realm::UserThreadTaskScheduler::execute_task(Realm::Task*)
  [9] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68b18f) [0x7fffe12ef18f]
  [10] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x695dcd) [0x7fffe12f9dcd]
    Realm::UserThread::uthread_entry()
  [11] = /lib64/libc.so.6(+0x61600) [0x7fffe4a72600]

136gb worth of 4180 mallocs:

malloc(4180) = 0x7ff8c4258810 hash: 1996435970650630379
count: 34946194 size: 136.0430297628045(GB)
stack trace: 17 frames
  [0] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x647bf7) [0x7fffe12abbf7]
    Realm::Serialization::DynamicBufferSerializer::DynamicBufferSerializer(unsigned long)
  [1] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(Realm::Processor::spawn(unsigned int, void const*, unsigned long, Realm::Event, int) const+0x99) [0x7fffe12a8f29]
  [2] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::VirtualChannel::send_message(bool, Legion::Internal::Runtime*, Realm::Processor, Legion::Internal::MessageKind,\
 bool, bool, Legion::Internal::RtEvent)+0x1de) [0x7fffe2772a2e]
  [3] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::VirtualChannel::package_message(Legion::Serializer&, Legion::Internal::MessageKind, bool, Legion::Internal::RtE\
vent, Legion::Internal::Runtime*, Realm::Processor, bool, bool)+0x2b1) [0x7fffe2772771]
  [4] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Runtime::send_control_replicate_collective_message(unsigned int, Legion::Serializer&)+0x37) [0x7fffe27b0d77]
  [5] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::BroadcastCollective::send_messages() const+0x138) [0x7fffe25314b8]
  [6] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::ReplIndexTask::trigger_prepipeline_stage()+0x463) [0x7fffe2504463]
  [7] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Operation::execute_prepipeline_stage(unsigned int, bool)+0x1dc) [0x7fffe244c40c]
  [8] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::process_prepipeline_stage()+0x40e) [0x7fffe231c43e]
  [9] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::handle_prepipeline_stage(void const*)+0xd) [0x7fffe232c44d]
  [10] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0xf8) [\
0x7fffe27c8b78]
  [11] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x649eed) [0x7fffe12adeed]
    Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&)
  [12] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68794d) [0x7fffe12eb94d]
    Realm::Thread::stop_operation(Realm::Operation*)
  [13] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68dea3) [0x7fffe12f1ea3]
    Realm::UserThreadTaskScheduler::execute_task(Realm::Task*)
  [14] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68b18f) [0x7fffe12ef18f]
  [15] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x695dcd) [0x7fffe12f9dcd]
    Realm::UserThread::uthread_entry()
  [16] = /lib64/libc.so.6(+0x61600) [0x7fffe4a72600]
syamajala commented 1 year ago

There is an additional 19gb worth of allocations here as well:

malloc(4096) = 0x7ff8c4255510 hash: 9328455155272471744
count: 4992314 size: 19.044166564941406(GB)
stack trace: 11 frames
  [0] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::ReplIndexTask::trigger_prepipeline_stage()+0x346) [0x7fffe2504346]
  [1] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Operation::execute_prepipeline_stage(unsigned int, bool)+0x1dc) [0x7fffe244c40c]
  [2] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::process_prepipeline_stage()+0x40e) [0x7fffe231c43e]
  [3] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::handle_prepipeline_stage(void const*)+0xd) [0x7fffe232c44d]
  [4] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0xf8) [0x7fffe27c8b78]
  [5] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x649eed) [0x7fffe12adeed]
        Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&)
  [6] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68794d) [0x7fffe12eb94d]
        Realm::Thread::stop_operation(Realm::Operation*)
  [7] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68dea3) [0x7fffe12f1ea3]
        Realm::UserThreadTaskScheduler::execute_task(Realm::Task*)
  [8] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68b18f) [0x7fffe12ef18f]
  [9] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x695dcd) [0x7fffe12f9dcd]
        Realm::UserThread::uthread_entry()
  [10] = /lib64/libc.so.6(+0x61600) [0x7fffe4a72600]

All three stack traces are from a single rank.

lightsighter commented 1 year ago

The counts of allocations from these location is completely ridiculous. Either the window wait is just hopelessly broken (doubtful), or you turned off the window wait by trying to use frames, and you've set the number of allowed outstanding frames in your mapper to be so large that it allows for tens of millions of outstanding tasks. How many frames are you allowing outstanding at a time? Please add prints to your top level task to confirm how many outstanding frames you see running at the same time.

syamajala commented 1 year ago

We are using frames.

https://gitlab.com/legion_s3d/legion_s3d/-/blob/regent_nscbc/rhst/rhst_mapper.cc?ref_type=heads#L1088-1095

min_frames_to_schedule in configure_context() is set to 2 we are running with the default max_outstanding_frames which is 2.

syamajala commented 1 year ago

I guess setting min_frames_to_schedule to 0 and max_outstanding_frames to 2 fixes this issue.

syamajala commented 1 year ago

The full production case is working for me too now. I set min_frames_to_schedule to 0 and max_outstanding_frames to 10.

elliottslaughter commented 1 year ago

I don't want to belabor this if Seshu and Mike are satisfied, but to me it doesn't feel like we've fully debugged this.

Mike's comment in https://github.com/StanfordLegion/legion/issues/1516#issuecomment-1656363726 seems to suggest a number of issued frames substantially larger than 2. But the min_frames_to_schedule reported by Seshu is only 2, a value equal to the max_outstanding_frames and hardly large enough to cause a problem. (The S3D frame has many tasks, but not that many.)

Is it possible we're looking at a real scheduler bug here?

lightsighter commented 1 year ago

I'm not satisfied either. This bug needs to stay open until we have a minimum reproducer and have fixed it. If there is an issue with frames, it should be nearly trivial to reproduce it. @syamajala please try to make a reproducer that is not all of S3D.

syamajala commented 1 year ago

For whatever reason I am not able to reproduce this issue anymore when I go back to the old commit of S3D. I only changed 1 line in the mapper...

lightsighter commented 1 year ago

Is it possible you were linking against a different version of Legion?