Open syamajala opened 1 year ago
There is an additional 19 GB worth of allocations here as well:
malloc(4096) = 0x7ff8c4255510 hash: 9328455155272471744
count: 4992314 size: 19.044166564941406(GB)
stack trace: 11 frames
[0] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::ReplIndexTask::trigger_prepipeline_stage()+0x346) [0x7fffe2504346]
[1] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Operation::execute_prepipeline_stage(unsigned int, bool)+0x1dc) [0x7fffe244c40c]
[2] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::process_prepipeline_stage()+0x40e) [0x7fffe231c43e]
[3] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::InnerContext::handle_prepipeline_stage(void const*)+0xd) [0x7fffe232c44d]
[4] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/liblegion.so.1(Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0xf8) [0x7fffe27c8b78]
[5] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x649eed) [0x7fffe12adeed]
Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&)
[6] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68794d) [0x7fffe12eb94d]
Realm::Thread::stop_operation(Realm::Operation*)
[7] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68dea3) [0x7fffe12f1ea3]
Realm::UserThreadTaskScheduler::execute_task(Realm::Task*)
[8] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x68b18f) [0x7fffe12ef18f]
[9] = /lustre/orion/cmb138/scratch/seshuy/legion_s3d_flow_control/legion/language/build/lib/librealm.so.1(+0x695dcd) [0x7fffe12f9dcd]
Realm::UserThread::uthread_entry()
[10] = /lib64/libc.so.6(+0x61600) [0x7fffe4a72600]
All three stack traces are from a single rank.
The counts of allocations from these locations are completely ridiculous. Either the window wait is just hopelessly broken (doubtful), or you turned off the window wait by trying to use frames, and you've set the number of allowed outstanding frames in your mapper to be so large that it allows for tens of millions of outstanding tasks. How many frames are you allowing outstanding at a time? Please add prints to your top level task to confirm how many outstanding frames you see running at the same time.
We are using frames.
min_frames_to_schedule in configure_context() is set to 2, and we are running with the default max_outstanding_frames, which is 2.
I guess setting min_frames_to_schedule to 0 and max_outstanding_frames to 2 fixes this issue.
The full production case is working for me too now. I set min_frames_to_schedule to 0 and max_outstanding_frames to 10.
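For anyone following along, the two settings reported above amount to something like the sketch below. This is not the actual S3D mapper code: `FrameConfigSketch` and `set_frame_limits` are hypothetical names, and the struct only stands in for the two frame-related fields that the mapper sets in configure_context().

```cpp
// Hypothetical stand-in for the output struct a mapper fills in
// configure_context(). Only the two fields discussed in this thread are
// modeled; the real struct is assumed to have more.
struct FrameConfigSketch {
  unsigned min_frames_to_schedule;
  unsigned max_outstanding_frames;
};

// Hypothetical helper: write the frame limits into any output object that
// exposes these two fields.
template <typename Output>
void set_frame_limits(Output &output, unsigned min_frames,
                      unsigned max_frames) {
  output.min_frames_to_schedule = min_frames;
  output.max_outstanding_frames = max_frames;
}
```

With this, the smaller case corresponds to `set_frame_limits(output, 0, 2)` and the production case to `set_frame_limits(output, 0, 10)`, assuming `output` is the config object passed to configure_context().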
I don't want to belabor this if Seshu and Mike are satisfied, but to me it doesn't feel like we've fully debugged this.
Mike's comment in https://github.com/StanfordLegion/legion/issues/1516#issuecomment-1656363726 seems to suggest a number of issued frames substantially larger than 2. But the min_frames_to_schedule reported by Seshu is only 2, a value equal to the max_outstanding_frames and hardly large enough to cause a problem. (The S3D frame has many tasks, but not that many.)
Is it possible we're looking at a real scheduler bug here?
I'm not satisfied either. This bug needs to stay open until we have a minimum reproducer and have fixed it. If there is an issue with frames, it should be nearly trivial to reproduce it. @syamajala please try to make a reproducer that is not all of S3D.
For whatever reason I am not able to reproduce this issue anymore when I go back to the old commit of S3D. I only changed 1 line in the mapper...
Is it possible you were linking against a different version of Legion?
We are hitting OOM at 1 node for the production case of S3D that we want to run on Frontier. I ran the mem_trace tool and here is what I've found.
133 GB worth of 4096 mallocs:
136 GB worth of 4180 mallocs: