LonelyCat124 closed this issue 3 years ago.
Can you capture Legion Prof profiles running both ways?
Profiles both ways are available here: https://drive.google.com/drive/folders/1tU7x8M21HkPngbUGiLUQ4V3OSuRMgCJL?usp=sharing (~50MB each). inbuild_prof.gz is the one using attach etc., while alt_prof.gz uses the HDF5 code directly.
I don't see anything unusual in these profiles from a Legion perspective. The "inbuild" profile looks completely limited by the application CPU processor. The Legion utility processor is not busy at all, and neither is the Realm DMA channel which would be moving the HDF5 data. It looks to me like main is busy for very long stretches of time, often 50-100ms, where it is not asking Legion to do any additional work. Whatever it's doing, it's running your application CPU code, so you should be able to track down where the time is going. It's somewhere in your application-side code.
Ok - I'll rerun a profile with a standard profiler to see if I can track down what is happening. The two codes are identical outside of the file I/O handling - i.e. direct HDF5 calls vs. region/attach/release/etc.
Ok - I ran both versions through VTune's hotspots collection on the HPC machine (1 thread each for the utility and CPU processors), to ensure no other noise could interfere.
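(For reference, the collections were run roughly as follows - the result-directory names here are illustrative, not the ones I actually used:)

```
vtune -collect hotspots -result-dir vtune_manual  -- ./test_wcsph2.exe -ll:cpu 1 -ll:util 1
vtune -collect hotspots -result-dir vtune_inbuilt -- ./test_wcsph2.exe -ll:cpu 1 -ll:util 1
```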
The manual version hotspots were:

| Function | Module | CPU Time |
| --- | --- | --- |
| `pthread_cond_timedwait` | libpthread.so.0 | 4.569s |
| `std::_Rb_tree<unsigned int, ...>::_M_erase` | libregent.so | 4.289s |
| `Legion::Internal::PhysicalRegionImpl::get_instance_info` | libregent.so | 2.983s |
| `posix_memalign` | libc.so.6 | 2.821s |
| `std::_Rb_tree<unsigned int, ...>::_M_copy<..._Alloc_node>` | libregent.so | 2.590s |

so no big hotspots anywhere.
With the inbuilt HDF5 implementation I get:

| Function | Module | CPU Time |
| --- | --- | --- |
| `std::_Rb_tree_increment` | libstdc++.so.6 | 64.973s |
| `std::vector<long long>::_M_realloc_insert<long long const&>` | libregent.so | 58.221s |
| `Legion::Internal::TaskContext::check_region_dependence` | libregent.so | 56.902s |
| `_int_free` | libc.so.6 | 48.841s |
| `Legion::PhysicalRegion::is_mapped` | libregent.so | 35.146s |
I think the next 2 hotspots might be the creation and deletion of those regions again (Legion::PhysicalRegion::~PhysicalRegion and Legion::PhysicalRegion::PhysicalRegion). I'm going to retry modifying the code to reuse the region (rather than creating and deleting it) and see what that profile looks like.
Update: I removed the region creation/deletion in favour of a single initial "io" region which uses attach/acquire etc. to do file IO, and I get similar hotspots to creating regions each time:

| Function | Module | CPU Time |
| --- | --- | --- |
| `std::_Rb_tree_increment` | libstdc++.so.6 | 66.887s |
| `std::vector<long long>::_M_realloc_insert<long long const&>` | libregent.so | 58.083s |
| `Legion::Internal::TaskContext::check_region_dependence` | libregent.so | 56.868s |
| `_int_free` | libc.so.6 | 47.759s |
| `Legion::PhysicalRegion::is_mapped` | libregent.so | 36.460s |
This seems at odds with what you see from legion_prof - anything else I could look at to try to see what is causing this?
We need to see backtraces for those hotspots. Just seeing the lowest-level function call is not helpful. VTune will know what the backtraces are, but you'll probably need to dig around for them.
Backtrace for the top hotspot, I think (no line numbers, as I'm not sure how to compile Regent with debug information while still optimised for performance - maybe LEGION_BACKTRACE=1 would work, but I'm not sure if that needs debug mode on):
```
std::_Rb_tree_increment                                     libstdc++.so.6   66.887s
<- Legion::Internal::TaskContext::check_region_dependence   libregent.so     63.928s
<- Legion::Internal::TaskContext::find_conflicting_regions  libregent.so
<- Legion::Internal::InnerContext::execute_task_launch      libregent.so
<- Legion::Internal::InnerContext::execute_task             libregent.so
<- Legion::Internal::Runtime::execute_task                  libregent.so
<- Legion::Runtime::execute_task                            libregent.so
<- legion_task_launcher_execute                             libregent.so
<- $<main>                                                  test_wcsph2.exe
<- $__regent_task_10015_primary                             test_wcsph2.exe
<- Realm::LocalTaskProcessor::execute_task                  libregent.so
<- Realm::Task::execute_on_processor                        libregent.so
<- Realm::UserThreadTaskScheduler::execute_task             libregent.so
<- Realm::ThreadedTaskScheduler::scheduler_loop             libregent.so
<- Realm::UserThread::uthread_entry                         libregent.so
```
This looks like it matches up with how legion_prof says it's in the main task - I'm not sure what it's doing though, as the only difference is the code that attaches/acquires/launches the copy task.
Other hotspots that I found:

```
std::vector<long long>::_M_realloc_insert<long long const&>  libregent.so  58.083s
<- Legion::Internal::RegionTreeForest::compute_index_path    libregent.so  57.973s
<- Legion::Internal::TaskContext::check_region_dependence    libregent.so
<- (remaining callers identical to the first backtrace, from
   find_conflicting_regions down to uthread_entry)
```
```
_int_free                                                   libc.so.6     47.759s
<- Legion::Internal::TaskContext::check_region_dependence   libregent.so  22.462s
<- (remaining callers identical to the first backtrace, from
   find_conflicting_regions down to uthread_entry)
```
```
Legion::PhysicalRegion::is_mapped                           libregent.so  36.460s
<- Legion::Internal::TaskContext::find_conflicting_regions  libregent.so  36.200s
<- (remaining callers identical to the first backtrace, from
   execute_task_launch down to uthread_entry)
```
If these are the hotspots, it suggests you are making a truly huge number of regions in this application. How many regions do you think are being made? If you are willing to assert that the runtime does not need to check the safety of each operation launch with respect to these regions, then you can run with the -lg:unsafe_launch flag. If you've messed up in your application at all though, then you will get deadlocks.
I'll have to recheck, but I think it should be creating 4 regions, or 9 if attach/release creates regions. The difference between the two versions should be +1 region for the inbuilt HDF5 version. I'll double-check the code tomorrow, but that is my current recollection.
It's making way more than that (or many fields, hard to say) if we're spending most of our time in that particular piece of code. Please generate a Legion Spy log of a shorter run with -lg:spy.
The spy logs are here https://drive.google.com/drive/folders/1tU7x8M21HkPngbUGiLUQ4V3OSuRMgCJL?usp=sharing
I'm currently running Legion Spy with -dezk on them to see what output I get, but the event graphs are taking a long time to build (at least for the inbuilt version). (This is 2 timesteps with file IO at the end of both, so about as small a run as I can make.)
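(For reference, this is roughly what I'm running - the log file names are illustrative:)

```
# collect the Legion Spy logs, then build the event graphs from them
./test_wcsph2.exe -ll:cpu 1 -ll:util 1 -lg:spy -logfile spy_%.log
python3 legion/tools/legion_spy.py -dezk spy_0.log
```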
Please make a new Legion Prof performance profile with -lg:unsafe_launch.
I'm recounting regions - I believe I create:

- particle_array (later deleted)
- config
- io_region
- padded_particle_array
- TradeQueueRegion

so exactly 5 regions should be created. The loop section of the code shouldn't create any additional regions.
It seems like -lg:unsafe_launch hangs (never producing output) if I return to the original output strategy of output every 0.001s of sim time, but if I ask for output every step it does complete 5 IO steps correctly.
Output with an initial IO step (and printing at the end of each step):

```
Timestep = 0.000059
Time = 0.000059, runtime = 0, step_count = 0   <<<< IO HERE
1
2
```
IO every step (no hang, seems to run to completion):

```
Time = 0.000059, runtime = 0, step_count = 0
1
Time = 0.000117, runtime = 0, step_count = 1
2
Time = 0.000176, runtime = 1, step_count = 2
3
Time = 0.000235, runtime = 1, step_count = 3
4
Time = 0.000293, runtime = 1, step_count = 4
5
```
I can get a profile of this if it would be useful, but I think it's a lot shorter than the previous profiles.
The non-inbuilt HDF5 version runs fine with unsafe_launch. The only task that differs between the two is the copy_task, which is used with the attach/release version to copy data into the HDF5 region.
The -lg:unsafe_launch flag is exactly what it says: it is unsafe. It removes the protections for detecting when the runtime needs to unmap and remap physical regions that are currently mapped in the executing task, in order to prevent a dependence between the parent task and a child task being launched - a dependence which will exhibit as a hang. If you're hanging with -lg:unsafe_launch, it means that the runtime is having to unmap and remap for you around these sub-task launches. That is going to be a major, major performance bug, likely big enough to account for the performance discrepancy that you are observing. You (or the Regent compiler, if this program is written in Regent) need to do a better job of managing the mapping and unmapping of regions, so that regions are unmapped for long periods of time while you launch sub-tasks and are only remapped when you absolutely have to.
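As a schematic fragment (task and field names here are hypothetical), the kind of code that forces this looks like:

```
-- main reads r directly, so r must stay mapped in main...
var total = r[0].x
-- ...and then each child launch that uses r forces the runtime to unmap r
-- around the launch and remap it afterwards:
step_task(r)       -- implicit unmap of r, launch, implicit remap
total += r[0].x    -- this direct access is why r had to be remapped
step_task(r)       -- unmap/remap again, on every such launch
```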
Ok - since it only appears to hang (or have performance degradation) when I use attach/release etc. to do file output (and not when I use HDF5 directly), I assume it's related somehow to mapping the region used for HDF5? Or I'm initialising it incorrectly/strangely (though I pretty much followed the example setup on Regent's website).
This program is written in Regent - is there a subset of mapper calls that would be easiest to look into to try to find out what's happening/try to avoid the problem? I've not really had much success wrapping my head around custom mappers yet.
This is the mapping and unmapping of PhysicalRegion objects in Legion, which are controlled by the application, in this case Regent. You can't control them with the mapper. I think you and @elliottslaughter should get to the bottom of why the Regent compiler is doing such a poor job in this case.
Regent should be optimal with respect to mapping/unmapping unless you have a statement that (a) uses a region locally and (b) passes it to a task or another remote operation, (c) at the same time. This is unusual in practice, so I'd be a bit surprised.
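As a hedged illustration (names hypothetical), such a statement would look like:

```
-- (a) r[0].x is a local access that requires r to be mapped in this task,
-- while (b) r is simultaneously passed to a sub-task, (c) in one statement:
var v = sub_task(r, r[0].x)
```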
One thing you could do is attempt to mark tasks as __demand(__leaf) or __demand(__inner). This will not change the behavior of the compiler, but it may teach you more about your code and what it is doing. If you're going back and forth, repeatedly mapping an instance, that could indicate a non-leaf, non-inner task, which is generally speaking not a recommended pattern.
If you were able to make all tasks __leaf or __inner, then this would be guaranteed not to happen.
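As a hedged illustration of what the two annotations assert (the task, field space, and field names here are hypothetical):

```
import "regent"

fspace particle { x : double }

-- __leaf: promises to launch no sub-tasks; may access region data directly.
__demand(__leaf)
task update_positions(r : region(ispace(int1d), particle))
where reads writes(r.x) do
  for i in r do
    r[i].x += 1.0
  end
end

-- __inner: promises to only launch sub-tasks and never touch region data
-- directly, so its region arguments never need to be mapped locally.
__demand(__inner)
task step(r : region(ispace(int1d), particle))
where reads writes(r.x) do
  update_positions(r)
end
```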
How does Regent deal with PhysicalRegion objects that have been mapped as part of attach operations?
I think we just store them in a variable and look them back up on detach. This is fairly hacky and would be relatively easy to mess up if the application were doing something complicated.
If I remember correctly, I think the use of attach/detach with a non-inner task is officially unsupported in Regent for this reason.
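For context, a minimal sketch of the supported shape, loosely following the attach example in the Regent reference (the field space, the file-mode constant, and the assumption that "output.h5" already contains a matching dataset are all mine, so treat this as illustrative rather than definitive):

```
import "regent"

fspace particle { x : double }
fspace hdf5_io_space { x : double }

__demand(__leaf)
task copy_task(src : region(ispace(int1d), particle),
               dst : region(ispace(int1d), hdf5_io_space))
where reads(src.x), writes(dst.x) do
  for i in src do
    dst[i].x = src[i].x
  end
end

-- main stays inner: it attaches, acquires, launches the copy, then releases
-- and detaches, without ever touching region data itself.
__demand(__inner)
task main()
  var particles = region(ispace(int1d, 1024), particle)
  var io_region = region(ispace(int1d, 1024), hdf5_io_space)
  fill(particles.x, 0.0)
  attach(hdf5, io_region.x, "output.h5", regentlib.file_read_write)
  acquire(io_region.x)
  copy_task(particles, io_region)
  release(io_region.x)
  detach(hdf5, io_region.x)
end
regentlib.start(main)
```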
Ah, ok - I think currently the code has the attach/detach in the main task, which then launches the copy task, which I've not marked as __leaf. Is main "non-inner"? I'll try marking the copy task as a leaf task on Monday and see if the issue persists.
The issue is more likely to be in the main task than in anything it calls, assuming main is the only one that uses attach/detach.
I'd start by trying to mark main as __inner and see what that says.
(Using __leaf for other tasks is good in general, but is less likely to be related to this particular issue.)
Ok - I started by trying to mark main as __inner, but this was disallowed because I was setting the initial data - which I then moved into a separate task - but some other setup code inside metaprogrammed sections still disallowed the main task being inner.
I then tried moving the timestep code into its own task and marking that as inner, but am having a few issues due to not passing the cell partition through to the task, which means the code that generates all the tasks doesn't work. I'll look into that now.
It's possible that (c) is happening, though other than the initialisation I believe all the accesses to the data should be from other tasks (though I need to double-check). I could also switch the code that does the I/O to be a task (as opposed to the current quote), but my difficulty is generating tasks with __demand, i.e.

```
function generate_task()
  -- Cannot use __demand with local
  local task generated_task()
    -- some generated task code
  end
  return generated_task
end
```

is not allowed - I assume I can maybe do something with the resulting generated_task table/object, but I've not worked it out yet. I have the same issue with generated tasks being leaf tasks, since I can't use __demand to do this.
Ok, I managed to cut some stuff from the code and use an inner task, and this fixed the problem. The only remaining issue I have is how I can generate an inner task with metaprogramming - I think with that it would all be fine.
The syntax is a bit odd, but you should be able to do:

```
local __demand(__inner) task ...
```

We don't get to control the use of local, so unfortunately we're forced to put all Regent keywords after that point.
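So your generate_task example from above would become something like (hedged sketch):

```
function generate_task()
  -- Regent annotations go after the Lua `local` keyword:
  local __demand(__inner) task generated_task()
    -- generated task body: only sub-task launches, no direct region access
  end
  return generated_task
end
```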
That seems to work - the only remaining thing I can't work out to implement this correctly is whether there's a way to pass a string into a task. I can't use rawstring since it's just a pointer under the hood - is the only option to make a region and set that each time? i.e.

```
fspace string_array {
  a_char : char
}
...
var string_reg = region(ispace(int1d, 1024), string_array)
var index : int = 0
while filename[index] ~= 0 do  -- scan up to the NUL terminator
  string_reg[int1d(index)].a_char = filename[index]
  index = index + 1
end
```

and then reconstruct the string in the inner task to attach/detach? Is there an easier way to put a string in a region?

Edit: Ah, but accessing that region prevents the task being inner... any better ideas?

Edit 2: Just found regentlib.string, which I'll try.

Edit 3: All seems to be working now.
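For anyone finding this later, the shape that works is roughly as follows (a minimal sketch; the task names, file name, and the exact cast to rawstring are my assumptions, so double-check against the Regent docs):

```
import "regent"

local c = regentlib.c

-- regentlib.string is serialized by value across task launches (unlike
-- rawstring, which is just a pointer and would dangle), so a filename can
-- safely cross a task boundary.
task report_filename(filename : regentlib.string)
  -- the cast syntax for getting a C string back out is my assumption:
  c.printf("writing %s\n", [rawstring](filename))
end

task main()
  var fn : regentlib.string = "output_step0.h5"
  report_filename(fn)
end
regentlib.start(main)
```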
I think this is all fixed (so the issue can be closed). A suggestion for the Regent website would be to add a note saying to only use attach/detach in inner tasks.
Thanks for letting us know.
I added a new warning under the "Attach and Detach" section of the reference: http://regent-lang.org/reference/ . (Unfortunately header links are broken at the moment so I can't give you a direct link, but you can search for the header.)
I'm not sure if this is simply due to me creating/deleting regions and that being overly expensive (if so, apologies), but I don't believe it is.
TLDR: Using attach/acquire/release/detach is much slower than using HDF5 directly.
I have 2 implementations of using HDF5 through Regent (with some metaprogramming). The first I created does file creation using HDF5 calls, and then:

The hdf5_io_space field space is created in the code and only contains the required output fields.

The second implementation creates the same files, but I'm using HDF5 calls directly (with metaprogramming), and creating/freeing the arrays to handle the data copies.
This is then built into a simple test code that runs a very simple Smoothed Particle Hydrodynamics test case. Files are output approximately every 0.001 sim time (~4-5s real time with no file I/O; every 17 timesteps).
The manually written HDF5 code matches this, with outputs taking place every 4-5s:

The Regent library HDF5 code (with the new region creation/deletion) takes significantly longer, and notably seems to take longer and longer over time:

Notably, I also tried creating this region initially and reusing it (instead of repeatedly creating it), so I just had:

without the region creation/deletion, and still had this timing profile:

so the region creation/deletion isn't the main cause of the slowdown. Anything I can check to see what is causing this?