LonelyCat124 opened this issue 4 years ago
I'll work on the profile for the test code now - even that generates 22M objects for a run, though. I'll try trimming it if that's still not possible.
The partition-by-field operation itself (where you have to actually read the values of a region's fields) is more heavyweight than the equal-partition operation (where you just draw up the sub-region boundaries using index math), so the slowdown makes some sense there. However, partitioning should be an infrequent operation. Is this partitioning happening on each pass through this code, and is that code executed repeatedly?
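To make the comparison concrete, here is a rough sketch of the two calls using the Legion C++ runtime API; the region, color space, and field names (`cells`, `colors`, `FID_COLOR`) are hypothetical and not taken from the code under discussion.

```cpp
// Equal partition: only index arithmetic on the parent index space is needed,
// no field data is read, so it is comparatively cheap.
IndexPartition ip_eq =
    runtime->create_equal_partition(ctx, cells.get_index_space(), colors);

// Partition by field: the runtime must read the FID_COLOR value of every
// element of `cells` to decide which subregion it belongs to, which is why
// this operation is noticeably more expensive.
IndexPartition ip_fld =
    runtime->create_partition_by_field(ctx, cells, cells, FID_COLOR, colors);
```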
For both the real and test codes, the partition happens once (for the real code, Regent's type checking prevents partitions from being assigned to the same variable).
Shouldn't `Future start = runtime->get_current_time_in_microseconds(ctx);` come right after the first fence?
The current code I'm running uses the sequence fence, start timestamp, fence, code, fence, end timestamp, to make the timing as accurate as I can work out how to make it.
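For reference, a minimal sketch of that pattern with the Legion C++ runtime API (variable names are placeholders, and the work launched in the middle stands in for the application's timestep loop):

```cpp
Future pre_fence = runtime->issue_execution_fence(ctx);
// Passing the fence future as a precondition delays the timestamp until
// everything issued before the fence has actually completed.
Future t_start = runtime->get_current_time_in_microseconds(ctx, pre_fence);

// ... launch the tasks being timed here ...

Future post_fence = runtime->issue_execution_fence(ctx);
Future t_end = runtime->get_current_time_in_microseconds(ctx, post_fence);

long long elapsed_us =
    t_end.get_result<long long>() - t_start.get_result<long long>();
```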
OK, breaking down the trace: this is 15s->20s in (the overall timed part of the run was around 45s).
I zoomed in and all the tasks in the CPU channel appeared to be prof tasks - I tried to expand the util section of the trace, but it wouldn't display.
Another trace, from 35s to the end. The block of work tasks is the 1000 self tasks, which each take 1.5-2.5ms. Before those are the timestep tasks, which should be lighter versions of those tasks (0.15ms) - those I expect to be a bit light, but...
Looking at the util stuff here as it loads:
The giant blocks of ~50000 tasks are Defer Task Perform Mapping tasks (and the occasional Defer Task Launch), each with a runtime of ~20us:
> For both the real and test codes, the partition happens once (for the real code, Regent's type checking prevents partitions from being assigned to the same variable).
If the partitioning only happens once per app execution then a difference of a few seconds shouldn't be a big deal.
We will need to see the full profiles rather than screenshots to be able to give an accurate diagnosis. It is surprising to see the application processors sitting idle while the runtime is working; normally the app and the runtime should be working concurrently to some extent. Your application might also benefit from enabling tracing.
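As a rough illustration of what enabling tracing around the repeated part of the run might look like with the C++ runtime API (the trace ID, loop bounds, and `launch_timestep` helper are hypothetical placeholders; for a Regent program the equivalent is, I believe, a `__demand(__trace)` annotation on the main loop):

```cpp
const TraceID TIMESTEP_TRACE = 1;
for (int step = 0; step < num_steps; step++) {
  runtime->begin_trace(ctx, TIMESTEP_TRACE);
  // The same sequence of task launches must be issued in every traced iteration.
  launch_timestep(ctx, runtime, step);
  runtime->end_trace(ctx, TIMESTEP_TRACE);
}
```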
Popping up from this discussion, it sounds like we have veered off from the original topic of this bug report. It would be best to continue this discussion on slack or move it to an appropriate issue, and close this issue if the original problem has been fixed.
Yeah - I think we have. I think the original bug was fixed (or at least there's a patch in this thread that partly resolves it), but a second bug (a crash due to memory leaks) was also discussed here - that bug is still reproducible at commit e2aac08 of https://github.com/stfc/RegentParticleDSL
Instructions to reproduce are at https://github.com/StanfordLegion/legion/issues/929#issuecomment-698252787
I have a timestep-based code in Regent, which on my current testcase of interest hangs after 5 timesteps, but appears not to do so on a smaller testcase.
I've tried:
I can't remember if I ran with `-lg:partcheck` yet, so I've set that now. I pulled two stack traces from gdb at the point the code seemed to be frozen (around 12 hours apart). Most threads appear to be in `Realm::condvar::Wait`, with what seems to be a couple of threads acquiring an AutoLock and another thread in `Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::lookup_entry`.
I had one run crash immediately with:
but have not been able to replicate that error so have no idea of the cause.
Is there anything else I should try in order to work out what could be causing this? The smaller testcase doesn't necessarily produce that many fewer tasks, but could large numbers of tasks cause hangs (that sounds unlikely)?