StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent: Code hang #929

Open LonelyCat124 opened 4 years ago

LonelyCat124 commented 4 years ago

I have a timestep-based code in Regent which hangs after 5 timesteps on my current testcase of interest, but appears not to do so on a smaller testcase.

I've tried:

  1. Compiling in debug mode - no difference, no errors
  2. Running the smaller testcase with -fbounds-check (no errors)
  3. Running with freeze-on-error
  4. Running with in-order execution

I can't remember if I ran with -lg:partcheck yet, so I've set that now.

I pulled two stack traces from gdb at the point the code seemed to be frozen (around 12 hours apart). Most threads appear to be in Realm::condvar::Wait, with what seems to be a couple of threads acquiring an AutoLock and another thread in Realm::DynamicTable<Realm::DynamicTableAllocator<Realm::SparsityMapImplWrapper, 10ul, 4ul> >::lookup_entry.

I had one run crash immediately with:

terra: /netfs/smain01/scafellpike/local/HCH028/mjm02/axc67-mjm02/ECP/PSycloneBench/benchmarks/nemo/nemolite2d/manual_versions/regent/legion/runtime/legion/region_tree.cc:16345: void Legion::Internal::RegionNode::add_child(Legion::Internal::PartitionNode*): Assertion `color_map.find(child->row_source->color) == color_map.end()' failed

but I have not been able to reproduce that error, so I have no idea of the cause.

Is there anything else I should try to work out what could be causing this? The smaller testcase doesn't necessarily produce that many fewer tasks, but could large numbers of tasks cause hangs (that sounds unlikely to me)?

LonelyCat124 commented 4 years ago

I'll work on the profile for the test code now - even that generates 22M objects for a run, though. I'll try trimming it if that's still not possible.

> The partition-by-field operation itself (where you have to actually read the values of a region's fields) is more heavyweight than the equal-partition operation (where you just draw up the sub-region boundaries using index math), so the slowdown makes some sense there. However, partitioning should be an infrequent operation. Is this partitioning happening on each pass through this code, and is that code executed repeatedly?

For both the real and test codes the partition happens once (for the real code, Regent's type checking prevents a partition from being reassigned to the same variable).
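For reference, here is a minimal Regent sketch contrasting the two partitioning operations discussed above; the `particle` field space, the `owner` field, the task names, and the sizes are illustrative stand-ins, not taken from the code in this issue:

```
import "regent"

-- Hypothetical field space: `owner` holds the color of each element.
fspace particle {
  owner : int1d,
  x     : double,
}

task make_partitions(r : region(ispace(int1d), particle))
where reads(r.owner) do
  var colors = ispace(int1d, 4)
  -- Equal partition: sub-region bounds come from index arithmetic alone.
  var p_equal = partition(equal, r, colors)
  -- Partition by field: the runtime has to read r.owner to place every
  -- element, which is why it is the heavier of the two operations.
  var p_field = partition(r.owner, colors)
end

task main()
  var r = region(ispace(int1d, 1000), particle)
  for e in r.ispace do
    r[e].owner = e % 4 -- assign each element one of the 4 colors
  end
  make_partitions(r)
end
regentlib.start(main)
```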

> Shouldn't Future start = runtime->get_current_time_in_microseconds(ctx); come right after the first fence?

The current code I'm running uses the pattern fence, start timer, fence, code, fence, end timer, to make the timing as accurate as I can work out how to get it.
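A minimal sketch of that fencing pattern in Regent, where `do_step` and the iteration count are hypothetical stand-ins for the real timestep work:

```
import "regent"

local c = regentlib.c

task do_step()
  -- stand-in for the real per-timestep work
end

task main()
  -- Block until all prior work has drained so the start time is meaningful.
  __fence(__execution, __block)
  var t_start = c.legion_get_current_time_in_micros()
  __fence(__execution, __block)

  for i = 0, 1000 do
    do_step()
  end

  -- Block again so the end time covers all of the timed work.
  __fence(__execution, __block)
  var t_end = c.legion_get_current_time_in_micros()
  c.printf("elapsed: %llu us\n", t_end - t_start)
end
regentlib.start(main)
```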

Ok, breaking down the trace, this is 15s->20s in (the overall timed part of the run was around 45s): [profiler screenshot]

I zoomed in and all the tasks in the CPU channel appeared to be prof tasks - I tried to expand the util section of the trace but it wouldn't display.

Another trace, from 35s to the end: [profiler screenshot] The block of work tasks is the 1000 self tasks, which each take 1.5-2.5ms. Before those are the timestep tasks, which should be lighter versions of those tasks (0.15ms) - those I expect to be a bit light, but

Looking at the util stuff here as it loads: [profiler screenshot]

The giant blocks of ~50000 tasks are Defer Task Perform Mapping tasks (and the occasional Defer Task Launch), each with a runtime of ~20us: [profiler screenshot]

manopapad commented 4 years ago

> For both the real and test codes the partition happens once (for the real code, Regent's type checking prevents a partition from being reassigned to the same variable).

If the partitioning only happens once per app execution then a difference of a few seconds shouldn't be a big deal.

We will need to see the full profiles rather than screenshots to be able to give an accurate diagnosis. It is surprising to see the application processors sitting idle while the runtime is working; normally the application and runtime should be working concurrently to some extent. Your application might also benefit from enabling tracing.
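In Regent, a repeated, identical pattern of task launches (such as a main timestep loop) can be traced by annotating the loop; `do_step` and `num_steps` below are hypothetical stand-ins for the application's own loop, and this is only a sketch of the idea:

```
import "regent"

task do_step(t : int)
  -- hypothetical stand-in for the real per-timestep task(s)
end

task main()
  var num_steps = 100
  -- Tracing memoizes the runtime analysis of a repeated launch pattern,
  -- so the loop body must issue the same tasks every iteration.
  __demand(__trace)
  for t = 0, num_steps do
    do_step(t)
  end
end
regentlib.start(main)
```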

Popping up from this discussion, it sounds like we have veered off from the original topic of this bug report. It would be best to continue this discussion on slack or move it to an appropriate issue, and close this issue if the original problem has been fixed.

LonelyCat124 commented 4 years ago

Yeah - I think we have. I think the original bug was fixed (or at least there's a patch in this thread that partly resolves it), but a second bug (a crash due to memory leaks) was also reported in this thread - that bug can still be reproduced at commit e2aac08 of https://github.com/stfc/RegentParticleDSL

Instructions to reproduce are at https://github.com/StanfordLegion/legion/issues/929#issuecomment-698252787