StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Hang when assigning multiple points per shard #1257

Closed bandokihiro closed 2 years ago

bandokihiro commented 2 years ago

I have a case on 2 nodes with index launches of size 4. My sharding functor is set up such that points 0 and 1 go to shard 0, and points 2 and 3 go to shard 1. With -lg:partcheck, I get a hang with the following backtrace: gdb_1.txt

Without -lg:partcheck, it hangs later in the code with the following backtrace: gdb_2.txt

This case succeeds without errors with 1 shard and -lg:partcheck.
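
For reference, a minimal sketch of a sharding functor implementing this blocked point-to-shard mapping (an assumed illustration, not the actual application code; the class name and MY_SHARDING_ID are made up):

// Sketch of a blocked sharding functor: points 0 and 1 -> shard 0, points 2
// and 3 -> shard 1 for a 1D launch domain of size 4 on 2 shards. The class
// name and MY_SHARDING_ID are illustrative.
class BlockedShardingFunctor : public Legion::ShardingFunctor {
public:
    virtual Legion::ShardID shard(const Legion::DomainPoint &point,
                                  const Legion::Domain &full_space,
                                  const size_t total_shards)
    {
        // Assign contiguous blocks of points to consecutive shards
        const size_t total_points = full_space.get_volume();
        const size_t points_per_shard =
            (total_points + total_shards - 1) / total_shards;
        return (Legion::ShardID)(point[0] / points_per_shard);
    }
};

// Registered once before the runtime starts, e.g.:
// Legion::Runtime::preregister_sharding_functor(MY_SHARDING_ID,
//                                               new BlockedShardingFunctor());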

lightsighter commented 2 years ago

Run with -lg:inorder -lg:safe_ctrlrepl 1 on sapling and send me the frozen process numbers.

bandokihiro commented 2 years ago

38661 on g0002 and 3291643 on g0003 (there are some residual processes on the latter)

lightsighter commented 2 years ago

How are you making the second partition of the iface_2D index space?

bandokihiro commented 2 years ago

I think you are referring to this one

std::map<DomainPoint, TaskArgument> data; // only one entry will be given
const DomainPoint p(runtime->get_shard_id(ctx, true));
if (p[0] < 2) {
    data.insert(
        std::pair<DomainPoint, TaskArgument>(
            p, TaskArgument(&domains[p], sizeof(Domain))));
}
const FutureMap fm = runtime->construct_future_map(
    ctx, priv_vs_shar_color_space_name, data,
    true /*collective*/, 0, true /*implicit sharding functor*/);
const IndexPartition new_ip = runtime->create_partition_by_domain(
    ctx, is, fm, priv_vs_shar_color_space_name, true, kinds[i]);
priv_vs_shared_iface_2D = runtime->get_logical_partition(
    ctx, iface_2D_lr, new_ip);

This is a partition with 2 sub-regions. Each sub-region is sparse since faces are ordered like "priv(0) shar(0) priv(1) shar(1) ...". domains[0] is the (extended 2D) domain of private faces and domains[1] is the same for shared faces.

bandokihiro commented 2 years ago

I think this is an app bug. What you pointed out made me realize that some portions of the code implicitly assume that the color space has the same size as the number of shards, which is not the case. I'll fix this first. I would be interested, though, if there is a better way of doing this without calling get_shard_id; this is the only way I found to pass safe_ctrlrepl. The common pattern I have is that I want to extend a 1D region into a 2D one partitioned only across its first dimension.
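
To illustrate the extension, here is a sketch with made-up bounds in the style of the domains logged later in this thread (e.g. <1554,0>..<1592,2>); it assumes using namespace Legion;:

// Extend a 1D piece of the face index space into a 2D rectangle that is
// partitioned only along the first dimension; the bounds are illustrative.
const coord_t lo1d = 1554, hi1d = 1592;   // 1D range owned by this piece
const Rect<2> r2d(Point<2>(lo1d, 0),      // same range in dimension 0
                  Point<2>(hi1d, 2));     // full extent 0..2 in dimension 1
const Domain dom2d(r2d);                  // becomes one entry of the domains map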

bandokihiro commented 2 years ago

I have fixed my code, and I now get this

log_0.log:[0 - 7f4d18036000]    1.647351 {3}{DG}: point <0> domain <1554,0>..<1592,2> rects: <1554,0>..<1592,2>
log_0.log:[0 - 7f4d18036000]    1.647369 {3}{DG}: point <1> domain <3174,0>..<3174,2> rects: <3174,0>..<3174,2>
log_0.log:[0 - 7f4d18036000]    1.649195 {5}{runtime}: [error 71] LEGION ERROR: Call to partitioning function create_partition_by_domain in top_level_task (UID 6) specified partition was DISJOINT_KIND but the partition is aliased. (from file /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_context.cc:15042)
log_1.log:[1 - 7f5e9c0cd000]    1.646955 {3}{DG}: point <2> domain <292,0>..<3199,2>,30001000100000d3 rects: <292,0>..<300,2> <1593,0>..<1598,2> <3175,0>..<3199,2>
log_1.log:[1 - 7f5e9c0cd000]    1.646979 {3}{DG}: point <3> domain <301,0>..<1960,2>,30001000100000d4 rects: <301,0>..<308,2> <1599,0>..<1658,2> <1949,0>..<1960,2>

The code looks like this

for (unsigned j = 0; j < points_per_shard; j++) {
    const Point<1> p(shard_id * points_per_shard + j);
    const Domain dom = domains[p];
    stringstream msg;
    msg << "point " << p << " domain " << dom << " rects: ";
    for (RectInDomainIterator<2> it(dom); it.valid(); it.step()) {
        msg << *it << " ";
    }
    main_logger.print() << msg.str();
    data.insert(std::pair<DomainPoint, TaskArgument>(
        p, TaskArgument(&dom, sizeof(Domain))));
}
const FutureMap fm = runtime->construct_future_map(
    ctx, metis_part_is, data,
    true /*collective*/, 0, true /*implicit sharding functor*/);
const IndexPartition new_ip = runtime->create_partition_by_domain(
    ctx, is, fm, metis_part_is, true, kinds[i]);

where kinds[i] is DISJOINT_KIND, and I believe the logs confirm that this is the case. When I use only one process, I don't use future maps and call create_partition_by_domain directly, which succeeds without disjointness errors.

bandokihiro commented 2 years ago

I left hanging processes: 65852 on g0002 and 3310588 on g0003

lightsighter commented 2 years ago

There are definitely two index spaces in this partition that are overlapping (in fact they have the same rectangles), so this is an aliased partition:

(gdb) p $11->realm_index_space
$16 = {
  bounds = {
    lo = {
      x = 3174,
      y = 0
    },
    hi = {
      x = 3174,
      y = 2
    }
  },
  sparsity = {
    id = 0
  }
}
(gdb) p $13->realm_index_space
$17 = {
  bounds = {
    lo = {
      x = 3174,
      y = 0
    },
    hi = {
      x = 3174,
      y = 2
    }
  },
  sparsity = {
    id = 0
  }
}
(gdb) p $11->color
$18 = 0
(gdb) p $13->color
$19 = 1

this is the only way I found to pass safe_ctrlrepl.

Safe control replication checks know whether the construction of a future map is collective or not, and they will not check the values of the futures if you're doing a collective future map construction. What exactly is the safe control replication check you are failing? It's likely to be an actual bug.

bandokihiro commented 2 years ago

I use the future map pattern when I want to extend a 1D partition to a 2D/3D one, instead of the following:

runtime->create_partition_by_domain(ctx, is, domains, color_space_name, true, DISJOINT_COMPLETE_KIND);

where domains is a map<DomainPoint, Domain> and each Domain in that map is constructed from a vector of rectangles like

Domain subdom3d = DomainT<3>(rects);
domains[*color_itr] = subdom3d;

This is for when the subdomain is sparse. I think the issue was that each domain has its own sparsity ID on each shard and control replication was violated.
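
For reference, a sketch of this non-future-map pattern end to end, assuming every shard builds the same complete map (rects_per_color is an illustrative name; is and color_space_name are the names used in the call above; assumes using namespace Legion;):

// Per-color lists of 3D rectangles, computed elsewhere in the application
std::map<DomainPoint, std::vector<Rect<3>>> rects_per_color; // filled elsewhere

// Assemble each sparse subdomain from its rectangles, then partition by domain
std::map<DomainPoint, Domain> domains;
for (auto it = rects_per_color.begin(); it != rects_per_color.end(); it++) {
    domains[it->first] = DomainT<3>(it->second);
}
const IndexPartition new_ip = runtime->create_partition_by_domain(
    ctx, is, domains, color_space_name,
    true /*perform intersections*/, DISJOINT_COMPLETE_KIND);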

I reverted to this call (i.e. without future maps); safe control replication doesn't trigger an error but it hangs with the same kind of backtraces:

#1  0x000055de4482934a in Realm::Doorbell::wait_slow (this=0x7f4770511090) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.cc:316
#2  0x000055de445fbd7a in Realm::Doorbell::wait (this=0x7f4770511090) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.inl:81
#3  0x000055de4482a84c in Realm::FIFOCondVar::wait (this=0x7f4770501520) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.cc:1088
#4  0x000055de44766a94 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x55e32b4bbe60, switch_to=0x7f3c84033ae0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1403
#5  0x000055de44764a1b in Realm::ThreadedTaskScheduler::thread_blocking (this=0x55e32b4bbe60, thread=0x7f3c9001f9e0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:908
#6  0x000055de44611700 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7f4770501b4f: false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/threads.inl:218
#7  0x000055de445ff4da in Realm::Event::wait_faultaware (this=0x7f3cb007b778, poisoned=@0x7f4770501b4f: false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/event_impl.cc:254
#8  0x000055de445ff090 in Realm::Event::wait (this=0x7f3cb007b778) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/event_impl.cc:206
#9  0x000055de4349fc1a in Legion::Internal::LgEvent::wait (this=0x7f3cb007b778) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_types.h:2794
#10 0x000055de441f836f in Legion::Internal::BroadcastCollective::perform_collective_wait (this=0x7f3cb007b640, block=true) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_replication.cc:10592
#11 0x000055de437e1fcd in Legion::Internal::ValueBroadcast<bool>::get_value (this=0x7f3cb007b640, wait=true) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_replication.h:384
#12 0x000055de4379faee in Legion::Internal::IndexPartNode::compute_disjointness (this=0x7f3cb007c8b0, collective=0x7f3cb007b640, owner=false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.cc:10883
#13 0x000055de437a104f in Legion::Internal::IndexPartNode::handle_disjointness_computation (args=0x7f3cb007b380, forest=0x55e32b4ddd30) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.cc:11229
#14 0x000055de438cbc09 in Legion::Internal::Runtime::legion_runtime_task (args=0x7f3cb007b380, arglen=28, userdata=0x55e32d8e9360, userlen=8, p=...) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/runtime.cc:31173
#15 0x000055de446fb45a in Realm::LocalTaskProcessor::execute_task (this=0x55e32b4bbb20, func_id=4, task_args=...) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/proc_impl.cc:1135
#16 0x000055de4476283d in Realm::Task::execute_on_processor (this=0x7f3cb007b200, p=...) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:302
#17 0x000055de44766898 in Realm::KernelThreadTaskScheduler::execute_task (this=0x55e32b4bbe60, task=0x7f3cb007b200) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1355
#18 0x000055de4476561e in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55e32b4bbe60) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1094
#19 0x000055de44765c2d in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x55e32b4bbe60) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1206
#20 0x000055de4476ddc2 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x55e32b4bbe60) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/threads.inl:97
#21 0x000055de4477b09e in Realm::KernelThread::pthread_entry (data=0x7f3c9001f9e0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/threads.cc:774
#22 0x00007f4793500609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x00007f478a713293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
#1  0x000055de4482934a in Realm::Doorbell::wait_slow (this=0x7f477117d090) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.cc:316
#2  0x000055de445fbd7a in Realm::Doorbell::wait (this=0x7f477117d090) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.inl:81
#3  0x000055de4482a84c in Realm::FIFOCondVar::wait (this=0x7f477116b310) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.cc:1088
#4  0x000055de44766a94 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x55e32b4bc3c0, switch_to=0x7f3cb004e800) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1403
#5  0x000055de4476498b in Realm::ThreadedTaskScheduler::thread_blocking (this=0x55e32b4bc3c0, thread=0x55e32d8edd00) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:896
#6  0x000055de44611700 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7f477116b93f: false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/threads.inl:218
#7  0x000055de445ff4da in Realm::Event::wait_faultaware (this=0x7f3cb007d4c0, poisoned=@0x7f477116b93f: false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/event_impl.cc:254
#8  0x000055de445ff090 in Realm::Event::wait (this=0x7f3cb007d4c0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/event_impl.cc:206
#9  0x000055de4349fc1a in Legion::Internal::LgEvent::wait (this=0x7f3cb007d4c0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_types.h:2794
#10 0x000055de43a941a7 in Legion::Internal::IndexSpaceNodeT<2, long long>::get_realm_index_space (this=0x7f3cb007cfd0, result=..., need_tight_result=true) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.inl:2559
#11 0x000055de43bfaec9 in Legion::Internal::IndexSpaceNodeT<2, long long>::get_volume (this=0x7f3cb007cfd0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.inl:3195
#12 0x000055de43bf90c9 in Legion::Internal::IndexSpaceNodeT<2, long long>::check_empty (this=0x7f3cb007cfd0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.inl:2777
#13 0x000055de43537cce in Legion::Internal::IndexSpaceExpression::is_empty (this=0x7f3cb007d330) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.h:1347
#14 0x000055de4378e6e8 in Legion::Internal::RegionTreeForest::subtract_index_spaces (this=0x55e32b4ddd30, lhs=0x7f3cb007d330, rhs=0x7f3cb0079800, creator=0x0, mutator=0x0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.cc:6795
#15 0x000055de44112887 in Legion::Internal::ReplicateContext::verify_partition (this=0x7f3c94073c00, pid=..., kind=LEGION_DISJOINT_COMPLETE_KIND, function_name=0x55de4554d732 "create_partition_by_domain") at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_context.cc:14859
#16 0x000055de4410df21 in Legion::Internal::ReplicateContext::create_partition_by_domain (this=0x7f3c94073c00, parent=..., domains=..., color_space=..., perform_intersections=true, part_kind=LEGION_COMPUTE_KIND, color=4294967295, skip_check=true) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_context.cc:14093
#17 0x000055de4410d883 in Legion::Internal::ReplicateContext::create_partition_by_domain (this=0x7f3c94073c00, parent=..., domains=std::map with 2 elements = {...}, color_space=..., perform_intersections=true, part_kind=LEGION_DISJOINT_COMPLETE_KIND, color=4294967295) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_context.cc:14033
#18 0x000055de43493f97 in Legion::Runtime::create_partition_by_domain (this=0x55de48b27820, ctx=0x7f3c94073c00, parent=..., domains=std::map with 2 elements = {...}, color_space=..., perform_intersections=true, part_kind=LEGION_DISJOINT_COMPLETE_KIND, color=4294967295) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion.cc:4771

These backtraces were obtained for a simple one-sub-region-per-rank case, which didn't hang with the future map pattern.

bandokihiro commented 2 years ago

I left hanging process on g0001: 243559, 243561, 243562

lightsighter commented 2 years ago

This is for when the subdomain is sparse. I think the issue was that each domain has its own sparsity ID on each shard and control replication was violated.

As long as you set collective=true, that won't be the case. The runtime only checks the values of the domains if collective=false.

I left hanging process on g0001: 243559, 243561, 243562

They were gone when I checked tonight.

safe control replication doesn't trigger an error but it hangs with the same kind of backtraces

Assuming these are the same backtraces as before, the problem was that one of the colors in the color space was not being set by any shard at all.

bandokihiro commented 2 years ago

Thanks, I ended up removing the use of future maps. create_partition_by_domain doesn't trigger safe control replication checks anymore, even when every shard provides its own full map<DomainPoint,Domain> with different sparsity IDs. Returning to the initial issue, it no longer hangs at a partition check, but I think I have a new hang with the following backtrace:

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000056056255e3fa in Realm::Doorbell::wait_slow (this=0x7fc7779c8090) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.cc:316
#2  0x0000560562330e2a in Realm::Doorbell::wait (this=0x7fc7779c8090) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.inl:81
#3  0x000056056255f8fc in Realm::FIFOCondVar::wait (this=0x7fc7779b5e00) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/mutex.cc:1088
#4  0x000056056249bb44 in Realm::KernelThreadTaskScheduler::worker_sleep (this=0x560a4898f8a0, switch_to=0x7fc708003450) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1403
#5  0x0000560562499a3b in Realm::ThreadedTaskScheduler::thread_blocking (this=0x560a4898f8a0, thread=0x7fc710002960) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:896
#6  0x00005605623467b0 in Realm::Thread::wait_for_condition<Realm::EventTriggeredCondition> (cond=..., poisoned=@0x7fc7779b642f: false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/threads.inl:218
#7  0x000056056233458a in Realm::Event::wait_faultaware (this=0x7fc7779b6718, poisoned=@0x7fc7779b642f: false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/event_impl.cc:254
#8  0x0000560562334140 in Realm::Event::wait (this=0x7fc7779b6718) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/event_impl.cc:206
#9  0x00005605611d4c54 in Legion::Internal::LgEvent::wait (this=0x7fc7779b6718) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_types.h:2781
#10 0x00005605614b9b35 in Legion::Internal::RegionTreeForest::get_node (this=0x560a48a12010, part=..., defer=0x0, can_fail=false, first=true, local_only=false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.cc:4685
#11 0x00005605614cc393 in Legion::Internal::IndexSpaceNode::get_child (this=0x7fc72003e5f0, c=17, defer=0x0, can_fail=false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.cc:8930
#12 0x00005605614a8511 in Legion::Internal::RegionTreeForest::get_logical_partition_by_color (this=0x560a48a12010, parent=..., c=17) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/region_tree.cc:1557
#13 0x00005605615d129a in Legion::Internal::Runtime::get_logical_partition_by_color (this=0x560a48acd010, par=..., c=17) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/runtime.cc:18316
#14 0x00005605611cd183 in Legion::Runtime::get_logical_partition_by_color (this=0x560565ff8cd0, parent=..., c=17) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion.cc:5747
#15 0x0000560560d6e848 in PackingProjectionForDst::project (this=0x560565bfb720, upper_bound=..., point=..., launch_domain=...) at /home/bandokihiro/Builds_GPUNodes/DG-Legion/tasks/solution_tasks.h:58
#16 0x00005605615bb91c in Legion::Internal::ProjectionFunction::project_points (this=0x7fc744020a40, req=..., idx=2, runtime=0x560a48acd010, launch_domain=..., point_tasks=std::vector of length 6, capacity 6 = {...}) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/runtime.cc:15003
#17 0x0000560561395e25 in Legion::Internal::SliceTask::enumerate_points (this=0x7fc71c02d7d0, inline_task=false) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_tasks.cc:11248
#18 0x0000560561393b8a in Legion::Internal::SliceTask::map_and_launch (this=0x7fc71c02d7d0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_tasks.cc:10740
#19 0x000056056137dfc1 in Legion::Internal::MultiTask::trigger_mapping (this=0x7fc71c02d7d0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/legion_tasks.cc:5419
#20 0x0000560561600a6e in Legion::Internal::Runtime::legion_runtime_task (args=0x7fc71c0f6cf0, arglen=12, userdata=0x560a48997ed0, userlen=8, p=...) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/legion/runtime.cc:31098
#21 0x000056056243050a in Realm::LocalTaskProcessor::execute_task (this=0x560a4898f560, func_id=4, task_args=...) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/proc_impl.cc:1135
#22 0x00005605624978ed in Realm::Task::execute_on_processor (this=0x7fc71c0f6b70, p=...) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:302
#23 0x000056056249b948 in Realm::KernelThreadTaskScheduler::execute_task (this=0x560a4898f8a0, task=0x7fc71c0f6b70) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1355
#24 0x000056056249a6ce in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x560a4898f8a0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1094
#25 0x000056056249acdd in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x560a4898f8a0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/tasks.cc:1206
#26 0x00005605624a2e72 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x560a4898f8a0) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/threads.inl:97
#27 0x00005605624b014e in Realm::KernelThread::pthread_entry (data=0x7fc710002960) at /home/bandokihiro/Builds_GPUNodes/legion/runtime/realm/threads.cc:774
#28 0x00007fcb93d07609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#29 0x00007fcb8af1a293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This backtrace was obtained on g0003 with process 3339848. The other process is 197766 on g0004.

update: 126313 on g0002, 3377536 on g0003

bandokihiro commented 2 years ago

The projection functor looks like this

        Legion::LogicalRegion lr = runtime->get_logical_subregion_by_color(upper_bound, point[1]);
        Legion::LogicalPartition lp = runtime->get_logical_partition_by_color(
            lr, PID_LVL2_IFACE_BY_DEST);
        return runtime->get_logical_subregion_by_color(lp, point[0]);

For the point task with index point, it requests sub-region (point[1], point[0]) of a two-level partition.

lightsighter commented 2 years ago

This just looks like you are asking for a partition with a color that hasn't been made yet. Legion can't tell the difference between a partition that hasn't been made yet and one that will never be made because we allow users to manage their own color spaces for the names of partitions. In this particular case there is no partition with color 17 in this subspace. There are no semantic names anywhere in this index space tree, so I can't tell you the names of anything.

bandokihiro commented 2 years ago

There is definitely something I am misunderstanding. The following code in an index launch (to create a 2nd-level partition)

{
    stringstream msg;
    msg << "create_partition_by_domain at point " << task->index_point << " on " << sub_2d_is << " with color " << PID_LVL2_IFACE_BY_DEST << " with map ";
    for (auto it=domains.begin(); it!=domains.end(); it++) {
        msg << "(" << it->first << "," << it->second << ") ";
    }
    main_logger.print() << msg.str();
}

runtime->create_partition_by_domain(
    ctx, sub_2d_is, domains, args->color_space_name, true /*perform intersection*/,
    DISJOINT_COMPLETE_KIND, PID_LVL2_IFACE_BY_DEST);

{
    stringstream msg;
    std::set<Color> colors;
    runtime->get_index_space_partition_colors(ctx, sub_2d_is, colors);
    msg << "In index launch\tpoint:" << task->index_point << "; is:" << sub_2d_is << "; colors:";
    for (auto itr=colors.begin(); itr!=colors.end(); itr++) {
        msg << *itr << ",";
    }
    main_logger.print() << msg.str();
}

produces the following logging

bandokihiro@sapling:~/Runs/Vortex/03_TestAndCheck/2Nodes_Heterogeneous$ grep -i 'create_partition' logs/log_*
logs/log_0.log:[0 - 7f8a03bdb000]    2.086957 {3}{DG}: create_partition_by_domain at point (0) on IndexSpace(200,21) with color 17 with map ((0),<1,0>..<0,2>) ((1),<1,0>..<0,2>) ((2),<292,0>..<300,2>) ((3),<301,0>..<308,2>)
logs/log_0.log:[0 - 7f8a035a5000]    2.087618 {3}{DG}: create_partition_by_domain at point (1) on IndexSpace(197,21) with color 17 with map ((0),<1554,0>..<1592,2>) ((1),<1,0>..<0,2>) ((2),<1593,0>..<1598,2>) ((3),<1599,0>..<1658,2>)
logs/log_1.log:[1 - 7f7ced393000]    2.088001 {3}{DG}: create_partition_by_domain at point (3) on IndexSpace(199,21) with color 17 with map ((0),<1,0>..<0,2>) ((1),<3174,0>..<3174,2>) ((2),<3175,0>..<3199,2>) ((3),<1,0>..<0,2>)
logs/log_1.log:[1 - 7f7ced9c9000]    2.089698 {3}{DG}: create_partition_by_domain at point (2) on IndexSpace(202,21) with color 17 with map ((0),<1,0>..<0,2>) ((1),<1,0>..<0,2>) ((2),<1,0>..<0,2>) ((3),<1949,0>..<1960,2>)

bandokihiro@sapling:~/Runs/Vortex/03_TestAndCheck/2Nodes_Heterogeneous$ grep -i 'index launch' logs/log_*
logs/log_0.log:[0 - 7f8a03bdb000]    2.088508 {3}{DG}: In index launch  point:(0); is:IndexSpace(200,21); colors:17,
logs/log_0.log:[0 - 7f8a035a5000]    2.089806 {3}{DG}: In index launch  point:(1); is:IndexSpace(197,21); colors:
logs/log_1.log:[1 - 7f7ced393000]    2.090148 {3}{DG}: In index launch  point:(3); is:IndexSpace(199,21); colors:17,
logs/log_1.log:[1 - 7f7ced9c9000]    2.091549 {3}{DG}: In index launch  point:(2); is:IndexSpace(202,21); colors:

In the above, I expect all points to have color 17, and I don't understand why it is missing for points 1 and 2. This index launch is executed like

    runtime->execute_index_space(ctx, launcher).wait_all_results(true /*silence warnings*/);
    runtime->issue_execution_fence(ctx).get_void_result(true /*silence warnings*/);

Then, if I do the same kind of logging at the top-level, I get

logs/log_0.log:[0 - 7f00d0f6f000]    2.001604 {3}{DG}: In top-level     point:(0); ip:IndexPartition(68,21); is:IndexSpace(200,21); colors:17,
logs/log_0.log:[0 - 7f00d0f6f000]    2.001822 {3}{DG}: In top-level     point:(1); ip:IndexPartition(70,21); is:IndexSpace(197,21); colors:
logs/log_1.log:[1 - 7f25e1181000]    2.002301 {3}{DG}: In top-level     point:(0); ip:IndexPartition(68,21); is:IndexSpace(200,21); colors:17,
logs/log_1.log:[1 - 7f25e1181000]    2.002508 {3}{DG}: In top-level     point:(1); ip:IndexPartition(70,21); is:IndexSpace(197,21); colors:17,
logs/log_1.log:[1 - 7f25e1181000]    2.002650 {3}{DG}: In top-level     point:(2); ip:IndexPartition(53,21); is:IndexSpace(202,21); colors:
logs/log_1.log:[1 - 7f25e1181000]    2.002698 {3}{DG}: In top-level     point:(3); ip:IndexPartition(51,21); is:IndexSpace(199,21); colors:17,

The logging is inconsistent for point 1 and still empty for point 2. Then shard 0 hangs when trying to output the logging for point 2. Here I expect both shards to log exactly the same thing, with color 17 for all points.

Another thing that I can't explain is that points 0 and 1 in the index launch seem to run concurrently (because their log entries are interleaved), even though they both map to shard 0, which only has 1 OMP proc (and 0 CPU procs).

Are my expectations correct?

lightsighter commented 2 years ago

If you're creating partitions (or names for any Legion objects), then the only way to safely ensure that they are ready right now is to pass their names back through future values. Legion only tracks dependences through data types like futures and regions, so you most likely want to pass the names of those partitions back through futures to the parent task. If you want to do things more implicitly and call get_index_space_partition_colors, then you must first synchronize execution between the parent task and the child tasks in the index launch to ensure that each child task is done creating its partition.

With some effort Legion could implicitly start doing this synchronization for you, but it requires more logic and is not quite trivial to implement.
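
A sketch of the parent-side retrieval being suggested (illustrative: it assumes the point task's return type is IndexPartition, reuses launcher, color_space, and iface_2D_shar_lp from the top-level snippets later in this thread, and assumes using namespace Legion;):

// Retrieve each child's partition through the FutureMap; blocking on the
// result guarantees the child finished creating the partition before the
// parent uses its name.
FutureMap fm = runtime->execute_index_space(ctx, launcher);
for (Domain::DomainPointIterator itr(color_space); itr; itr++) {
    const IndexPartition ip = fm.get_result<IndexPartition>(*itr);
    const LogicalRegion lr =
        runtime->get_logical_subregion_by_color(ctx, iface_2D_shar_lp, *itr);
    const LogicalPartition lp = runtime->get_logical_partition(ctx, lr, ip);
    // ... lp is now safe to use in this shard ...
}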

bandokihiro commented 2 years ago

you most likely want to pass the names of those partitions back through futures to the parent task

In this index launch, I am creating 2 IndexPartitions via create_partition_by_domain. I tried to return both of them in a result struct, but the logging at the top-level didn't change. I still issue the tasks like

    runtime->execute_index_space(ctx, launcher).wait_all_results(true /*silence warnings*/);
    runtime->issue_execution_fence(ctx).get_void_result(true /*silence warnings*/);

then you must first synchronize execution between the parent task and the child tasks in the index launch

I thought issue_execution_fence(ctx).get_void_result() would handle such synchronization, but I guess the top-level context is a different one. Anyway, I tried putting that at the end of the index task as well and it didn't change the top-level logging. I wonder if you were referring to some other synchronization mechanism.

lightsighter commented 2 years ago

I'm still really struggling to understand the structure of your program. Please make a detailed Legion Spy log and then use the unique IDs from a Legion Spy (dataflow or event) graph generated with the -dez options to describe what is going wrong. Please upload the Legion Spy log as well.

bandokihiro commented 2 years ago

I run with 2 ranks and a color space of size 4. Points 0 and 1 should map to shard 0 and points 2 and 3 to shard 1. I put an execution fence right before the section that hangs.

I do an index launch of size 4. In each point task (UIDs 465, 467, 472, 474), I partition the associated sub-region of the first-level disjoint complete partition a second time. I do this first by partitioning the 1D sub-index space by field (UIDs 469, 471, 476, 478, I think), resulting in 4 second-level index partitions with dense sub-regions (color 17). I then extend these to the 2D and 3D index spaces using create_partition_by_domain, which creates a creation op and a pending partition op in each point task. I return the 2 index partitions created by create_partition_by_domain as a future. Before returning, I put an execution fence in that index launch context. Then, at the top level, I wait for all futures and again put an execution fence in the top-level context.

I expect to see in each shard, for each sub-index space of the first-level partition, the color of the second-level index partition (17). For each first-level sub-index space of the 2D index space, this partition by domain should look like this

logs/log_0.log:[0 - 7f37dd5a5000]    1.078538 {3}{DG}: create_partition_by_domain at point (0) on IndexSpace(122,15) with color 17 with map ((0),<1,0>..<0,2>) ((1),<759,0>..<761,2>) ((2),<762,0>..<764,2>) ((3),<765,0>..<807,2>)
logs/log_0.log:[0 - 7f37dcf6f000]    1.079367 {3}{DG}: create_partition_by_domain at point (1) on IndexSpace(119,15) with color 17 with map ((0),<1561,0>..<1602,2>) ((1),<1,0>..<0,2>) ((2),<1603,0>..<1639,2>) ((3),<1,0>..<0,2>)
logs/log_1.log:[1 - 7f3630d5d000]    1.081552 {3}{DG}: create_partition_by_domain at point (3) on IndexSpace(121,15) with color 17 with map ((0),<3199,0>..<3199,2>) ((1),<1,0>..<0,2>) ((2),<1,0>..<0,2>) ((3),<1,0>..<0,2>)
logs/log_1.log:[1 - 7f3630f6f000]    1.083107 {3}{DG}: create_partition_by_domain at point (2) on IndexSpace(124,15) with color 17 with map ((0),<2400,0>..<2401,2>) ((1),<1,0>..<0,2>) ((2),<1,0>..<0,2>) ((3),<2402,0>..<2439,2>)

At the top level, I get incomplete logging that is inconsistent between the shards

logs/log_0.log:[0 - 7f37dd393000]    1.089351 {3}{DG}: In top-level     point:(0); ip:IndexPartition(44,15); is:IndexSpace(122,15); colors:17,
logs/log_0.log:[0 - 7f37dd393000]    1.089505 {3}{DG}: In top-level     point:(1); ip:IndexPartition(46,15); is:IndexSpace(119,15); colors:
logs/log_1.log:[1 - 7f3631393000]    1.090113 {3}{DG}: In top-level     point:(0); ip:IndexPartition(44,15); is:IndexSpace(122,15); colors:17,
logs/log_1.log:[1 - 7f3631393000]    1.090322 {3}{DG}: In top-level     point:(1); ip:IndexPartition(46,15); is:IndexSpace(119,15); colors:17,
logs/log_1.log:[1 - 7f3631393000]    1.090473 {3}{DG}: In top-level     point:(2); ip:IndexPartition(37,15); is:IndexSpace(124,15); colors:
logs/log_1.log:[1 - 7f3631393000]    1.090518 {3}{DG}: In top-level     point:(3); ip:IndexPartition(35,15); is:IndexSpace(121,15); colors:17,

Shard 0 hangs when asking for logical partition 17 of the first-level sub-region of point 2, based on index space IndexSpace(124,15). logs.zip

lightsighter commented 2 years ago

Are you waiting on all futures in all shards? What happens when you run with -lg:inorder?

If possible make me a reproducer on sapling and I will try to figure out what is happening.

bandokihiro commented 2 years ago

Yes, I am waiting on all futures in all shards. It also hangs with -lg:inorder. 976144 on g0001 and 595710 on g0002 are hanging.

lightsighter commented 2 years ago

I need a way to run this one for myself. A debug binary is fine.

bandokihiro commented 2 years ago

To run

cd /scratch2/bandokihiro/Issue1257/2Nodes_Heterogeneous
sbatch runslurm.sh

To make changes (on a gpu node)

cd /scratch2/bandokihiro/Issue1257/legion
(...)
cd /scratch2/bandokihiro/Issue1257/legion/build
make -j install
cd /scratch2/bandokihiro/Issue1257/DG-Legion/build
make -j solver

lightsighter commented 2 years ago

Pull and try again.

bandokihiro commented 2 years ago

The behavior hasn't changed; the logging looks identical.

bandokihiro commented 2 years ago

As a side note, the one-point-per-shard configuration also hangs now with the changes introduced by fixsendnodes (this configuration works on 4b7f3f57).

lightsighter commented 2 years ago

Are you sure you pulled? Running the binary above does not reflect the changes in the most recent commit.

bandokihiro commented 2 years ago

I haven't updated that one; this is a separate install that you can modify.

lightsighter commented 2 years ago

Pull and try again; the rebuild instructions are not working, so I can't test it myself:

make[2]: *** No rule to make target `/usr/local/cuda-11.1/include/crt/common_functions.h', needed by `runtime/realm_kokkos_interop.cc.o'.  Stop.
make[1]: *** [runtime/CMakeFiles/RealmRuntime.dir/all] Error 2
make: *** [all] Error 2

bandokihiro commented 2 years ago

You need to be on a GPU node to rebuild; you can either ssh in or use something like this

srun --nodes 1 --ntasks 1 --cpus-per-task 40 --exclusive -p gpu --pty bash

It seems to be fixed, thanks. Debug runs don't hang anymore, and in a couple of release-mode runs the solution looks correct. Before closing, I have two questions.

I) My current code does the following:

  1. I return in the form of futures index partitions that are created in an index launch
  2. top-level waits with runtime->execute_index_space(ctx, launcher).wait_all_results(true /*silence warnings*/);

Is any of the above not necessary?

II) If I examine the logging at the top level, even with the 2 steps above, it is inconsistent across the two shards

bandokihiro@sapling:~/Runs/Vortex/03_TestAndCheck/2Nodes_Heterogeneous$ grep 'In top-level' logs/log_*
logs/log_0.log:[0 - 7f01a4041000]    1.008137 {3}{DG}: In top-level     point:(0); ip:IndexPartition(68,21); is:IndexSpace(200,21); colors:17,
logs/log_0.log:[0 - 7f01a4041000]    1.008348 {3}{DG}: In top-level     point:(1); ip:IndexPartition(70,21); is:IndexSpace(197,21); colors:
logs/log_0.log:[0 - 7f01a4041000]    1.008614 {3}{DG}: In top-level     point:(2); ip:IndexPartition(55,21); is:IndexSpace(202,21); colors:17,
logs/log_0.log:[0 - 7f01a4041000]    1.009150 {3}{DG}: In top-level     point:(3); ip:IndexPartition(51,21); is:IndexSpace(199,21); colors:17,
logs/log_1.log:[1 - 7f2f98041000]    1.009206 {3}{DG}: In top-level     point:(0); ip:IndexPartition(68,21); is:IndexSpace(200,21); colors:17,
logs/log_1.log:[1 - 7f2f98041000]    1.009441 {3}{DG}: In top-level     point:(1); ip:IndexPartition(70,21); is:IndexSpace(197,21); colors:17,
logs/log_1.log:[1 - 7f2f98041000]    1.009660 {3}{DG}: In top-level     point:(2); ip:IndexPartition(55,21); is:IndexSpace(202,21); colors:17,
logs/log_1.log:[1 - 7f2f98041000]    1.009721 {3}{DG}: In top-level     point:(3); ip:IndexPartition(51,21); is:IndexSpace(199,21); colors:17,

I am concerned that I am just getting lucky with these small cases, which complete without issues. The code looks like this

    // in top-level

    // I am waiting for all 2nd-level index partitions that are returned as futures, see issue 1257
    runtime->execute_index_space(ctx, launcher).wait_all_results(true /*silence warnings*/);

    main_logger.print() << "color_space = " << color_space;
    for (Domain::DomainPointIterator ilvl1(color_space); ilvl1; ilvl1++) {
        const LogicalRegion lr = runtime->get_logical_subregion_by_color(
            ctx, iface_2D_shar_lp, *ilvl1);
        const LogicalPartition lp = runtime->get_logical_partition_by_color(
            ctx, lr, PID_LVL2_IFACE_BY_DEST);

        std::set<Color> colors;
        runtime->get_index_space_partition_colors(ctx, lr.get_index_space(), colors);
        stringstream msg;
        msg << "In top-level\tpoint:" << *ilvl1 <<"; ip:" << lp.get_index_partition() << "; is:" << lr.get_index_space() << "; colors:";
        for (auto itr=colors.begin(); itr!=colors.end(); itr++) {
            msg << *itr << ",";
        }
        main_logger.print() << msg.str();
    }

lightsighter commented 2 years ago

You're not doing anything wrong. Reasoning about the visibility of mutations to the shape of the region tree under a deferred execution model is an under-specified part of the Legion programming model at the moment. The runtime does all sorts of analysis to ensure apparently sequential semantics with respect to the actual data in the logical regions, but there is currently no analogue of "region requirements" for tasks to declare that they are going to mutate the shape of the index space and region trees. In the case of get_index_space_partition_colors you're actually just sampling the state of that index space on one node at a given point in time, and potentially not yet observing new partitions made by subtasks from other shards. The runtime can "get away" with this for now because it's a sampling call.

If you had instead called get_index_partition(lr.get_index_space(), 17), then the runtime would actually have looked for the index partition with color 17; if it didn't find it, it would have sent a message to the owner node for the index space and then waited for the index partition with color 17 to show up, so the "right" thing would have happened. The sampling methods that just ask what is currently there are definitely "weak" at the moment. There is previous discussion of this issue in #915. It is a good topic for a Legion group meeting, as any changes to allow the runtime to give strong guarantees would require buy-in from a bunch of different users.
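
For concreteness, the two calls being contrasted, written as a sketch with the names from the snippets above (lr and PID_LVL2_IFACE_BY_DEST, which is color 17; assumes using namespace Legion;):

// Sampling call: reports only the partitions currently visible on this node,
// so a partition just created by a subtask on another shard may be missing.
std::set<Color> colors;
runtime->get_index_space_partition_colors(ctx, lr.get_index_space(), colors);

// Lookup by color: if color 17 is not known locally yet, the runtime asks the
// owner node of the index space and waits for that partition to appear.
const IndexPartition ip = runtime->get_index_partition(
    ctx, lr.get_index_space(), PID_LVL2_IFACE_BY_DEST);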

bandokihiro commented 2 years ago

Sounds good, thank you very much.