StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: Errors in replicating onto half the machine #1671

Closed rupanshusoi closed 3 months ago

rupanshusoi commented 3 months ago

I have a Regent program consisting of three control replicated tasks:

```
task orchestrator()
  toplevel0()
  toplevel1()
end
regentlib.start(orchestrator)
```

I want to replicate toplevel0 onto the first half of the machine, and toplevel1 onto the second half. I have modified replicate_task to do that:

Excerpt from `replicate_task`:

```cpp
if (!strcmp(task.get_task_name(), "toplevel0")) {
  output.target_processors.resize(total_nodes / 2, Processor::NO_PROC);
  for (size_t i = 0; i < remote_procs.size() / 2; i++) {
    auto it = &remote_procs[i];
    AddressSpace space = it->address_space();
    assert(space < output.target_processors.size());
    assert(!output.target_processors[space].exists());
    output.target_processors[space] = (*it);
  }
} else if (!strcmp(task.get_task_name(), "toplevel1")) {
  output.target_processors.resize(total_nodes / 2, Processor::NO_PROC);
  int ctr = 0;
  for (size_t i = remote_procs.size() / 2; i < remote_procs.size(); i++, ctr++) {
    auto it = &remote_procs[i];
    AddressSpace space = it->address_space();
    assert(space == i);
    // assert(space < output.target_processors.size());
    assert(!output.target_processors[space].exists());
    output.target_processors[ctr] = (*it);
  }
}
```

On 4 nodes, the mapping looks like what I want: toplevel0 is replicated onto nodes 0 and 1, and toplevel1 onto nodes 2 and 3. However, soon after, the run fails with:

/global/u2/r/rsoi/legion/runtime/legion/legion_replication.cc:9054: void Legion::Internal::ShardManager::distribute_explicit(Legion::Internal::SingleTask*, Legion::VariantID, std::vector<Realm::Processor>&, std::vector<unsigned int>&): Assertion `!local_shards.empty()' failed.

The behaviour on 2 nodes is different. The run completes without errors (even in debug mode), but the mapping is not what I want: both toplevel0 and toplevel1 are replicated onto node 0. Here is an excerpt of the log from node 0:

[0 - 7fb47d8a2000]    0.358617 {2}{mapper}: MAP_REPLICATE_TASK for toplevel1<8> @ /global/homes/r/rsoi/restart/replica/pennant.rg:1913    
[0 - 7fb47d8a2000]    0.358780 {2}{mapper}:   REPLICANT 0 -> 1d00010000000009
[0 - 7fb47d8a2000]    0.358976 {2}{mapper}: MAP_TASK for toplevel1<8> @ /global/homes/r/rsoi/restart/replica/pennant.rg:1913
[0 - 7fb47d8a2000]    0.359087 {2}{mapper}:   TARGET PROCS: 1d00000000000009
[0 - 7fb47d8a2000]    0.359191 {2}{mapper}:   CHOSEN INSTANCES:

Note how REPLICANT 0 for toplevel1 is supposed to be sent to node 1, but map_task for it is called on node 0 instead. The only map_task call on node 1 is for orchestrator, their parent task.

The other potentially interesting thing is that both toplevel* tasks have a single shard in both runs, and those shards are mapped to node 0 by my sharding functor. I don't understand the relationship between shards and replicants, so this may or may not be a problem.

Full logs below:

4 node run: 3-4.log 2-4.log 1-4.log 0-4.log

2 node run: 1-2.log 0-2.log

I'm on Legion 9310c7.

lightsighter commented 3 months ago

Do you have a backtrace? What about a small reproducer program?

lightsighter commented 3 months ago

> The run completes without errors (even in debug mode), but the mapping is not what I want:

This is always the fault of your mapper.

rupanshusoi commented 3 months ago

Here's the backtrace. I can try making a reproducer on Sapling tonight.

> This is always the fault of your mapper.

I don't understand how it's possible that map_replicate says that a replicant of the task is being sent to node 1, but there is no map_task call corresponding to that replicant on that node. What happens between map_replicate and map_task that could interfere like this?

lightsighter commented 3 months ago

If your call to map_replicate only makes one shard, that does not require control replication, and the runtime will ignore your request to control replicate the task and fall back to the normal mapping path. One shard is not replication.

rupanshusoi commented 3 months ago

Here is a reproducer. Please run test.sh; you might need to modify the location of regent.py.

lightsighter commented 3 months ago

I will try to work on this tomorrow. For now you can work around this by using select_tasks_to_map to move the task being replicated to one of the nodes where at least one shard will exist before replicate_task is called.
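A minimal sketch of what such a workaround might look like (illustrative only, not code from this issue), assuming a DefaultMapper subclass that carries the `remote_procs` / `total_nodes` state from the excerpt above plus a hypothetical `node_id` holding the local address space:

```cpp
// Sketch only: relocate toplevel1 to a node in the second half of the machine
// before it is mapped, so that replicate_task is later invoked on a node that
// will actually own one of its shards.
#include <cstring>
#include "legion.h"
#include "mappers/default_mapper.h"

using namespace Legion;
using namespace Legion::Mapping;

class HalfMachineMapper : public DefaultMapper {
public:
  HalfMachineMapper(MapperRuntime *rt, Machine machine, Processor local,
                    const std::vector<Processor> &procs,  // one CPU per node
                    size_t nodes, size_t node)
    : DefaultMapper(rt, machine, local), remote_procs(procs),
      total_nodes(nodes), node_id(node) { }

  void select_tasks_to_map(const MapperContext ctx,
                           const SelectMappingInput &input,
                           SelectMappingOutput &output) override
  {
    for (const Task *task : input.ready_tasks)
    {
      // Assumption mirroring the excerpt above: remote_procs is ordered by
      // address space, so remote_procs[total_nodes / 2] lives on node N/2.
      if (!strcmp(task->get_task_name(), "toplevel1") &&
          node_id < total_nodes / 2)
        output.relocate_tasks[task] = remote_procs[total_nodes / 2];
      else
        output.map_tasks.insert(task);
    }
  }

private:
  std::vector<Processor> remote_procs;
  size_t total_nodes, node_id;
};
```

The intent is that `relocate_tasks` moves the task to a processor on the target node without mapping it locally, so the later `replicate_task` call happens on a node that owns one of the shards.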

lightsighter commented 3 months ago

Is there a reason you sharded toplevel1 onto shard 0 (on node 0) of orchestrator instead of sharding it onto shard 1 (on node 1) of orchestrator? That will be much better for performance regardless of whether we fix this or not.

I'm leaning towards not fixing this. The runtime already has a strict separation between "moving" tasks and "mapping" tasks. You can only move tasks around between processors and nodes using select_tasks_to_map or the stealing mapper calls. However, once you decide to map a task on a node in select_tasks_to_map, you have to map it onto a local processor. If you try to pick a target processor in map_task that is remote from the local node, you will get an error. The same should be true for replicate_task, since it comes after you decide to map the task on the local node using select_tasks_to_map. I don't think it's unreasonable to expect one of the shards to be local to wherever the replication is being done, since ultimately we should be able to call map_task on one of those shards locally, which we would mandate anyway even if the task is not replicated. I do need to fix the error message, though, to actually say that is what is going wrong.

I'll note that this doesn't mean you have to get your sharding perfect (although that would be best for performance). You can still use select_tasks_to_map to move tasks around to the node where you want them to be replicated even if you sharded them to a different node. So you can still shard toplevel1 onto shard 0 on node 0. You just need to use select_tasks_to_map (or the stealing mapper calls) to move it remotely to node 1 before you decide to map/replicate it.

rupanshusoi commented 3 months ago

> Is there a reason you sharded toplevel1 onto shard 0 (on node 0) of orchestrator instead of sharding it onto shard 1 (on node 1) of orchestrator? That will be much better for performance regardless of whether we fix this or not.

No, it's just what my sharding functor was doing by default.

The only thing I care about is the mapping I want, i.e. mapping toplevel0 to the first half of the machine, and toplevel1 to the second. What is the most performant strategy to achieve that mapping?

If I have N nodes, should I shard toplevel1 to node 1 or node N/2 or something else? After sharding, can I continue to replicate toplevel0 to nodes 0..N/2 and toplevel1 to nodes N/2..N and expect good performance?

lightsighter commented 3 months ago

> If I have N nodes, should I shard toplevel1 to node 1 or node N/2 or something else? After sharding, can I continue to replicate toplevel0 to nodes 0..N/2 and toplevel1 to nodes N/2..N and expect good performance?

All you need to do is shard toplevel0 to a shard of orchestrator that is on nodes [0, N/2) and shard toplevel1 to a shard of orchestrator that is on nodes [N/2, N).
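For illustration, here is a minimal sketch of one way to set that up (not code from this issue): a sharding functor that pins a launch to a fixed shard, registered under two hypothetical IDs, plus a select_sharding_functor override that picks the functor by task name. The class names, the functor IDs, and the assumption that orchestrator's shards are laid out one per node in address-space order are all illustrative:

```cpp
// Sketch only: pin toplevel0 to a shard in the first half of orchestrator's
// shards and toplevel1 to a shard in the second half.
#include <cstring>
#include "legion.h"
#include "mappers/default_mapper.h"

using namespace Legion;
using namespace Legion::Mapping;

// Hypothetical sharding functor that sends every point to one fixed shard.
class PinnedShardingFunctor : public ShardingFunctor {
public:
  // fraction = 0.0 -> shard 0, fraction = 0.5 -> shard total_shards / 2, etc.
  explicit PinnedShardingFunctor(double f) : fraction(f) { }
  ShardID shard(const DomainPoint &point, const Domain &full_space,
                const size_t total_shards) override
  {
    return static_cast<ShardID>(fraction * total_shards);
  }
private:
  const double fraction;
};

// Hypothetical functor IDs; register before Runtime::start().
enum { FIRST_HALF_FUNCTOR = 1001, SECOND_HALF_FUNCTOR = 1002 };

static void register_sharding_functors()
{
  Runtime::preregister_sharding_functor(FIRST_HALF_FUNCTOR,
      new PinnedShardingFunctor(0.0));   // a shard of orchestrator in [0, N/2)
  Runtime::preregister_sharding_functor(SECOND_HALF_FUNCTOR,
      new PinnedShardingFunctor(0.5));   // a shard of orchestrator in [N/2, N)
}

// Hypothetical mapper subclass: choose the functor by task name so each
// toplevel task is owned by a shard of orchestrator in the desired half.
class HalfMachineShardingMapper : public DefaultMapper {
public:
  using DefaultMapper::DefaultMapper;
  void select_sharding_functor(const MapperContext ctx, const Task &task,
                               const SelectShardingFunctorInput &input,
                               SelectShardingFunctorOutput &output) override
  {
    if (!strcmp(task.get_task_name(), "toplevel0"))
      output.chosen_functor = FIRST_HALF_FUNCTOR;
    else if (!strcmp(task.get_task_name(), "toplevel1"))
      output.chosen_functor = SECOND_HALF_FUNCTOR;
    else
      DefaultMapper::select_sharding_functor(ctx, task, input, output);
  }
};
```

With sharding set up this way, replicate_task can then target the matching half of the machine, since at least one shard of each toplevel task is local to a node in that half.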

rupanshusoi commented 3 months ago

This is working for me, thanks. Feel free to close the issue if this is a "wontfix".

lightsighter commented 3 months ago

I added a better error message here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1200