Azure / msccl

Microsoft Collective Communication Library
MIT License

Question about synthesizing Allreduce #36

Open JASUEXIII opened 2 months ago

JASUEXIII commented 2 months ago

Hi, and thanks for the prompt response to my previous question. I'm currently trying to synthesize Allreduce for a custom topology (for example, a ring with 4 or 8 nodes), and I'm running into some strange problems. I wonder if you can help.

My code:

import os
from msccl.collectives import allgather, allreduce, reduce_scatter, reduce, alltoall
# Ring (my custom topology) and save_msccl_object are defined/imported elsewhere
topology = Ring(num_Nodes=4)
collective_allgather = allgather(topology.num_nodes())
collective_reduce_scatter = reduce_scatter(topology.num_nodes())
save_msccl_object(topology,'SG2260_topo_Ring4.json')
save_msccl_object(collective_allgather,'coll_allgather.json')
save_msccl_object(collective_reduce_scatter,'coll_reducescatter.json')
assert 0 == os.system('msccl solve pareto-optimal custom custom --topology-file SG2260_topo_Ring4.json --collective-file coll_allgather.json')
assert 0 == os.system('msccl solve pareto-optimal custom custom --topology-file SG2260_topo_Ring4.json --collective-file coll_reducescatter.json')
assert 0 == os.system('msccl compose allreduce ReduceScatter.n4-MYTP-steps2.rounds3.chunks2.msccl.json Allgather.n4-MYTP-steps2.rounds3.chunks2.msccl.json -o allreduce_ring4.json')

I also saved the collectives to JSON files to make debugging easier. The resulting allreduce JSON has a strange input and output map:

  "input_map": { "0": [0, 1], "1": [0, 1], "2": [0, 1], "3": [0, 1] },
  "output_map": { "0": [0, 1], "1": [0, 1], "2": [0, 1], "3": [0, 1] },
  "steps": [
    {
      "msccl_type": "step",
      "rounds": 1,
      "sends": [
        [0, 2, 1],
        [1, 2, 3],
        [2, 3, 0],
        [3, 3, 2],
        [4, 0, 3],
        [5, 0, 1],
        [6, 1, 0],
        [7, 1, 2]
      ]
    },
    {
      "msccl_type": "step",
      "rounds": 2,
      "sends": [
        [0, 1, 0],
        [0, 3, 0],
        [1, 1, 0],
        [1, 3, 0],
        [2, 0, 1],
        [2, 2, 1],
        [3, 0, 1],
        [3, 2, 1],
        [4, 1, 2],
        [4, 3, 2],
        [5, 1, 2],
        [5, 3, 2],
        [6, 0, 3],
        [6, 2, 3],
        [7, 0, 3],
        [7, 2, 3]
      ]
    },
    {
      "msccl_type": "step",
      "rounds": 2,
      "sends": [
        [0, 0, 1],
        [0, 0, 3],
        [1, 0, 1],
        [1, 0, 3],
        [2, 1, 0],
        [2, 1, 2],
        [3, 1, 0],
        [3, 1, 2],
        [4, 2, 1],
        [4, 2, 3],
        [5, 2, 1],
        [5, 2, 3],
        [6, 3, 0],
        [6, 3, 2],
        [7, 3, 0],
        [7, 3, 2]
      ]
    },
    {
      "msccl_type": "step",
      "rounds": 1,
      "sends": [
        [0, 1, 2],
        [1, 3, 2],
        [2, 0, 3],
        [3, 2, 3],
        [4, 3, 0],
        [5, 1, 0],
        [6, 0, 1],
        [7, 2, 1]
      ]
    }
  ],
  "collective": {
    "msccl_type": "collective",
    "name": "Allreduce(n=4)",
    "nodes": 4,
    "chunks": [
      { "msccl_type": "chunk", "pre": [0], "post": [0, 1, 2, 3], "addr": 0 },
      { "msccl_type": "chunk", "pre": [1], "post": [0, 1, 2, 3], "addr": 0 },
      { "msccl_type": "chunk", "pre": [2], "post": [0, 1, 2, 3], "addr": 0 },
      { "msccl_type": "chunk", "pre": [3], "post": [0, 1, 2, 3], "addr": 0 }

I'd like to understand how to read this output. The chunk IDs do not seem to match each other, and the input/output map does not look like a valid Allreduce solution. I'd really appreciate any help, and I'm happy to provide other trial logs if needed.
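For reference, here is a minimal sketch of one way to read the dump. It assumes each entry in "sends" is a [chunk, src, dst] triple and that the topology is a 4-node bidirectional ring; both are my assumptions based on the values above, not something confirmed by the msccl documentation. The helper names (is_ring_link, summarize, final_owner) are hypothetical and not part of msccl. Only the first two steps are copied in.

```python
from collections import Counter

# First two steps copied from the dump above. Each send is read as
# [chunk, src, dst]; this layout is an assumption, not confirmed msccl semantics.
steps = [
    {"rounds": 1, "sends": [[0, 2, 1], [1, 2, 3], [2, 3, 0], [3, 3, 2],
                            [4, 0, 3], [5, 0, 1], [6, 1, 0], [7, 1, 2]]},
    {"rounds": 2, "sends": [[0, 1, 0], [0, 3, 0], [1, 1, 0], [1, 3, 0],
                            [2, 0, 1], [2, 2, 1], [3, 0, 1], [3, 2, 1],
                            [4, 1, 2], [4, 3, 2], [5, 1, 2], [5, 3, 2],
                            [6, 0, 3], [6, 2, 3], [7, 0, 3], [7, 2, 3]]},
]

NUM_NODES = 4  # ring size used in this issue


def is_ring_link(src, dst, n=NUM_NODES):
    """True if src and dst are neighbours on an n-node bidirectional ring."""
    return (src - dst) % n in (1, n - 1)


def summarize(steps):
    """Per step: (every send uses a ring link, max chunks on one directed link)."""
    summary = []
    for step in steps:
        load = Counter((src, dst) for _chunk, src, dst in step["sends"])
        on_ring = all(is_ring_link(s, d) for s, d in load)
        summary.append((on_ring, max(load.values())))
    return summary


def final_owner(steps):
    """Map each chunk to the ranks that receive it in the last step given."""
    owners = {}
    for chunk, _src, dst in steps[-1]["sends"]:
        owners.setdefault(chunk, set()).add(dst)
    return owners


for i, (on_ring, max_load) in enumerate(summarize(steps)):
    print(f"step {i}: ring links only={on_ring}, "
          f"max chunks per link={max_load}, rounds={steps[i]['rounds']}")
print(final_owner(steps))
```

Under this reading, every send in the first two steps travels over a ring link, the largest number of chunks on any directed link equals the step's "rounds" value, and after the second step chunks 2r and 2r+1 arrive only at rank r, which is what I would expect from the reduce-scatter half. If that interpretation is right, the steps use 8 chunk IDs (2 per rank), which is where the mismatch with the collective's chunk list confuses me.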