facebookincubator / gloo

Collective communications library with various primitives for multi-machine training.

reduce_scatter on MacOS issue #365

Open ChengjieLi28 opened 1 year ago

ChengjieLi28 commented 1 year ago

Hi team,

I successfully compiled gloo on macOS by setting USE_LIBUV to ON, but when I test the reduce_scatter op, the process crashes with a core dump at runtime.

I use pybind11 to expose a Python interface. Here's the test code:

import multiprocessing as mp
import os
import shutil
import time

import numpy as np

# fileStore_path and system_name are module-level names defined elsewhere in the test file.

def worker_reduce_scatter(rank):
    from .. import xoscar_pygloo as xp

    if rank == 0:
        if os.path.exists(fileStore_path):
            shutil.rmtree(fileStore_path)
        os.makedirs(fileStore_path)
    else:
        time.sleep(0.5)

    context = xp.rendezvous.Context(rank, 3)

    if system_name == "Linux":
        attr = xp.transport.tcp.attr("localhost")
        dev = xp.transport.tcp.CreateDevice(attr)
    else:
        attr = xp.transport.uv.attr("localhost")
        dev = xp.transport.uv.CreateDevice(attr)

    fileStore = xp.rendezvous.FileStore(fileStore_path)
    store = xp.rendezvous.PrefixStore(str(3), fileStore)

    context.connectFullMesh(store, dev)

    sendbuf = np.array(
        [i + 1 for i in range(sum([j + 1 for j in range(3)]))], dtype=np.float32
    )
    print(f'Send buf: {sendbuf}')
    sendptr = sendbuf.ctypes.data

    recvbuf = np.zeros(2, dtype=np.float32)
    recvptr = recvbuf.ctypes.data
    recvElems = [2, 2, 2]

    data_size = (
        sendbuf.size if isinstance(sendbuf, np.ndarray) else sendbuf.numpy().size
    )
    print(f'Data size: {data_size}')
    datatype = xp.glooDataType_t.glooFloat32
    op = xp.ReduceOp.SUM

    xp.reduce_scatter(context, sendptr, recvptr, data_size, recvElems, datatype, op)

    print(f"rank {rank} sends {sendbuf}, receives {recvbuf}")

def test_reduce_scatter():
    process1 = mp.Process(target=worker_reduce_scatter, args=(0,))
    process1.start()
    process2 = mp.Process(target=worker_reduce_scatter, args=(1,))
    process2.start()
    process3 = mp.Process(target=worker_reduce_scatter, args=(2,))
    process3.start()

    process1.join()
    process2.join()
    process3.join()
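
For reference, here is a minimal pure-Python sketch of the result each rank would be expected to receive from the test above. It assumes the usual MPI-style reduce_scatter semantics (element-wise SUM across ranks, then consecutive slices handed out according to recvElems); `reference_reduce_scatter` is a hypothetical helper written for illustration, not part of gloo or xoscar_pygloo.

```python
def reference_reduce_scatter(send_buffers, recv_elems):
    """Pure-Python model of a SUM reduce_scatter (illustrative only).

    send_buffers: one equally sized list per rank.
    recv_elems: how many reduced elements each rank receives.
    Returns one output list per rank.
    """
    total = len(send_buffers[0])
    if any(len(buf) != total for buf in send_buffers):
        raise ValueError("all ranks must send the same number of elements")
    if sum(recv_elems) != total:
        raise ValueError("recv_elems must sum to the send buffer length")

    # Element-wise sum across ranks.
    reduced = [sum(column) for column in zip(*send_buffers)]

    # Hand out consecutive slices of the reduced vector, one per rank.
    outputs, offset = [], 0
    for count in recv_elems:
        outputs.append(reduced[offset:offset + count])
        offset += count
    return outputs


world_size = 3
# Same payload as the test: every rank sends [1, 2, 3, 4, 5, 6].
sendbuf = [float(i + 1) for i in range(6)]
expected = reference_reduce_scatter([sendbuf] * world_size, [2, 2, 2])
for rank, chunk in enumerate(expected):
    print(f"rank {rank} should receive {chunk}")
```

Under these assumptions, rank 0 should receive [3.0, 6.0], rank 1 [9.0, 12.0], and rank 2 [15.0, 18.0], which gives a baseline to compare against once the call no longer crashes.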

This test does not work on macOS, but it works on Linux.

May I ask why this happens? Thank you very much.
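
To narrow down a crash like this, it can help to see which signal killed each worker. The sketch below relies only on the documented behavior of `multiprocessing.Process.exitcode`, where a negative value -N means the child was terminated by signal N. The helper name `describe_exit` is hypothetical and was introduced here for illustration.

```python
import signal


def describe_exit(exitcode):
    """Turn a multiprocessing.Process.exitcode into a readable string."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited cleanly"
    if exitcode < 0:
        # A negative exit code -N means the child was killed by signal N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = f"signal {-exitcode}"
        return f"killed by {name}"
    return f"exited with status {exitcode}"
```

In the test, this could be called after each join, for example `print(describe_exit(process1.exitcode))`. Calling `faulthandler.enable()` (standard library) at the top of the worker function may also print the Python-level traceback when a fatal signal such as SIGSEGV arrives, which would show whether the crash happens inside the `reduce_scatter` call or earlier.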