conda-forge / mpich-feedstock

A conda-smithy repository for mpich.
BSD 3-Clause "New" or "Revised" License

Shared memory not working with spawned process #78

Closed · Retribution98 closed this issue 1 year ago

Retribution98 commented 1 year ago

Solution to issue cannot be found in the documentation.

Issue

Hello!

I'm trying to use shared memory with spawned processes, but I'm running into problems with mpich: the shared memory seen by the child processes is not the same memory that the master process writes to.

I have prepared a simple reproducer. Please save it as mpich_problem.py and run it like this:

mpiexec -n 1 python mpich_problem.py

It works with mpi4py on openmpi=4.1.5 but does not work with mpi4py on mpich:

import sys
import pickle
from mpi4py import MPI

##########################
# Process initialization #
##########################
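# Rank 0 of the initial singleton launch spawns three workers; both sides then
# merge the intercommunicator into a single intracommunicator in which the
# original (master) process keeps rank 0.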

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
parent_comm = MPI.Comm.Get_parent()

if rank == 0 and parent_comm == MPI.COMM_NULL and size == 1:
    nprocs_to_spawn = 3
    args = ["mpich_problem.py"]
    info = MPI.Info.Create()
    intercomm = MPI.COMM_SELF.Spawn(
        sys.executable,
        args,
        maxprocs=nprocs_to_spawn,
        info=info,
        root=rank,
    )
    comm = intercomm.Merge(high=False)

if parent_comm != MPI.COMM_NULL:
    comm = parent_comm.Merge(high=True)

rank = comm.Get_rank()
size = comm.Get_size()

################################
# Shared memory initialization #
################################

# I am using non-contiguous shared memory here, but that is not essential:
# the problem is the same with contiguous shared memory.
info = MPI.Info.Create()
info.Set("alloc_shared_noncontig", "true")
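# Collectively allocate a 5000-byte shared window on the merged communicator;
# Shared_query(0) returns a buffer backed by rank 0's segment, so bytes written
# there by rank 0 should be visible to every other rank.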
win = MPI.Win.Allocate_shared(5000, MPI.BYTE.size, comm=comm, info=info)
buf, itemsize = win.Shared_query(0)

###############################
# Inter-process communication #
###############################

# Rank #0 creates the data, serializes it, and then puts it into shared memory.
if rank == 0:
    data = list(range(1000))
    s_data = pickle.dumps(data)
    buf[: len(s_data)] = s_data
    for i in range(1, 4):
        comm.send(len(s_data), dest=i)
        comm.Send(s_data, dest=i)
# The other ranks get the length of the serialized data and try to deserialize it.
else:
    size = comm.recv(source=0)
    expected_buf = bytearray(size)
    comm.Recv(expected_buf, source=0)
    assert buf[:size] == expected_buf, "Shared buffer is not equal to expected buffer"

    obj = pickle.loads(buf[:size])
    print(f"{rank}: {len(obj)}")

##############
# Finish MPI #
##############

if not MPI.Is_finalized():
    MPI.Finalize()

Could you fix this problem?

Installed packages

# packages in environment at $USER_PATH/miniconda3/envs/mpi4py_mpich:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
mpi                       1.0                       mpich    conda-forge
mpi4py                    3.1.4            py39h32b9844_0    conda-forge
mpich                     4.0.3              h846660c_100    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
openssl                   3.1.0                h0b41bf4_0    conda-forge
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
python                    3.9.16          h2782a2a_0_cpython    conda-forge
python_abi                3.9                      3_cp39    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                67.6.1             pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

active environment : mpi4py_mpich
    active env location : $USER_PATH/miniconda3/envs/mpi4py_mpich
            shell level : 2
       user config file : $USER_PATH/.condarc
 populated config files : 
          conda version : 23.1.0
    conda-build version : not installed
         python version : 3.9.16.final.0
       virtual packages : __archspec=1=x86_64
                          __glibc=2.35=0
                          __linux=5.15.0=0
                          __unix=0=0
       base environment : $USER_PATH/miniconda3  (writable)
      conda av data dir : $USER_PATH/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : $USER_PATH/miniconda3/pkgs
                          $USER_PATH/.conda/pkgs
       envs directories : $USER_PATH/miniconda3/envs
                          $USER_PATH/.conda/envs
               platform : linux-64
             user-agent : conda/23.1.0 requests/2.28.1 CPython/3.9.16 Linux/5.15.0-67-generic ubuntu/22.04.1 glibc/2.35
                UID:GID : 1002:1002
             netrc file : None
           offline mode : False
dalcinl commented 1 year ago

This is an MPICH upstream issue, not something the conda-forge MPICH recipe maintainer can do anything about. In fact, I've reported and commented on it in pmodels/mpich#6100, although in the context of point-to-point performance.

@hzhou Just FYI. Looks like this matter keeps showing up.

YarShev commented 1 year ago

Is there an issue in the MPICH upstream other than https://github.com/pmodels/mpich/discussions/6100 we should keep an eye on?

dalcinl commented 1 year ago

@YarShev Not that I know of. Maybe you should open your own issue. While I believe pmodels/mpich#6100 is strongly related, the issue raised here could perhaps have a quicker resolution or an easy-to-implement fix (though I doubt it). Also note that, as discussed in pmodels/mpich#6100, this is not really an MPICH bug, but rather an unfortunate limitation of the current library design.
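
For reference, the failure can be isolated with a much smaller check than the full reproducer. Below is a minimal diagnostic sketch, assuming the same spawn-and-merge pattern as the reproducer above; the file name shared_check.py, the SENTINEL byte, and the window size are arbitrary choices for illustration. Save it and run it with mpiexec -n 1 python shared_check.py; each rank prints whether the byte stored by rank 0 is visible through the shared window.

import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD
parent_comm = MPI.Comm.Get_parent()

# Same spawn-and-merge pattern as in the reproducer above.
if parent_comm == MPI.COMM_NULL and comm.Get_size() == 1:
    intercomm = MPI.COMM_SELF.Spawn(
        sys.executable, ["shared_check.py"], maxprocs=3
    )
    comm = intercomm.Merge(high=False)
else:
    comm = parent_comm.Merge(high=True)

rank = comm.Get_rank()

# Allocate a tiny shared window on the merged communicator and map rank 0's segment.
win = MPI.Win.Allocate_shared(8, 1, comm=comm)
buf, _ = win.Shared_query(0)

# Rank 0 stores a sentinel byte; the fences delimit the access epoch.
SENTINEL = b"\xab"
win.Fence()
if rank == 0:
    buf[0:1] = SENTINEL
win.Fence()

# With working shared memory every rank should report True; with the behavior
# described in this issue the spawned ranks would be expected to report False.
print(f"rank {rank}: sees rank 0's write = {bytes(buf[0:1]) == SENTINEL}")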