dask / dask-mpi

Deploy Dask using MPI4Py

`dask-mpi` fails with wheel packaging #83

Open mahendrapaipuri opened 2 years ago

mahendrapaipuri commented 2 years ago

Using pip install dask-mpi

$ pip install dask-mpi
$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://172.16.66.109:36539
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.66.109:36297'
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Using python setup.py install

$ python setup.py install
$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://172.16.66.109:44933
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.66.109:42437'
distributed.diskutils - INFO - Found stale lock file and directory '/home/mpaipuri/downloads/dask-mpi/dask-worker-space/worker-6h2hf4i6', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.66.109:37893
distributed.worker - INFO -          Listening to:  tcp://172.16.66.109:37893
distributed.worker - INFO -          dashboard at:        172.16.66.109:45119
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.66.109:44933
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/mpaipuri/downloads/dask-mpi/dask-worker-space/worker-t48hj0dc
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.16.66.109:37893', name: rascil-worker-1, status: undefined, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.16.66.109:37893
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:  tcp://172.16.66.109:44933
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

What happened: Installing dask-mpi with wheel packaging fails, but it works normally with egg packaging. I tested it on two different systems and the same behaviour is observed on both.

What you expected to happen: To work with both packaging methods

Anything else we need to know?: The only difference between the two approaches is the generated dask-mpi command-line executable.
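
For anyone reproducing this, here is a quick stdlib-only snippet (a sketch, not part of the original report) that prints the generated dask-mpi wrapper script and the entry point it maps to, so the two installs can be compared directly:

# Print the console script that pip/setuptools generated and its declared
# entry point; shutil and importlib.metadata are stdlib, so this runs in
# either environment.
import shutil
from importlib.metadata import distribution

script = shutil.which('dask-mpi')
print(script)                                  # path of the generated executable
with open(script) as f:
    print(f.read())                            # contents of the wrapper script

print(distribution('dask-mpi').entry_points)   # declared console_scripts entry point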

mahendrapaipuri commented 2 years ago

@kmpaul Could you please look into it when you have time? Cheers!

kmpaul commented 2 years ago

@mahendrapaipuri: Yes. I've been very busy these days, and I am currently locked out of PyPI (due to a fiasco transferring my settings to a new phone, my 2FA is no longer synced), so I cannot take a look. From the looks of it, I may not have access to PyPI again for months, since they are notoriously slow in resetting people's 2FA. Perhaps @andersy005 or @jacobtomlinson could help you.

kmpaul commented 2 years ago

FYI: This does not seem to be reproducible on a Mac. I'll try this in a Linux Docker container to see whether I can reproduce the bug there.

@mahendrapaipuri: Can I ask how you install the dependencies for Dask-MPI? Are you installing Dask-Distributed with conda? How are you installing mpi4py? In short, could you describe the steps you take to set up your Python environment before the first step you describe above (pip install dask-mpi)?

mahendrapaipuri commented 2 years ago

Hello @kmpaul, thanks for taking a look. I created a bare conda environment and installed everything using pip, so the steps are:

conda create -n test python=3.8 -y
conda activate test
pip install "dask[complete]"
pip install mpi4py dask-mpi

That's pretty much how I created my environment. Please let me know if I am missing any more details!
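
A minimal mpi4py check along these lines (a sketch; check_mpi.py is just a hypothetical file name, run with mpirun -np 2 python check_mpi.py) can be used to rule out the MPI installation itself, independent of dask-mpi:

# check_mpi.py: each rank should print its rank and the communicator size.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print('rank %d of %d' % (comm.Get_rank(), comm.Get_size()))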

kmpaul commented 2 years ago

@mahendrapaipuri: Thanks! I'll see what I can uncover in a Docker container.

kmpaul commented 2 years ago

And just to be clear, you are using a system install of OpenMPI and not installing OpenMPI with Conda, correct?

mahendrapaipuri commented 2 years ago

Well, I have tested with both a system-installed OpenMPI and a Spack-installed OpenMPI, and both gave me the same errors. But no, I did not use a conda-built OpenMPI.

kmpaul commented 2 years ago

Ok. Then I'll try to diagnose the issue with system-installed OpenMPI (and possibly test with a Conda-installed OpenMPI, too).

kmpaul commented 2 years ago

I can confirm the bug exists on Debian in a Docker container. I'm investigating why the wheel and egg installs yield different behaviors.

mahendrapaipuri commented 2 years ago

@kmpaul, I have tested on CentOS 8 too and ended up with the same issue. I am not an expert in Python packaging, but what I noticed is that the only difference between the two approaches is the generated dask-mpi binary file.

kmpaul commented 2 years ago

I'll take a look at that.

FYI: I've noticed that with the pip install dask-mpi version (i.e., the one that fails), it works if you disable the use of Nannies. That is, if you change your CLI command to:

$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json --worker-class distributed.Worker

it starts up correctly.
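
To confirm the cluster is actually usable after that workaround, a quick client-side check against the shared scheduler file (a sketch, assuming the test.json written by the command above) is:

# Connect through the scheduler file written by dask-mpi and run a trivial task.
from dask.distributed import Client

client = Client(scheduler_file='test.json')
print(client)                                        # scheduler address and worker count
print(client.submit(lambda x: x + 1, 41).result())   # expect 42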

kmpaul commented 2 years ago

@mahendrapaipuri: Thanks for the tip! I can verify that regardless of how you install dask-mpi, if you use the binary created by the egg-install, it works. So, now to get into why the egg-installed entry point works and the wheel-installed entry point does not.

mahendrapaipuri commented 2 years ago

@kmpaul Precisely! I have noticed that too. I am quite curious why the binary from the egg install works and the one from the wheel does not. I ran out of ideas on how to debug it when I was digging into it. Thanks again for taking the time to look into it.

kmpaul commented 2 years ago

Ok. I've looked a little deeper and I've been able to simplify the egg-installed dask-mpi binary to take the following form (eliminating all unused try branches and unnecessary functions when the input is fixed):

#!/root/miniconda3/envs/test/bin/python
import re
import sys
from importlib.metadata import distribution

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    go = distribution('dask-mpi').entry_points[0].load()
    sys.exit(go())

This works, as expected. However, if you simply make `go` a globally defined symbol, like so:

#!/root/miniconda3/envs/test/bin/python
import re
import sys
from importlib.metadata import distribution
go = distribution('dask-mpi').entry_points[0].load()

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(go())

it now fails in the same way that the wheel-installed binary fails. This is not surprising since the wheel-installed binary looks like:

#!/root/miniconda3/envs/test/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from dask_mpi.cli import go
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(go())

So this has to do with scope: loading go at module scope (import time) fails, while loading it inside the __main__ block works. I'll dig a little further to see why.

kmpaul commented 2 years ago

As you might expect, if you change the wheel-installed binary to look like this:

#!/root/miniconda3/envs/test/bin/python
# -*- coding: utf-8 -*-
import re
import sys
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    from dask_mpi.cli import go
    sys.exit(go())

It works.

kmpaul commented 2 years ago

@mahendrapaipuri: I've spent the day looking into this, and I cannot figure it out. I don't know why one entry point script should work and the other not work. It is a mystery to me.
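
One thing that might narrow it down (a sketch of a probe, not something I have run) is to list which packages the module-level from dask_mpi.cli import go pulls in before __main__ ever runs; given that the failure disappears when Nannies are disabled, an import that touches MPI or distributed too early is a plausible culprit:

# probe.py (hypothetical file name): report what the eager import drags in,
# mirroring the wheel-generated entry point.
import sys

before = set(sys.modules)
from dask_mpi.cli import go  # module-level import, exactly as in the wheel script
after = set(sys.modules)

top_level = sorted({name.split('.')[0] for name in after - before})
print(top_level)   # look for mpi4py / distributed showing up here
print(go)          # the loaded entry point itself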

mahendrapaipuri commented 2 years ago

@kmpaul Thanks a lot for looking into it. It is sort of a mystery to me as well. I hope someone else can figure it out.

kmpaul commented 2 years ago

I have another idea, and if I have time today, I am going to look into it. I'll let you know.