Open mahendrapaipuri opened 2 years ago
@kmpaul Could you please look into it when you have time? Cheers!
@mahendrapaipuri: Yes. I've been very busy these days, and I am currently locked out of PyPI (due to a fiasco transferring my settings to a new phone, my 2FA is no longer synced), so I cannot take a look. From the looks of it, I may not have access to PyPI again for months, since they are notoriously slow in resetting people's 2FA. Perhaps @andersy005 or @jacobtomlinson could help you.
FYI: This does not seem to be reproducible on a Mac. I'll look into trying this in a linux Docker container to try reproducing the bug.
@mahendrapaipuri: Can I ask you how you install the dependencies for Dask-MPI? Are you installing Dask-Distributed with conda? How are you installing mpi4py? In total, describe to me the steps that come before the first step you describe above (pip install dask-mpi) to set up your Python environment.
Hello @kmpaul Thanks for taking a look. I created a bare conda environment and installed everything using pip. So the steps are
conda create -n test python=3.8 -y
conda activate test
pip install "dask[complete]"
pip install mpi4py dask-mpi
That's pretty much how I created my environment. Please let me know if I am missing any more details!
@mahendrapaipuri: Thanks! I'll see what I can uncover in a Docker container.
And just to be clear, you are using a system install of OpenMPI and not installing OpenMPI with Conda, correct?
Well, I have tested using system-installed OpenMPI and Spack-installed OpenMPI. Both gave me the same errors. But no, I did not use conda-built OpenMPI.
Ok. Then I'll try to diagnose the issue with system-installed OpenMPI (and possibly test with a Conda-installed OpenMPI, too).
I can confirm the bug exists on Debian in a Docker container. I'm investigating why the wheel and egg installs yield different behaviors.
@kmpaul, I have tested on CentOS 8 too and ended up with the same issue. I am not an expert in Python packaging, but what I noticed is that the only difference between the two approaches is the generated dask-mpi binary file.
I'll take a look at that.
FYI: I've noticed that with the pip install dask-mpi version (i.e., the one that fails), it works if you disable the use of Nannies. That is, if you change your CLI command to:
$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json --worker-class distributed.Worker
it starts up correctly.
@mahendrapaipuri: Thanks for the tip! I can verify that regardless of how you install dask-mpi, if you use the binary created by the egg-install, it works. So, now to get into why the egg-installed entry point works and the wheel-installed entry point does not.
@kmpaul Precisely!! I have noticed that too. I am quite curious why binary from egg-install works and not the one from wheel. I ran out of ideas on how to debug it when I was digging into it!! Thanks again for taking time and looking into it.
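For anyone following along, here is a minimal sketch for checking which console-script entry point an installed distribution declares, independent of how the launcher script was generated. The helper name `console_scripts` is hypothetical and not part of dask-mpi; it only uses the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, distribution


def console_scripts(dist_name):
    """Return (script name, target) pairs declared by an installed distribution.

    Hypothetical helper for illustration; returns an empty list when the
    distribution is not installed in the current environment.
    """
    try:
        dist = distribution(dist_name)
    except PackageNotFoundError:
        return []
    # Keep only the entries that pip/setuptools turn into launcher scripts.
    return [
        (ep.name, ep.value)
        for ep in dist.entry_points
        if ep.group == "console_scripts"
    ]


if __name__ == "__main__":
    # Prints nothing if dask-mpi is not installed in this environment.
    for name, target in console_scripts("dask-mpi"):
        print(f"{name} -> {target}")
```

Both install methods should declare the same target here; what differs is only the launcher script that gets written into the environment's bin directory.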
Ok. I've looked a little deeper and I've been able to simplify the egg-installed dask-mpi binary to take the following form (eliminating all unused try branches and unnecessary functions when the input is fixed):
#!/root/miniconda3/envs/test/bin/python
import re
import sys
from importlib.metadata import distribution

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    go = distribution('dask-mpi').entry_points[0].load()
    sys.exit(go())
This works, as expected. However, if you simply make go a globally defined symbol, like so:
#!/root/miniconda3/envs/test/bin/python
import re
import sys
from importlib.metadata import distribution

go = distribution('dask-mpi').entry_points[0].load()

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(go())
it now fails in the same way that the wheel-installed binary fails. This is not surprising since the wheel-installed binary looks like:
#!/root/miniconda3/envs/test/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from dask_mpi.cli import go

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(go())
So, this has to do with scope. I'll dig a little further to see why.
As you might expect, if you change the wheel-installed binary to look like this:
#!/root/miniconda3/envs/test/bin/python
# -*- coding: utf-8 -*-
import re
import sys

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    from dask_mpi.cli import go
    sys.exit(go())
It works.
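One general Python behavior that may be relevant to the scope difference, offered as a sketch and not as a confirmed diagnosis of this bug: when a child process is started with the multiprocessing "spawn" method, the parent's main script is executed again in the child under the name `__mp_main__`. Module-level statements therefore run a second time in the child, while anything under the `if __name__ == '__main__':` guard does not. Whether the Nanny-started workers here actually go through this path is an assumption; the demo below uses only the standard library, and the file name `entry_point_demo.py` is made up for illustration.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# A stand-in for an entry point script. It has one module-level statement
# and one statement behind the __main__ guard.
SCRIPT = textwrap.dedent("""
    import multiprocessing as mp

    # Module-level: runs whenever this file is executed or re-executed.
    print(f"top-level code ran with __name__ = {__name__}")

    def child():
        pass

    if __name__ == "__main__":
        # Guarded: runs only in the process that was launched directly.
        print("guarded code ran")
        mp.set_start_method("spawn")
        p = mp.Process(target=child)
        p.start()
        p.join()
""")


def run_demo():
    """Run the stand-in script in a fresh interpreter and return its output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "entry_point_demo.py"
        path.write_text(SCRIPT)
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

In this sketch the module-level line is printed by both the parent and the spawned child, while the guarded line is printed once. If something similar happens with the wheel-generated script, the module-level `from dask_mpi.cli import go` would also be executed in each spawned child, which the guarded import avoids.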
@mahendrapaipuri: I've spent the day looking into this, and I cannot figure it out. I don't know why one entry point script should work and the other not work. It is a mystery to me.
@kmpaul Thanks a lot for looking into it. To me as well, sort of mystery. Hope someone else can figure it out.
I have another idea, and if I have time today, I am going to look into it. I'll let you know.
Using pip install dask-mpi
Using python setup.py install
What happened: Installing dask-mpi with wheel packaging fails, but it works normally with egg packaging. Tested on 2 different systems and the same behaviour is observed.
What you expected to happen: To work with both packaging methods.
Anything else we need to know?: The only difference between the two approaches is the generated dask-mpi command line executable.