Closed: @chengzhuzhang closed this issue 3 years ago
@chengzhuzhang, could you check whether you see the same error after adding `mpi4py` to your standalone environment? I think we saw MPI problems in the past in environments that had `mpi4py` from conda-forge. If it's not that, it must be the MPI version of some other package that's causing trouble.
```
$ source load_latest_e3sm_unified.sh
$ conda list | grep mpi
compass           0.1.8    nompi_py_h6eb0c47_100    e3sm
e3sm-unified      1.3.1.1  nompi_py37h6eb0c47_0     e3sm
esmf              8.0.1    nompi_hbeb3ca6_0         conda-forge
esmpy             8.0.1    nompi_py37h777d1d2_0     conda-forge
hdf5              1.10.6   nompi_h3c11f04_100       conda-forge
libnetcdf         4.7.4    nompi_h84807e1_105       conda-forge
mpi               1.0      openmpi                  conda-forge
mpi4py            3.0.3    py37hbfacf26_1           conda-forge
netcdf-fortran    4.5.3    nompi_hfef6a68_100       conda-forge
netcdf4           1.5.3    nompi_py37hdc49583_105   conda-forge
openmpi           4.0.4    hdf1f1ad_0               conda-forge
```
Another possibility that occurs to me is that the `nompi` version of `esmf` might not work for you if you're trying to run it with MPI.
I have to say, I've had no luck in general running MPI versions of conda packages on Cori nodes. If you're able to figure out what package is causing the problem I'm happy to help debug.
One more thing to try might be to see if `load_latest_e3sm_unified_mpich.sh` works any better. I think CDAT didn't like mpich, though, so maybe that's a bad option.
Finally, I don't install it but I build an OpenMPI version of E3SM-Unified. You could try installing that version yourself and see if it makes a difference.
My guess is that the MPI variants of E3SM-Unified probably won't help. But it doesn't hurt to try.
Thank you, Xylar!
Hi @forsyth2, I encountered this issue while making the e3sm_diags tutorial, and will spend more time getting the tutorial done. Would you please investigate this issue following Xylar's instructions? Thank you.
@chengzhuzhang I am able to reproduce the error with:
```
salloc --nodes=1 --partition=debug --time=00:30:00 -C haswell
source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh
cd tests/system
python all_sets.py -d all_sets.cfg
```
I tried `conda install mpi4py`, but I get:

```
EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /global/cfs/cdirs/e3sm/software/anaconda_envs/base/envs/e3sm_unified_1.3.1.1
```
I tried:
```
source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified_mpich.sh
cd tests/system
python all_sets.py -d all_sets.cfg
```
This appears to be successful.
Thank you, @forsyth2. Good news on the successful run with the mpich version. I don't think we can install packages in the unified env. I think what Xylar suggested is to install `mpi4py` into the standalone `e3sm_diags` env to see if that could cause trouble.
Yep, @forsyth2, I protect the e3sm-unified environments because they belong to everyone, and it causes trouble if anyone but me installs packages.
I'm creating a "dev" environment for `e3sm_diags` on my laptop with the following:

```
conda create -y -n e3sm_diags_env_dev -c cdat/label/v82 -c conda-forge -c defaults python=3.7 \
    "cdp>=1.6.0" "vcs>=8.2" "vtk-cdat=8.2.0.8.2" "vcsaddons>=8.2" "dv3d>=8.2" "cdms2>=3.1.4" \
    "cdutil>=8.2" "genutil>=8.2" "cdtime>=3.1.2" numpy matplotlib "cartopy>=0.18.0" beautifulsoup4 lxml
```
Here's what I'm seeing:
```
$ conda list -n e3sm_diags_env_dev | grep mpi
esmf              8.0.1    nompi_hbeb3ca6_0        conda-forge
esmpy             8.0.1    nompi_py37h777d1d2_0    conda-forge
hdf5              1.10.6   nompi_h3c11f04_101      conda-forge
libnetcdf         4.7.4    nompi_h84807e1_105      conda-forge
netcdf-fortran    4.5.3    nompi_hfef6a68_100      conda-forge
```
If I add `mpi4py`, like `e3sm-unified` has:

```
conda create -y -n e3sm_diags_env_dev -c cdat/label/v82 -c conda-forge -c defaults python=3.7 \
    "cdp>=1.6.0" "vcs>=8.2" "vtk-cdat=8.2.0.8.2" "vcsaddons>=8.2" "dv3d>=8.2" "cdms2>=3.1.4" \
    "cdutil>=8.2" "genutil>=8.2" "cdtime>=3.1.2" numpy matplotlib "cartopy>=0.18.0" beautifulsoup4 \
    lxml mpi4py
```
I see:
```
$ conda list -n e3sm_diags_env_dev | grep mpi
esmf              8.0.1    nompi_hbeb3ca6_0        conda-forge
esmpy             8.0.1    nompi_py37h777d1d2_0    conda-forge
hdf5              1.10.6   nompi_h3c11f04_101      conda-forge
libnetcdf         4.7.4    nompi_h84807e1_105      conda-forge
mpi               1.0      openmpi                 conda-forge
mpi4py            3.0.3    py37hbfacf26_1          conda-forge
netcdf-fortran    4.5.3    nompi_hfef6a68_100      conda-forge
openmpi           4.0.4    hdf1f1ad_0              conda-forge
```
If I instead force the mpich versions of various libraries:
```
conda create -y -n e3sm_diags_env_dev -c cdat/label/v82 -c conda-forge -c defaults python=3.7 \
    "cdp>=1.6.0" "vcs>=8.2" "vtk-cdat=8.2.0.8.2" "vcsaddons>=8.2" "dv3d>=8.2" "cdms2>=3.1.4" \
    "cdutil>=8.2" "genutil>=8.2" "cdtime>=3.1.2" numpy matplotlib "cartopy>=0.18.0" beautifulsoup4 \
    lxml mpi4py "libnetcdf=*=mpi_mpich_*" "esmf=*=mpi_mpich_*" "esmpy=*=mpi_mpich_*" \
    "hdf5=*=mpi_mpich_*"
```
I see:
```
$ conda list -n e3sm_diags_env_dev | grep mpi
esmf              8.0.1    mpi_mpich_h213fab7_100      conda-forge
esmpy             8.0.1    mpi_mpich_py37hef66020_100  conda-forge
hdf5              1.10.6   mpi_mpich_ha7d0aea_1        conda-forge
libnetcdf         4.7.4    mpi_mpich_hfd9c5b6_5        conda-forge
mpi               1.0      mpich                       conda-forge
mpi4py            3.0.3    py37h0c5ec45_1              conda-forge
mpich             3.3.2    hc856adb_0                  conda-forge
netcdf-fortran    4.5.3    mpi_mpich_h3923e1a_0        conda-forge
```
To explicitly control the build of a given package (`nompi`, `mpich`, or `openmpi`), you take advantage of the build string starting with `nompi_*`, `mpi_mpich_*`, or `mpi_openmpi_*` (see https://conda-forge.org/docs/maintainer/knowledge_base.html#message-passing-interface-mpi).
As you see above, the default behavior for most packages is to install the `nompi` version (however, `esmf` and `esmpy` favor the `mpich` version). The "default" version is determined by giving a package a higher build number (say, adding 100 to the build numbers of the other variants). The package solver tries to pick the highest possible build number for all packages that passes the constraints from each package.
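As an illustration of how those build strings encode the MPI variant and the "default by build number" trick, here is a small sketch (my own code, not conda's actual solver):

```python
def mpi_variant(build_string: str) -> str:
    """Classify a conda build string by its MPI variant prefix."""
    if build_string.startswith("nompi_"):
        return "nompi"
    if build_string.startswith("mpi_mpich_"):
        return "mpich"
    if build_string.startswith("mpi_openmpi_"):
        return "openmpi"
    return "unspecified"


def build_number(build_string: str) -> int:
    """The trailing '_<n>' of a conda build string is the build number."""
    return int(build_string.rsplit("_", 1)[-1])


# Build strings taken from the `conda list` output above:
print(mpi_variant("nompi_h84807e1_105"))    # libnetcdf -> nompi
print(mpi_variant("mpi_mpich_hfd9c5b6_5"))  # libnetcdf -> mpich
# The nompi build carries the higher build number (105 vs 5), so the solver
# prefers it unless a pin like "libnetcdf=*=mpi_mpich_*" forces otherwise:
print(build_number("nompi_h84807e1_105") > build_number("mpi_mpich_hfd9c5b6_5"))  # True
```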
The easiest thing to investigate is whether `openmpi` is the problem. To test this, we would want the `nompi` version of most packages (like you get by default), but with `mpi4py` and `mpich` instead of `mpi4py` with `openmpi`:
```
conda create -y -n e3sm_diags_env_dev -c cdat/label/v82 -c conda-forge -c defaults python=3.7 \
    "cdp>=1.6.0" "vcs>=8.2" "vtk-cdat=8.2.0.8.2" "vcsaddons>=8.2" "dv3d>=8.2" "cdms2>=3.1.4" \
    "cdutil>=8.2" "genutil>=8.2" "cdtime>=3.1.2" numpy matplotlib "cartopy>=0.18.0" beautifulsoup4 \
    lxml mpi4py mpich "libnetcdf=*=nompi_*" "esmf=*=nompi_*" "esmpy=*=nompi_*" "hdf5=*=nompi_*"
```
This results in:
```
$ conda list -n e3sm_diags_env_dev | grep mpi
esmf              8.0.1    nompi_hbeb3ca6_0        conda-forge
esmpy             8.0.1    nompi_py37h777d1d2_0    conda-forge
hdf5              1.10.6   nompi_h3c11f04_101      conda-forge
libnetcdf         4.7.4    nompi_h84807e1_105      conda-forge
mpi               1.0      mpich                   conda-forge
mpi4py            3.0.3    py37h0c5ec45_1          conda-forge
mpich             3.3.2    hc856adb_0              conda-forge
netcdf-fortran    4.5.3    nompi_hfef6a68_100      conda-forge
```
Could you see if that works, or if it produces the same error? If it works, we know that the problem is just `openmpi` vs. `mpich`.
My feeling is that you ultimately want some way of deciding for yourselves whether you want `cdms2` to use MPI or not. If you don't explicitly install `mpi4py` in your dev environment, `e3sm_diags` will run without MPI, as I understand it. `cdms2` checks whether to use MPI by checking if `mpi4py` can be imported:
https://github.com/CDAT/cdms/blob/master/Lib/tvariable.py#L26-L32
My feeling is that this is a lazy shorthand, and they should be checking whether the libraries they actually need are compatible with MPI. If we can figure out which one(s), I'm happy to create an issue for this on their repo.
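For reference, the pattern in the linked `tvariable.py` boils down to "MPI is on if and only if `mpi4py` can be imported". A minimal sketch of that heuristic (my own illustration, not the cdms2 source verbatim):

```python
import importlib.util


def cdms_style_mpi_check() -> bool:
    """Mimic cdms2's heuristic: treat MPI as enabled iff mpi4py is importable.

    cdms2 effectively does `try: from mpi4py import MPI` and sets a flag;
    probing for the module spec shows the same decision without importing MPI.
    """
    return importlib.util.find_spec("mpi4py") is not None


print(cdms_style_mpi_check())
```

Note that this decision says nothing about whether `libnetcdf`, `hdf5`, etc. were actually built with MPI, which is exactly the mismatch being debugged here.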
In `e3sm-unified`, we need `mpi4py` for the `ilamb` package even when we don't want MPI versions of other packages. This has the side effect that `cdms2` decides to use MPI whether we want it to or not. One suggestion would be to request that they add an environment variable that overrides the `mpi4py` check and disables MPI regardless. This could be set as part of activating the e3sm-unified environment without `mpich`.
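A rough sketch of what such an override could look like. The variable name `CDMS_NO_MPI` is hypothetical (no such variable exists in cdms2 today); this only illustrates how an activation script could disable the `mpi4py`-based autodetection:

```python
import importlib.util
import os


def use_mpi() -> bool:
    # Hypothetical opt-out: CDMS_NO_MPI is NOT a real cdms2 variable, it only
    # illustrates the suggested override. An explicit opt-out wins outright.
    if os.environ.get("CDMS_NO_MPI", "").lower() in ("1", "true", "yes"):
        return False
    # Otherwise fall back to cdms2's current heuristic:
    # mpi4py importable => use MPI.
    return importlib.util.find_spec("mpi4py") is not None


os.environ["CDMS_NO_MPI"] = "1"
print(use_mpi())  # False, even if mpi4py is installed
```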
Or, if `mpich` is working fine (with either `mpich` or `nompi` versions of other packages like `libnetcdf` and `esmf`), we could explicitly make sure `mpich` instead of `openmpi` gets installed with the `nompi` variant of `e3sm-unified` in the future.
I don't have experience running `e3sm_diags`, but it seems pretty easy, so I would be happy to help with this debugging if you run into trouble.
@chengzhuzhang, if ironing this issue and the other plotting issues you've uncovered requires another "emergency" release of e3sm-unified, that's fine. If that's the case, let's try to make sure we do some thorough testing for the next "emergency" release so it's hopefully the last before next January.
@xylar @chengzhuzhang I created the 4 environments Xylar did, and my `conda list` outputs matched those.
I logged onto a Haswell node and activated the 4th environment.
`python all_sets.py -d all_sets.cfg` gives:
```
/global/homes/f/forsyth/.conda/envs/e3sm_diags_env_6/lib/python3.7/site-packages/unidata/__init__.py:2: UserWarning: unidata package is deprecated please use genutil.udunits instead of unidata.udunits
  warnings.warn("unidata package is deprecated please use genutil.udunits instead of unidata.udunits")
[]
[]
[]
[]
You have no value for ref_names. Caculate test data only
Saved environment yml file to: all_sets_results_test/prov/environment.yml
Saved command used to: all_sets_results_test/prov/cmd_used.txt
Saved cfg file to: all_sets_results_test/prov/all_sets.cfg
Saved Python script to: all_sets_results_test/prov/all_sets.py
Variable: T
Selected pressure level: [200.0]
Plot saved in: all_sets_results_test/zonal_mean_xy/ERA-Interim/ERA-Interim-T-200-ANN-global.png
CDMS system error: No such file or directory
CDMS I/O error: Opening file /global/homes/f/forsyth/.conda/envs/e3sm_diags_env_6/share/e3sm_diags/acme_ne30_ocean_land_mask.nc
Error in acme_diags.driver.zonal_mean_2d_driver
Traceback (most recent call last):
  File "/global/homes/f/forsyth/.local/lib/python3.7/site-packages/acme_diags/driver/zonal_mean_2d_driver.py", line 82, in run_diag
    land_frac = test_data.get_climo_variable('LANDFRAC', season)
  File "/global/homes/f/forsyth/.local/lib/python3.7/site-packages/acme_diags/driver/utils/dataset.py", line 144, in get_climo_variable
    variables = self._get_climo_var(filename, *args, **kwargs)
  File "/global/homes/f/forsyth/.local/lib/python3.7/site-packages/acme_diags/driver/utils/dataset.py", line 337, in _get_climo_var
    raise RuntimeError(msg)
RuntimeError: Variable 'LANDFRAC' was not in the file file:///global/u1/f/forsyth/e3sm_diags/tests/system/T_20161118.beta0.FC5COSP.ne30_ne30.edison_ANN_climo.nc, nor was it defined in the derived variables dictionary.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/homes/f/forsyth/.conda/envs/e3sm_diags_env_6/lib/python3.7/site-packages/cdms2/dataset.py", line 1275, in __init__
    _fileobj_ = Cdunif.CdunifFile(path, mode)
OSError: Variable not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/homes/f/forsyth/.local/lib/python3.7/site-packages/acme_diags/acme_diags_driver.py", line 275, in run_diag
    single_result = module.run_diag(parameters)
  File "/global/homes/f/forsyth/.local/lib/python3.7/site-packages/acme_diags/driver/zonal_mean_2d_driver.py", line 86, in run_diag
    with cdms2.open(mask_path) as f:
  File "/global/homes/f/forsyth/.conda/envs/e3sm_diags_env_6/lib/python3.7/site-packages/cdms2/dataset.py", line 497, in openDataset
    return CdmsFile(path, mode, mpiBarrier=CdMpi)
  File "/global/homes/f/forsyth/.conda/envs/e3sm_diags_env_6/lib/python3.7/site-packages/cdms2/dataset.py", line 1277, in __init__
    raise CDMSError('Cannot open file %s (%s)' % (path, err))
cdms2.error.CDMSError: Cannot open file /global/homes/f/forsyth/.conda/envs/e3sm_diags_env_6/share/e3sm_diags/acme_ne30_ocean_land_mask.nc (Variable not found)
```
So, it still fails, but now it's because of a `CDMSError`.
I also reran with `source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh` and `source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified_mpich.sh`. I got the same results (MPI error and success, respectively), as I mentioned in my previous comment.
Okay, so @forsyth2, these errors were with `nompi` for most packages, but with `mpich` and `mpi4py` included in the environment, right? It seems like maybe `cdms2` isn't happy when you do have `mpi4py` but don't have MPI versions of some library or other. It's not really possible for me to figure out by process of elimination which package needs to be MPI, because most packages have to match up: `esmf`, `esmpy`, `libnetcdf`, `hdf5`, etc. must all be `nompi` or all be `mpich`. The error message wasn't very helpful to me. It almost looks like it's saying the file it's trying to open doesn't exist. Can you at least verify that that's not the case?
@xylar The error was produced by the fourth environment you gave (under "What to investigate?"). Apparently that file doesn't exist; in fact, the directory `/global/homes/f/forsyth/.conda/envs/e3sm_diags_env_6/share/e3sm_diags/` doesn't seem to exist at all. I'm not sure why CDMS doesn't fail when `source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified_mpich.sh` is used, though.
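One quick way to check this in any environment (a sketch I'm adding, based on the paths in the error messages above, which show e3sm_diags looking for its mask file under `<env prefix>/share/e3sm_diags`):

```python
import sys
from pathlib import Path

# With the conda env activated, sys.prefix is the environment's prefix.
share_dir = Path(sys.prefix) / "share" / "e3sm_diags"
mask = share_dir / "acme_ne30_ocean_land_mask.nc"

print(share_dir, "exists:", share_dir.is_dir())
print(mask, "exists:", mask.is_file())
```

If the directory is missing, the environment's data files never got installed, which would point to an installation problem rather than an MPI problem.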
@forsyth2 and @chengzhuzhang, okay, this seems outside my area of expertise, but if I were you, I would investigate further why the expected files aren't being created in the "fourth" environment, the one with `mpi4py`, `mpich`, and most libraries `nompi`.
So far, it seems like a feasible solution might be making the `mpich` version of `e3sm-unified`, rather than the `nompi` version, the default. I'm hesitant to do this, though, because I think I've had trouble actually running MPI jobs using conda-forge `mpich` on Cori compute nodes in the past, and I would be surprised if that has changed. If we do decide to make `mpich` the default, we would need to do some careful testing of not just E3SM_Diags but any packages that use MPI, to make sure they work as expected on compute nodes on all the supported systems.
I still think it's worth investigating, maybe with help from the CDAT folks, why things go wrong in an environment with `mpi4py` and `mpich` unless we have `mpich` versions of all the packages. For other packages, this should be okay as long as they know not to use MPI. I think this goes back to what I pointed out above: CDAT tries to figure out whether to use MPI by seeing if `mpi4py` is installed. But the default E3SM-Unified installation tries to have all the libraries `nompi`, and it only includes `mpi4py` because it's a required dependency of a package called `ilamb`. Let me know if a video chat next week on this would be helpful.
@xylar Thank you, Xylar. I finally have more time to work on this issue. One thing I'm not clear on is why this issue is only seen on haswell or knl, but not on a login node. Any insight? Also, it seems like running on Compy nodes was not an issue.
@chengzhuzhang, so far, the symptoms point to an incompatibility between conda-forge OpenMPI and Cori compute nodes that doesn't exist for MPICH. I am surprised that MPICH seems to work on Cori nodes so far. That hadn't been my experience in the past. The MPI settings must be different on Compy, such that OpenMPI works on compute nodes. Similarly, Cori's login nodes are a different CPU type and maybe a different version of the OS. Their MPI configuration is also almost certainly different from the compute nodes. Any or all of these could play a role. I don't have the expertise to have a good, concrete explanation, so it points to us needing to test quite a lot more than we have in the past to make sure E3SM_Diags works on all systems on both login and compute nodes.
@xylar I did some more investigation. It is almost certain that CDMS or one of its dependencies is causing the problem. The issue can be reproduced simply by:

```
salloc --nodes=1 --partition=debug --time=00:30:00 -C haswell
source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh
python -c "import cdms2"
```

Then I tried your instructions to generate a dev env that has both `cdms2` and `mpi4py`. The same `MPI_init_thread` error only occurred when `mpi_openmpi` builds of libnetcdf and the other libraries were in the environment. When `mpi_mpich` builds were forced, the problem was gone.
Then I thought I should try my luck with the newly released cdat 8.2.1, because I had the impression that it works with the `mpi_openmpi` version of libnetcdf, but not the `mpi_mpich` variants. Unfortunately, the same error occurs on cori haswell with `cdms` and `openmpi`.
I don't know if, as you pointed out, https://github.com/CDAT/cdms/blob/master/Lib/tvariable.py#L26-L32 could be the cause.
I used the following to generate the env:

```
conda create -y -n cdms_v821_mpi4py_openmpi_py37 -c cdat/label/v8.2.1 -c conda-forge -c defaults python=3.7 mpi4py cdms2 "libnetcdf=*=mpi_openmpi_*" "esmf=*=mpi_openmpi_*" "esmpy=*=mpi_openmpi_*"
```
The error can be reproduced using:
```
salloc --nodes=1 --partition=debug --time=00:30:00 -C haswell
conda activate cdms_v821_mpi4py_openmpi_py37
python -c "import cdms2"
```
@muryanto1 @jasonb5 I think you have been dealing with compatibility issues between cdms and MPI variants. It would be much appreciated if you could provide some insight into this issue. Thanks!
@chengzhuzhang, that's helpful information. I do recall discussing MPI with @muryanto1, and something about `cdms2` not being compatible with MPICH. I think I mentioned that might be a problem for us. My guess is that the issue with OpenMPI isn't necessarily a `cdms2` issue in this case. It may be that OpenMPI just doesn't work on Cori compute nodes at all, and `cdms2` just happens to be importing it.
As I said above, I'm hesitant to make the `mpich` environment the default, though this may be our best bet in the end. But I could switch from `openmpi` to `mpich` in the default environment, with most packages being the `nompi` variants. Could you and @forsyth2 try to debug the errors in an environment with this setup?

```
conda create -y -n cdms_v82_mpi4py_nompi_py37 -c conda-forge -c defaults -c cdat/label/v82 python=3.7 mpi4py mpich cdms2 "libnetcdf=*=nompi_*" "esmf=*=nompi_*" "esmpy=*=nompi_*"
```
@xylar With the `mpich` and `nompi` variants of the packages, using:

```
conda create -y -n cdms_v82_mpi4py_nompi_py37 -c conda-forge -c defaults -c cdat/label/v82 python=3.7 mpi4py mpich cdms2 "libnetcdf=*=nompi_*" "esmf=*=nompi_*" "esmpy=*=nompi_*"
```

there was no `MPI_init_thread` error when importing cdms2. And using the same combination to generate the e3sm_diags dev env, with

```
conda create -y -n e3sm_diag_mpi4py_nompi_py37 -c cdat/label/v82 -c conda-forge -c defaults python=3.7 "cdp>=1.6.0" "vcs>=8.2" "vtk-cdat=8.2.0.8.2" "vcsaddons>=8.2" "dv3d>=8.2" "cdms2>=3.1.4" "cdutil>=8.2" "genutil>=8.2" "cdtime>=3.1.2" numpy matplotlib "cartopy>=0.18.0" beautifulsoup4 lxml mpi4py mpich "libnetcdf=*=nompi_*" "esmf=*=nompi_*" "esmpy=*=nompi_*" "hdf5=*=nompi_*" "dask=2.15.0"
```

I had e3sm_diags run successfully on haswell.
@forsyth2 I was not able to reproduce the `CDMSError` you had with this environment. It almost seems like an installation problem. Could you maybe re-install `e3sm-diags` and try running again?
@chengzhuzhang

```
conda create -y -n cdms_v82_mpi4py_nompi_py37 -c conda-forge -c defaults -c cdat/label/v82 python=3.7 mpi4py mpich cdms2 "libnetcdf=*=nompi_*" "esmf=*=nompi_*" "esmpy=*=nompi_*"
salloc --nodes=1 --partition=debug --time=00:30:00 -C haswell
conda activate cdms_v82_mpi4py_nompi_py37
python -c "import cdms2"
```

The above does not produce an error.

```
cd /e3sm_diags/tests/system
conda create -y -n e3sm_diag_mpi4py_nompi_py37 -c cdat/label/v82 -c conda-forge -c defaults python=3.7 "cdp>=1.6.0" "vcs>=8.2" "vtk-cdat=8.2.0.8.2" "vcsaddons>=8.2" "dv3d>=8.2" "cdms2>=3.1.4" "cdutil>=8.2" "genutil>=8.2" "cdtime>=3.1.2" numpy matplotlib "cartopy>=0.18.0" beautifulsoup4 lxml mpi4py mpich "libnetcdf=*=nompi_*" "esmf=*=nompi_*" "esmpy=*=nompi_*" "hdf5=*=nompi_*" "dask=2.15.0"
salloc --nodes=1 --partition=debug --time=00:30:00 -C haswell
conda activate e3sm_diag_mpi4py_nompi_py37
python all_sets.py -d all_sets.cfg
```

The above produces the CDMS error (`cdms2.error.CDMSError: Cannot open file /global/homes/f/forsyth/.conda/envs/e3sm_diag_mpi4py_nompi_py37/share/e3sm_diags/acme_ne30_ocean_land_mask.nc (Variable not found)`) again.
Considering I got the error again when using the steps above and you didn't, it does seem like it's a problem on my end, but I'm not sure what's going on.
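One thing worth checking (a sketch I'm adding, not from the thread): the traceback earlier mixes files from `~/.local/lib/python3.7/site-packages` and from the conda env, which can happen when an old `pip install --user` copy shadows the one in the active environment. A small helper shows which copy Python will actually import:

```python
import importlib.util


def package_location(name: str):
    """Return the file a package would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


# e.g. package_location("acme_diags"): a path under ~/.local/lib/... means
# you are not running the copy installed in the active conda env.
```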
Did you manage to run on Compy? I couldn't load the environment. I ran `source /compyfs/software/e3sm-unified/load_latest_e3sm_unified.sh` (from https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-compy.html), but that produces `/compyfs/software/e3sm-unified/load_latest_e3sm_unified.sh: No such file or directory`. Is there an updated directory for the `e3sm_unified` script? If so, we should update the docs.
@forsyth2 Hey Ryan, the `conda create` command only creates the development env for e3sm_diags; e3sm_diags itself then needs to be installed with `pip install .` from the local github repo before you try running it. Hope this fixes the problem.
Regarding Compy, I saw that Xylar has updated the activation path on Compy to `/share/apps/E3SM/conda_envs/`: https://acme-climate.atlassian.net/wiki/spaces/EIDMG/pages/780271950/Diagnostics+and+Analysis+Quickstart. Would you give it another try and fix our docs accordingly? Thanks!
@chengzhuzhang Thank you! That must have been what was causing the error. It runs successfully now. I completely forgot about the `pip install .` step; I guess I was still thinking of the unified environment scripts, which actually do load E3SM Diags.

```
salloc --nodes=1 --time=00:30:00
source /share/apps/E3SM/conda_envs/load_latest_e3sm_unified.sh
python all_sets.py -d all_sets.cfg
```

The above runs successfully on Compy.
Created #330 to update the Compy paths for E3SM Unified.
MPI_init_thread error when running e3sm_diags within e3sm_unified 1.3.1.1 on cori knl and haswell. Error message as below:
It was okay running on a cori login node. Running in the standalone e3sm_diags env is also fine.