Closed guidocioni closed 3 years ago
@guidocioni, not a stupid question at all. It's a bit tricky. Assuming you have added conda-forge
to your channels (and set channel priority to strict, see https://conda-forge.org/#about), this should work:
conda create -y -n test python=3.8 "netcdf4=*=mpi_openmpi_*"
Note that with the first = you are selecting any compatible version, and with the second = you are saying you want a specific build whose build string starts with mpi_openmpi_. The quotation marks (") are necessary, at least under Linux, to prevent the command-line parser from getting confused.
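For intuition, conda matches the build-string field after the second = as a glob pattern, much like Python's fnmatch. A small stdlib sketch (the nompi and mpich build strings below are made up for illustration; only the OpenMPI one comes from this thread):

```python
from fnmatch import fnmatch

builds = [
    "nompi_py38h2823cc8_105",      # no-MPI build (made-up string)
    "mpi_openmpi_py38hc52dea8_0",  # the OpenMPI build mentioned in the question
    "mpi_mpich_py38h9792ac2_0",    # an MPICH build (made-up string)
]

# The selector "mpi_openmpi_*" behaves like a glob over build strings,
# so only OpenMPI builds match:
selected = [b for b in builds if fnmatch(b, "mpi_openmpi_*")]
print(selected)  # → ['mpi_openmpi_py38hc52dea8_0']
```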
If you are trying to include a specific MPI build of netcdf4 in a recipe, let me know. That is a little trickier. See https://conda-forge.org/docs/maintainer/knowledge_base.html#message-passing-interface-mpi but feel free to ask more questions if you need to.
Wow, thanks for the info! I can confirm this is working on macOS and Ubuntu.
Just for info, what would be the difference in day-to-day use? I mainly use the dependency to load nc datasets in xarray. Would the xarray engine benefit from parallelization in open_dataset on a machine with many cores, or would the performance be almost identical? Again, this may be another stupid question :) I have a suite of scripts that programmatically download, process, and plot data from NWP models, and I'm constantly trying to optimize the loading/processing phase of nc files in Python. As of now I'm almost always using open_mfdataset with either the scipy or netcdf4 engine.
Thanks again
@guidocioni, glad I could help.
Just for info, what would be the difference in the day-to-day use?
I can't be sure, but I think netcdf4 with MPI support would be useful if you were trying to write NetCDF files out in parallel, perhaps using mpi4py to perform other operations in parallel in Python as well. To be honest, I always use the nompi version of netcdf4, even when I'm using mpich for other things. I found I needed to run any Python script with mpirun if it imported netCDF4 or xarray, even if I wasn't trying to use MPI parallelism in netCDF4 itself. I found this really annoying. So, as I said, I don't use the MPI version.
I mainly use the dependency to load nc datasets in xarray. Would the xarray engine benefit from parallelization in open_dataset on a machine with many cores or the performance would be almost identical? Again, this may be another stupid question :)
Not at all! That's a really good question. My understanding is that xarray will not benefit at all from the MPI version of netcdf4. It gets its parallelism via the dask and distributed packages, which use a different kind of parallelism. dask may be able to use MPI, I'm not sure about that, but I use it only with thread parallelism myself. Even if dask were using MPI parallelism, I'm not sure you would want the MPI version of netcdf4, since xarray may handle the I/O to NetCDF in a fancy way on its own. That would be something to ask the xarray list, or hunt around on Stack Overflow.
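To illustrate the kind of thread parallelism dask's default scheduler uses, here is a standard-library-only sketch; load_one is a hypothetical stand-in for whatever opens a single file (real code would call something like xr.open_dataset instead):

```python
from concurrent.futures import ThreadPoolExecutor

def load_one(path):
    # Hypothetical stand-in for opening one NetCDF file;
    # real code would do I/O here instead of returning a dict.
    return {"path": path, "vars": []}

paths = [f"model_run_{i}.nc" for i in range(4)]

# dask's threaded scheduler farms independent tasks out to a
# thread pool in conceptually the same way:
with ThreadPoolExecutor(max_workers=4) as pool:
    datasets = list(pool.map(load_one, paths))

print(len(datasets))  # → 4
```

This is only an analogy for where the speedup comes from (concurrent, independent per-file work), not a substitute for dask itself.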
I have a suite of scripts that programmatically download, process and plot data from NWP models and I'm constantly trying to optimize the loading/processing phase of nc files in Python. As of now I'm almost always using open_mfdataset with either scipy or netcdf4 engine.
I think you'll want to get tips from the xarray folks to help optimize that process. I don't think parallel NetCDF4 is going to help.
Closing this issue for now but feel free to re-open or ask more questions here as the need arises.
This may be a really stupid question, but how can I force the install of netcdf4 with MPI support? I can see mpi_openmpi_py38hc52dea8_0 in the list of packages, but I cannot install this specific build, and every time I install or upgrade netcdf4 the nompi version is downloaded, even though I have openmpi installed in my environment.