Closed guidocioni closed 3 years ago
@guidocioni, not a stupid question at all. It's a bit tricky. Assuming you have added conda-forge
to your channels (and set channel priority to strict, see https://conda-forge.org/#about), this should work:
conda create -y -n test python=3.8 "netcdf4=*=mpi_openmpi_*"
Note that with the first = you are selecting any compatible version, and with the second = you are saying you want a specific build whose build string starts with mpi_openmpi_. The quotation marks (") are necessary, at least under Linux, to prevent the command-line parser from getting confused.
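For intuition, conda matches the build-string field after the second = as a glob pattern, much like Python's fnmatch. A small stdlib sketch (the nompi and mpich build strings below are made up for illustration; only the OpenMPI one comes from this thread):

```python
from fnmatch import fnmatch

builds = [
    "nompi_py38h2823cc8_105",      # no-MPI build (made-up string)
    "mpi_openmpi_py38hc52dea8_0",  # the OpenMPI build mentioned in the question
    "mpi_mpich_py38h9792ac2_0",    # an MPICH build (made-up string)
]

# The selector "mpi_openmpi_*" behaves like a glob over build strings,
# so only OpenMPI builds match:
selected = [b for b in builds if fnmatch(b, "mpi_openmpi_*")]
print(selected)  # → ['mpi_openmpi_py38hc52dea8_0']
```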
If you are trying to include a specific MPI build of netcdf4 in a recipe, let me know. That is a little trickier. See https://conda-forge.org/docs/maintainer/knowledge_base.html#message-passing-interface-mpi but feel free to ask more questions if you need to.
Wow, thanks for the info! I can confirm this is working on macOS and Ubuntu.
Just for info, what would be the difference in day-to-day use? I mainly use the dependency to load nc datasets in xarray. Would the xarray engine benefit from parallelization in open_dataset on a machine with many cores, or would the performance be almost identical? Again, this may be another stupid question :) I have a suite of scripts that programmatically download, process, and plot data from NWP models, and I'm constantly trying to optimize the loading/processing phase of nc files in Python. As of now I'm almost always using open_mfdataset with either the scipy or netcdf4 engine.
Thanks again
@guidocioni, glad I could help.
Just for info, what would be the difference in the day-to-day use?
I can't be sure, but I think netcdf4 with MPI support would be useful if you were trying to write NetCDF files out in parallel, perhaps using mpi4py to perform other operations in parallel in Python as well. To be honest, I always use the nompi version of netcdf4, even when I'm using mpich for other things. I found I needed to run any Python script with mpirun if it imported netCDF4 or xarray, even if I wasn't trying to use MPI parallelism in netCDF4 itself. I found this really annoying. So, as I said, I don't use the MPI version.
I mainly use the dependency to load nc datasets in xarray. Would the xarray engine benefit from parallelization in open_dataset on a machine with many cores or the performance would be almost identical? Again, this may be another stupid question :)
Not at all! That's a really good question. My understanding is that xarray will not benefit at all from the MPI version of netcdf4. It gets its parallelism via the dask and distributed packages, which use a different kind of parallelism. dask may be able to use MPI, I'm not sure about that, but I use it only with thread parallelism myself. Even if dask were using MPI parallelism, I'm not sure you would want the MPI version of netcdf4, since xarray may handle the I/O to NetCDF in a fancy way on its own. That would be something to ask the xarray list, or hunt around on Stack Overflow.
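To illustrate the kind of thread parallelism dask's default scheduler uses, here is a standard-library-only sketch; load_one is a hypothetical stand-in for whatever opens a single file (real code would call something like xr.open_dataset instead):

```python
from concurrent.futures import ThreadPoolExecutor

def load_one(path):
    # Hypothetical stand-in for opening one NetCDF file;
    # real code would do I/O here instead of returning a dict.
    return {"path": path, "vars": []}

paths = [f"model_run_{i}.nc" for i in range(4)]

# dask's threaded scheduler farms independent tasks out to a
# thread pool in conceptually the same way:
with ThreadPoolExecutor(max_workers=4) as pool:
    datasets = list(pool.map(load_one, paths))

print(len(datasets))  # → 4
```

This is only an analogy for where the speedup comes from (concurrent, independent per-file work), not a substitute for dask itself.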
I have a suite of scripts that programmatically download, process and plot data from NWP models and I'm constantly trying to optimize the loading/processing phase of nc files in Python. As of now I'm almost always using open_mfdataset with either scipy or netcdf4 engine.
I think you'll want to get tips from the xarray folks to help optimize that process. I don't think parallel NetCDF4 is going to help.
Closing this issue for now but feel free to re-open or ask more questions here as the need arises.
This may be a really stupid question, but how can I force the install of netcdf4 with MPI support? I can see mpi_openmpi_py38hc52dea8_0 in the list of packages, but I cannot install this specific build, and every time I install or upgrade netcdf4 the nompi version is downloaded, even though I have openmpi installed in my environment.