edhartnett commented 6 years ago

Introduction

This ticket described a plan to include the functions of the PIO library into the netCDF library, so that it may be (optionally) available to all netCDF HPC (High Performance Computing) users. The HPC community is a core component of the Unidata community, and better support of modern supercomputers will allow netCDF to better server these users.

The Parallel I/O library (https://github.com/) as developed at NCAR. It provides advances HPC I/O functionality for netCDF codes, with a netCDF API. The PIO library allows HPC users to make the most of their computational hardware, without waiting on I/O more than is absolutely essential.

The goal of this effort is to make the PIO functionality available to netCDF users, through the NetCDF API.

Main Features

Allows user to specify an I/O component, composed of N processors, which performs all I/O.
Allows users to specify one or more computation components, each containing a specified number of processors.
When computation components call netCDF functions, I/O is actually performed by I/O component.
NetCDF variable data may be distributed across processors in a computational component so that each processor only needs to manage its local data space.
Buffering of I/O in memory of computational components, and memory of I/O components.
Asynchronous writes of data from computational components.

Computational Components and I/O Component

In async mode, it is possible for the user to define several computational components, and one (shared) I/O component.

Each component can have an arbitrary number of processors.

In computational components, netCDF calls actually send data to the I/O component, which buffers data and handles disk I/O.

For example, a machine of 500K processors can assign 1K processors to I/O, 250K processors to an atmospheric model, 100K processors to an ocean model, 10K processors to a space weather model, etc., until all 500K processors are used up. Each of the models can do netCDF I/O as usual, but all actual I/O will be channeled through the I/O component. Data are buffered at all stages to improve performance.

Configuration and Build

The PIO functionality is only included if netCDF is built with --enable-pio. It makes no sense to use PIO on systems without multiple cores, but this is not enforced by the build, and tests will build and pass even on a single core.

The PIO functionality is available through the netCDF internal dispatch system. This allows the use of the netCDF API by various sublibraries (netCDF4, HDF4, pnetCDF, etc.) The PIO functionality is accessed in the same way.

The code for the integrated PIO functionality is in subdirectory libpio.

Testing of PIO

The pio_test directory contains tests for PIO. These tests will only be run of --enable-parallel-tests is specified.

Using PIO

In order to use PIO, the user must include the NC_PIO flag in the mode of nc_create/nc_open.

Additions to the NetCDF Data Model

PIO introduces two new objects to the netCDF data model:

the iosystem, which describes the entire current hardware arrangement (how many tasks, which ones to be I/O tasks, which ones to be computational components). Identified with an iosystemid.
the decompositon, which describes how the data from one or more vars is to be distributed across processors for distributed read/writes. Identified with ioid.

New Functions for the API

Several new functions need to be added to the netCDF API to support PIO functionality. However, it is not necessary to add them to the netCDF dispatch table, since other netCDF sub-libraries will never need to implement these functions.

Initialization

There are two new PIO initialization functions, one for async, one for non-async (The names of these functions will be changed to match the netCDF API). A finalize call must be called at the end of all processing.

    int PIOc_Init_Intercomm(int component_count, MPI_Comm peer_comm, MPI_Comm *comp_comms,
                            MPI_Comm io_comm, int *iosysidp);
    int PIOc_init_async(MPI_Comm world, int num_io_procs, int *io_proc_list, int component_count,
                        int *num_procs_per_comp, int **proc_list, MPI_Comm *io_comm, MPI_Comm *comp_comm,
                        int rearranger, int *iosysidp);

   int PIOc_finalize(int iosysid);

Decompositions

Decompositions define how a netCDF variable is distributed across processors on the supercomputer.

    /* Init decomposition with 0-based compmap array. */
    int PIOc_init_decomp(int iosysid, int pio_type, int ndims, const int *gdimlen, int maplen,
                         const PIO_Offset *compmap, int *ioidp, int rearranger,
                         const PIO_Offset *iostart, const PIO_Offset *iocount);

    /* Free resources associated with a decomposition. */
    int PIOc_freedecomp(int iosysid, int ioid);

For convenience there are also two functions to write/read a netCDF file which records the decomposition information. Use of decomposition files is helpful for debugging, but not generally used in real processing.

    /* Write a decomposition file using netCDF. */
    int PIOc_write_nc_decomp(int iosysid, const char *filename, int cmode, int ioid,
                             char *title, char *history, int fortran_order);

    /* Read a netCDF decomposition file. */
    int PIOc_read_nc_decomp(int iosysid, const char *filename, int *ioid, MPI_Comm comm,
                            int pio_type, char *title, char *history, int *fortran_order);

Record Reads/Writes on Distributed Arrays

    int PIOc_advanceframe(int ncid, int varid);
    int PIOc_setframe(int ncid, int varid, int frame);
    int PIOc_write_darray(int ncid, int varid, int ioid, PIO_Offset arraylen, void *array,
                          void *fillvalue);
    int PIOc_read_darray(int ncid, int varid, int ioid, PIO_Offset arraylen, void *array);

There are read/write functions to read/write a record of a distibuted array. (A record is defined by writing 1 element of the first dimension, and all elements of subsequent dimensions. In a classic netCDF file with an unlimited dimension, this is a record. But with PIO, the first dimension does not have to be unlimited.)

Schedule

I anticipate the first version of this integration, supporting netCDF classic files, will be ready before the end of 2017. If so, it can be included in the 4.5.1 release of netCDF.

Work Details

This work can be followed in detail on the GitHub repo: https://github.com/edhartnett/netcdf-c

DennisHeimbigner commented 6 years ago

A question: is e.g. PIOc_read_nc_decomp intended to be called by the user or will it be wrapped by some existing netcdf-API function(s).

edhartnett commented 6 years ago

The read/write decomp functions read/write a special netCDF file that contains the decomposition information for PIO. So you can create a decomposition, and save it to file, and use it over and over.

These functions will be wrapped in functions to make them look more netCDF-like. (like nc_put_decomp()?)

DennisHeimbigner commented 6 years ago

Would this work? The decomposition information can be passed to create and open methods in the dispatch table using the "parameters" argument and then stored in the NC.dispatchdata field. Then when the vara/vars methods in the dispatch table are called for PIO, that code can extract the decomp info and do the proper thing with it. to invoke as needed PIOc_read/write_nc_decomp.

edhartnett commented 6 years ago

What I would like to do is come in and brief you and Ward on PIO and how it works. There are some details, but it is fitting into the netCDF API very well.

In terms of decompositions, it would not be unusual to have several in use at one time. The user initializes them in an init_decomp() call. The read/write decomp need never be called, they are just convenience functions to allow easy recording and communication of what decomposition was used. (Doesn't matter for the contents of the data, just how the data are distributed over a particular hardware configuration.)

PIO introduced two new objects to the netCDF data model:

the iosystem, which describes the entire current hardware arrangement (how many tasks, which ones to be I/O tasks, which ones to be computational components). Identified with an iosystemid.
the decompositon, which describes how the data from one or more vars is to be distributed across processors for distributed read/writes. Identified with ioid.

The iosystemid is pretty easy to hide, as long as there is just one (and that is typical).

The decomposition IDs are more plentiful and must be directly controlled by the user. A variable may have a different read and write decomposition (to manage halo effects, for example).

DennisHeimbigner commented 6 years ago

Well, as you know, you can create an NC_PIO_INFO structure per-open/created-file and stash whatever you need into it, including decompositions and iosystemids and other pio specific objects.

When you say

The decomposition IDs are more plentiful and must be directly controlled by the user. A variable may have a different read and write decomposition ... what do you mean by directly controlled by the user? Given sufficient info at file open/create time can the proper decomp to use be automatically determined when later needed?

edhartnett commented 6 years ago

Unfortunately, the decomps cannot be automatically determined. They are how PIO is able to be tuned to the specific hardware for optimum performance.

I will get my base implementation together, and if we can come up with any way to better hide the decomps, I am open to it.

DennisHeimbigner commented 6 years ago

Sorry, I was not clear. I am not asking to automatically determine the decomps, but rather to ask if which decomp(s) to use can be determined from information such as the complete set of available decomps + specific variable to write plus slab information. In order words, given the usual argument to vara, can I determine which decomp to use?

edhartnett commented 6 years ago

It certainly would be possible to associate a default decomp for each var read and write. But note that the arguments to vara change anyway. Note that distributed arrays only work with a "record" (which is a looser definition of record than you are used to.)

So there is no need for the user to specify start/count. An entire record is read/written at one time. The processor only gets/puts the local portion of the global array, based on the decomposition.

edhartnett commented 6 years ago

I have taken down my PR with this feature because I now see a better way to implement this, one that will not require taking PIO code into the netCDF library codebase. Instead, it can be treated like pnetcdf, as an optional interface that can be turned on at configure, used with NC_PIO in the mode flag, but does not involve PIO code in netCDF. Just some wrappers, as with pnetcdf.

This requires some work on PIO first. When I get that done I will circle around again and put up a PIO PR.

edhartnett commented 6 years ago

I think this functionality may best be brought to users via a user-dispatch library, or via HPC NetCDF-C, or through some other mechanism. Interested users should contact me directly.

edhartnett commented 5 years ago

OK, this capability has been added to PIO, and it works great! ;-)

Unidata / netcdf-c

Add multi-level parallelism, async writes, and other HPC features from PIO library #555

Introduction

Main Features

Computational Components and I/O Component

Configuration and Build

Testing of PIO

Using PIO

Additions to the NetCDF Data Model

New Functions for the API

Initialization

Decompositions

Record Reads/Writes on Distributed Arrays

Schedule

Work Details