E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
351 stars 360 forks source link

Scorpio fails with NULL strides for puts/gets with the NetCDF iotype #3821

Open aldivi opened 4 years ago

aldivi commented 4 years ago

In E3SM/externals/scorpio/src/clib/pio_getput_int.c, the functions PIOc_put_vars_tc and PIOc_get_vars_tc assume that the nc_putvars### and nc_getvars### functions will accept NULL pointers, but in this case they do not. These functions fail when (at least) the stride pointers are NULL. I had to allocate and set values to 1 and later deallocate the stride pointer so that I could read land data files.

This has been done for pnetcdf files, but only for multidimensional files and only in PIOc_put_vars_tc() (see lines 1456-1520). This should be extended to all pnetcdf and netcdf reads/writes.

This fail upon NULL pointers may depend on the netcdf library version (this problem occurred with version 4.6.2).

aldivi commented 4 years ago

The scorpio hash i used is: 5fefc94aa1b884763c69e144bf0979ae01aa6be0

The E3SM master hash I used is: 3a93b5bca7db7440f237941f1a366b9dacaf4ad2

jayeshkrishna commented 4 years ago

I modified the title to align with the description of the issue.

jayeshkrishna commented 4 years ago

@aldivi : Can you add a snippet of code that shows the call that failed (You don't need to provide a test case, just the Scorpio API call and a brief description of the values of the arguments in your call)?

aldivi commented 4 years ago

The call sequence that fails for for reading data starts with trying to read a land mask (last land log message):

Attempting to read global land mask from /home/elmuser/data/share/domains/domain.clm/domain.lnd.1x1pt-brazil_navy.090715.nc

The ELM call is in clm_initializeMod.F90 (snippet starts at line 134):

! ------------------------------------------------------------------------
! Read in global land grid and land mask (amask)- needed to set decomposition
! ------------------------------------------------------------------------

! global memory for amask is allocate in surfrd_get_glomask - must be
! deallocated below
if (masterproc) then
   write(iulog,*) 'Attempting to read global land mask from ',trim(fatmlndfrc)
   call shr_sys_flush(iulog)
endif
call surfrd_get_globmask(filename=fatmlndfrc, mask=amask, ni=ni, nj=nj)

The surfrd_get_globmask() function is in surfrfMod.F90, and calls ncd_io() (snippet starts at line 103):

if (isgrid2d) then allocate(idata2d(ni,nj)) idata2d(:,:) = 1 call ncd_io(ncid=ncid, varname='LANDMASK', data=idata2d, flag='read', readvar=readvar) if (.not. readvar) then call ncd_io(ncid=ncid, varname='mask', data=idata2d, flag='read', readvar=readvar) end if if (readvar) then do j = 1,nj do i = 1,ni n = (j-1)*ni + i
mask(n) = idata2d(i,j) enddo enddo end if deallocate(idata2d) else call ncd_io(ncid=ncid, varname='LANDMASK', data=mask, flag='read', readvar=readvar) if (.not. readvar) then call ncd_io(ncid=ncid, varname='mask', data=mask, flag='read', readvar=readvar) end if end if

There appear to be several layers between this and the scorpio C calls, and I am not able to trace through them, partially because there are several different pio modules.

I am fairly sure that the call is to read a double PIOc_get_vara_double() because that is the data type for the variables in this file. This calls PIOc_get_vars_tc(), and the code that fails starts on line 743, and it fails because no stride is present and a NULL pointer is passed to nc_get_vars_double():

    if (file->iotype != PIO_IOTYPE_PNETCDF && file->do_io)
        switch(xtype)
        {

ifdef _NETCDF

        case NC_BYTE:
            ierr = nc_get_vars_schar(file->fh, varid, (size_t *)start, (size_t *)count,
                                     (ptrdiff_t *)stride, buf);
            break;
        case NC_CHAR:
            ierr = nc_get_vars_text(file->fh, varid, (size_t *)start, (size_t *)count,
                                    (ptrdiff_t *)stride, buf);
            break;
        case NC_SHORT:
            ierr = nc_get_vars_short(file->fh, varid, (size_t *)start, (size_t *)count,
                                     (ptrdiff_t *)stride, buf);
            break;
        case NC_INT:
            ierr = nc_get_vars_int(file->fh, varid, (size_t *)start, (size_t *)count,
                                   (ptrdiff_t *)stride, buf);
            break;
        case PIO_LONG_INTERNAL:
            ierr = nc_get_vars_long(file->fh, varid, (size_t *)start, (size_t *)count,
                                    (ptrdiff_t *)stride, buf);
            break;
        case NC_FLOAT:
            ierr = nc_get_vars_float(file->fh, varid, (size_t *)start, (size_t *)count,
                                     (ptrdiff_t *)stride, buf);
            break;
        case NC_DOUBLE:
            ierr = nc_get_vars_double(file->fh, varid, (size_t *)start, (size_t *)count,
                                      (ptrdiff_t *)stride, buf);
            break;

endif

ifdef _NETCDF4

        case NC_UBYTE:
            ierr = nc_get_vars_uchar(file->fh, varid, (size_t *)start, (size_t *)count,
                                     (ptrdiff_t *)stride, buf);
            break;
        case NC_USHORT:
            ierr = nc_get_vars_ushort(file->fh, varid, (size_t *)start, (size_t *)count,
                                      (ptrdiff_t *)stride, buf);
            break;
        case NC_UINT:
            ierr = nc_get_vars_uint(file->fh, varid, (size_t *)start, (size_t *)count,
                                    (ptrdiff_t *)stride, buf);
            break;
        case NC_INT64:
            LOG((3, "about to call nc_get_vars_longlong"));
            ierr = nc_get_vars_longlong(file->fh, varid, (size_t *)start, (size_t *)count,
                                        (ptrdiff_t *)stride, buf);
            break;
        case NC_UINT64:
            ierr = nc_get_vars_ulonglong(file->fh, varid, (size_t *)start, (size_t *)count,
                                         (ptrdiff_t *)stride, buf);
            break;
            /* case NC_STRING: */
            /*      ierr = nc_get_vars_string(file->fh, varid, (size_t *)start, (size_t *)count, */
            /*                                (ptrdiff_t *)stride, (void *)buf); */
            /*      break; */

endif / _NETCDF4 /

        default:
            return pio_err(ios, file, PIO_EBADTYPE, __FILE__, __LINE__,
                            "Reading variable (%s, varid=%d) from file (%s, ncid=%d) failed. Unsupported variable type (type=%x)", pio_get_vname_from_file(file, varid), varid, pio_get_fname_from_file(file), ncid, xtype);
        }
}

E3SM fails with this error in the log:

PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Illegal stride (/home/elmuser/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 809)

I just noticed that the pio2 code that is outside of scorpio incorporates the fake_stride pointer to deal with this issue.

I did not get this error on Cori, and I am not sure which netcdf version E3SM uses there. But if I module load cray-netcdf, then version 4.6.3 is loaded.

jayeshkrishna commented 4 years ago

Thanks, looks like an issue with NetCDF 4.6.2. As per the NetCDF documentation NULL strides should not return an illegal stride error. However we might need to handle this issue (even if its a bug in a specific version of the NetCDF lib) and add a fix in Scorpio.