CDAT / cdms


Problems with reading "big" arrays (>8.1Gb) #383

Closed durack1 closed 3 years ago

durack1 commented 4 years ago

Describe the bug I have hit a reproducible error where big arrays (>8.1Gb) are not read correctly: an array of zeros is returned instead of the real values. I was puzzled by this and got talking with @painter1, who hit the same problem and reported it via email in May 2019. The problem appears with arrays larger than 8.1Gb, and the original error was attributed to a bug in libnetcdf versions when handling big variables (from @painter1's notes/emails). @dnadeau4 and @doutriaux1 may recall some of the specific details. Note that I may not be using the latest versions of the libraries listed below.

To Reproduce Steps to reproduce the behavior:

  1. Install CDAT with: cdms2-3.1.4-py37ha6f5e91_3, libnetcdf-4.6.2-h303dfb8_1003, netcdf-fortran-4.4.5-h0789656_1004
  2. Execute the code attached (which reads larger and larger arrays)
  3. Watch as the summary stats go from real numbers to 0's once the array being read exceeds about 8.1Gb, which in the demo below happens at year 1989 (3rd step of the loop), when 26 years of data are read (the model grid has 60 vertical levels, 384 lat, 320 lon).

Expected behavior Big arrays should be read correctly, returning the actual data rather than arrays of zeros.

The code to reproduce this:

# imports
import sys
import cdat_info
import cdms2 as cdm
import numpy as np
from socket import gethostname

#%% Define function
def calcAve(var):
    print('type(var);',type(var),'; var.shape:',var.shape)
    # Start querying stat functions
    print('var.min():'.ljust(21),var.min())
    print('var.max():'.ljust(21),var.max())
    print('np.ma.mean(var.data):',np.ma.mean(var.data))  # Not mask aware: .data is the raw array, so masked points are included
    # Problem transientVariable.mean() function
    #print('var.mean():'.ljust(21),var.mean())
    print('-----')

#%% Load subset of variable
f = ['/p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/so/gn/v20190308/so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc']
# Try building up arrays stepping in a single year
times = np.arange(1991,1984,-1)
print('host:',gethostname())
print('Python version:',sys.version)
print('cdat env:',sys.executable.split('/')[5])
print('cdat version:',cdat_info.version()[0])
print('*****')
for timeSlot in times:
    for filePath in f:
        fH = cdm.open(filePath)
        print('filePath:',filePath.split('/')[-1])
        # Loop through single years
        start = timeSlot ; end = 2014
        print('times:',start,end,'; total years:',(end-start)+1)
        d1 = fH('so',time=(str(start),str(end)))
        print("Array size: %d Mb" % ( (d1.size * d1.itemsize) / (1024*1024) ) )
        calcAve(d1)
        del(d1)
        fH.close()
    print('----- -----')
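
The `np.ma.mean(var.data)` call in `calcAve` is flagged as not mask aware. A minimal sketch of what that means, using a small synthetic array rather than the CESM2 file; the fill value of 1e20 is an assumption chosen for illustration, not read from the data:

```python
import numpy as np

FILL = 1.0e20  # assumed fill value, for illustration only

# Two valid salinity-like values and two fill values standing in for land points.
raw = np.array([35.0, FILL, 34.0, FILL], dtype=np.float64)
masked = np.ma.masked_values(raw, FILL)

# .data returns the underlying array, so the fill values enter the mean.
raw_mean = np.ma.mean(masked.data)
# The masked array's own mean skips the masked points.
masked_mean = masked.mean()

print("mean over raw data:", raw_mean)
print("mask-aware mean:   ", masked_mean)
```

A mean dominated by fill values is therefore expected from that line of the script and is separate from the all-zero reads being reported.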

@pochedls @muryanto1 @downiec @jasonb5 @gabdulla @gleckler1 @lee1043 ping

muryanto1 commented 4 years ago

@durack1 I tried running the code with the latest cdms2 from cdat/label/nightly and the latest libnetcdf, and was able to reproduce the problem.

` cdat/label/nightly/linux-64::cdms2-3.1.4.2020.01.14.21.45.gee3f0ff-py37h34d3450_0
libnetcdf 4.7.3 nompi_h9f9fd6a_101 conda-forge
netcdf-fortran 4.5.2 nompi_h09cde99_103 conda-forge

$ curl https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o Miniconda3-latest-MacOSX-x86_64.sh

$ source miniconda3/etc/profile.d/conda.sh
$ conda activate base
$ conda activate nightly_py3.7

on aims1:
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3-latest-Linux-x86_64.sh
$ bash ./Miniconda3-latest-Linux-x86_64.sh -b -p miniconda3
$ source miniconda3/etc/profile.d/conda.sh
$ conda activate base
$ conda config --set channel_priority strict
$ conda config --add channels conda-forge
$ conda config --add channels cdat/label/nightly

$ conda create -n nightly_py3.7 cdat mesalib easydev nbsphinx myproxyclient testsrunner coverage pytest "python=3.7" -c cdat/label/nightly -c conda-forge
$ conda activate nightly_py3.7

# I put your code into a file: test_big_array.py
$ python ./test_big_array.py
host: aims1.llnl.gov
Python version: 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
[GCC 7.3.0]
cdat env: miniconda3
cdat version: 8
*****
filePath: so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc
times: 1991 2014 ; total years: 24
Array size: 7762 Mb
type(var); <class 'cdms2.tvariable.TransientVariable'> ; var.shape: (276, 60, 384, 320)
var.min():            6.940156
var.max():            48.25107
np.ma.mean(var.data): 4.2389736e+19
-----
----- -----
filePath: so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc
times: 1990 2014 ; total years: 25
Array size: 8100 Mb
type(var); <class 'cdms2.tvariable.TransientVariable'> ; var.shape: (288, 60, 384, 320)
var.min():            6.940156
var.max():            48.25107
np.ma.mean(var.data): 4.239067e+19
-----
----- -----
filePath: so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc
times: 1989 2014 ; total years: 26
Array size: 8437 Mb
type(var); <class 'cdms2.tvariable.TransientVariable'> ; var.shape: (300, 60, 384, 320)
var.min():            0.0
var.max():            0.0
np.ma.mean(var.data): 0.0
-----
----- -----
filePath: so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc
times: 1988 2014 ; total years: 27
Array size: 8775 Mb
type(var); <class 'cdms2.tvariable.TransientVariable'> ; var.shape: (312, 60, 384, 320)
var.min():            0.0
var.max():            0.0
np.ma.mean(var.data): 0.0
-----
----- -----
filePath: so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc
times: 1987 2014 ; total years: 28
Array size: 9112 Mb
type(var); <class 'cdms2.tvariable.TransientVariable'> ; var.shape: (324, 60, 384, 320)
var.min():            0.0
var.max():            0.0
np.ma.mean(var.data): 0.0
-----
----- -----
filePath: so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc
times: 1986 2014 ; total years: 29
Array size: 9450 Mb
type(var); <class 'cdms2.tvariable.TransientVariable'> ; var.shape: (336, 60, 384, 320)
var.min():            0.0
var.max():            0.0
np.ma.mean(var.data): 0.0
-----
----- -----
filePath: so_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc
times: 1985 2014 ; total years: 30
Array size: 9787 Mb
type(var); <class 'cdms2.tvariable.TransientVariable'> ; var.shape: (348, 60, 384, 320)
var.min():            0.0
var.max():            0.0
np.ma.mean(var.data): 0.0
-----
----- -----`

durack1 commented 4 years ago

@muryanto1 thanks for picking up and reproducing this issue. It'd be helpful to know whether @dnadeau4 or @doutriaux1 worked on a fix a while ago, and whether there are any open issues, branches, commits, or web documentation they can point us to for a resolution.

mzelinka commented 4 years ago

Thanks for documenting and reproducing this issue. I am also hitting this issue. I note that it also occurs at least as far back as CDAT2.10.

jasonb5 commented 4 years ago

@durack1 How was this file created?

durack1 commented 4 years ago

@jasonb5 it’s one of the CMIP6 contributed files, NCAR doesn’t use CMOR so not 100% sure what software was used to create it

durack1 commented 4 years ago

Folks, just an FYI @jasonb5 determined the issue and found a fix, and @muryanto1 has wrapped this up in the nightly builds - thanks guys!! So for bleeding edge bug fixes come and get it

lee1043 commented 4 years ago

@durack1 great to know the issue has been resolved. Thank you all for the effort!

pochedls commented 4 years ago

@jasonb5 and @muryanto1 - Thank you! For those of us who prefer more stability than the nightly build, is this slated for a release? 8.2.x? 8.3?

muryanto1 commented 4 years ago

@pochedls Yes, but we do not have a time frame yet; we are working on it.
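
Until a release is available, one possible stopgap is to read the variable in smaller time slices and join them afterwards, since the smaller reads in this thread returned valid data. This is only a sketch and is not the fix applied in the nightly builds; it has not been run against the affected versions. `chunk_bounds` and `read_in_chunks` are hypothetical helpers, and the per-read element limit of 2**31 - 1 is an assumption based on where the failures start.

```python
import numpy as np

INT32_MAX = 2**31 - 1  # assumed per-read element limit


def chunk_bounds(n_time, elements_per_step, max_elements=INT32_MAX):
    """Yield (start, stop) time-index pairs so each read stays within max_elements."""
    steps = max(1, max_elements // elements_per_step)
    for start in range(0, n_time, steps):
        yield start, min(start + steps, n_time)


def read_in_chunks(read_slice, n_time, elements_per_step, max_elements=INT32_MAX):
    """Call read_slice(start, stop) per chunk and concatenate along the time axis."""
    parts = [read_slice(a, b)
             for a, b in chunk_bounds(n_time, elements_per_step, max_elements)]
    # np.ma.concatenate keeps the masks; any cdms2 axis metadata is not carried over.
    return np.ma.concatenate(parts, axis=0)


# Assumed usage with cdms2 (not run here):
#   read_slice = lambda a, b: fH('so', time=slice(a, b))
#   so = read_in_chunks(read_slice, n_time=300, elements_per_step=60 * 384 * 320)
```

The result is a plain masked array, so time and grid axes would need to be reattached if later code depends on them.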

jasonb5 commented 4 years ago

Linking PR https://github.com/CDAT/cdms/pull/389, this will be available in CDAT 8.2.1.