Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

Possible bug relating to the setting of Variable chunksizes #1323

Open davidhassell opened 4 months ago

davidhassell commented 4 months ago

Hello,

I have found it impossible (at v1.6.5) to get netCDF4 to write out a file with the default chunking strategy - it either writes contiguously, or with explicitly set chunksizes, but never with the library's default chunks.

To test this I used the following function:

import netCDF4
import numpy as np

def write(**kwargs):
    nc = netCDF4.Dataset('chunk.nc', 'w')
    x = nc.createDimension('x', 80000)
    y = nc.createDimension('y', 4000)
    tas = nc.createVariable('tas', 'f8', ('y', 'x'), **kwargs)
    tas[...] = np.random.random(320000000).reshape(4000, 80000)
    print(tas.chunking())
    nc.close()

and ran it as follows:

In [2]: write()  # Not as expected - expected default chunking
contiguous
In [3]: !ncdump -sh chunk.nc | grep tas:
        tas:_Storage = "contiguous" ;
        tas:_Endianness = "little" ;

In [4]: write(contiguous=False)  # Not as expected - expected default chunking
contiguous
In [5]: !ncdump -sh chunk.nc | grep tas:
        tas:_Storage = "contiguous" ;
        tas:_Endianness = "little" ;

In [6]: write(contiguous=True)  # As expected 
contiguous
In [7]: !ncdump -sh chunk.nc | grep tas:
        tas:_Storage = "contiguous" ;
        tas:_Endianness = "little" ;

In [8]: write(chunksizes=(400, 8000))  # As expected 
[400, 8000]
In [9]: !ncdump -sh chunk.nc | grep tas:
        tas:_Storage = "chunked" ;
        tas:_ChunkSizes = 400, 8000 ;
        tas:_Endianness = "little" ;

Surely it's the case that if contiguous=False and chunksizes=None, then the netCDF default chunking strategy should be used?

I found that if I changed line https://github.com/Unidata/netcdf4-python/blob/v1.6.5rel/src/netCDF4/_netCDF4.pyx#L4307 to read:

                    if chunksizes is not None or not contiguous:  # was: if chunksizes is not None or contiguous

then I could get the default chunking to work as expected:

In [2]: write()  # With modified code
[308, 6154]
In [3]: !ncdump -sh chunk.nc | grep tas:
        tas:_Storage = "chunked" ;
        tas:_ChunkSizes = 308, 6154 ;
        tas:_Endianness = "little" ;

In [4]: write(contiguous=False)  # With modified code
[308, 6154]
In [5]: !ncdump -sh chunk.nc | grep tas:        
        tas:_Storage = "chunked" ;
        tas:_ChunkSizes = 308, 6154 ;

In [6]: write(contiguous=True) # With modified code 
contiguous
In [7]: !ncdump -sh chunk.nc | grep tas:
        tas:_Storage = "contiguous" ;
        tas:_Endianness = "little" ;

In [8]: write(chunksizes=(400, 8000))  # With modified code
[400, 8000]
In [9]: !ncdump -sh chunk.nc | grep tas:
        tas:_Storage = "chunked" ;
        tas:_ChunkSizes = 400, 8000 ;
        tas:_Endianness = "little" ;
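
The effect of the one-line change can be sketched in plain Python (a simplified illustration of the two boolean conditions only, not the actual Cython code):

```python
def chunking_call_needed(chunksizes, contiguous):
    """Return (original, proposed): whether nc_def_var_chunking would be
    called under each version of the condition in _netCDF4.pyx."""
    # Original: the C call happens only when explicit chunksizes are given
    # or contiguous=True, so contiguous=False with chunksizes=None never
    # reaches the library and storage stays contiguous.
    original = chunksizes is not None or contiguous
    # Proposed: also call through when contiguous=False, letting the
    # C library apply its default chunking strategy.
    proposed = chunksizes is not None or not contiguous
    return original, proposed

# The case from this issue: neither kwarg is given.
print(chunking_call_needed(None, False))         # (False, True)
print(chunking_call_needed(None, True))          # (True, False)
print(chunking_call_needed((400, 8000), False))  # (True, True)
```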

However, this might not be the best way to do things - what do you think?

Many thanks, David

>>> netCDF4.__version__
1.6.5
jswhit commented 4 months ago

The current code will not call nc_def_var_chunking at all if chunksizes=None and contiguous=False, which I would think would result in the library default chunking strategy.

jswhit commented 4 months ago

I think chunking is only used by default if there is an unlimited dimension. Try this:

import netCDF4
import numpy as np

def write(**kwargs):
    nc = netCDF4.Dataset('chunk.nc', 'w')
    x = nc.createDimension('x', 8000)
    y = nc.createDimension('y', 400)
    z = nc.createDimension('z', None)
    tas = nc.createVariable('tas', 'f8', ('z','y', 'x'), **kwargs)
    tas[0:10,:,:] = np.random.random(32000000).reshape(10,400, 8000)
    print(tas.chunking())
    nc.close()

write()
[1, 200, 4000]

so even if you specify contiguous=False you won't get chunking by default unless there is an unlimited dimension. If there is no unlimited dimension, you have to specify chunksizes to get chunking.

I can see how this can be confusing, since the default for the contiguous kwarg is False, yet the library default is contiguous storage unless there is an unlimited dimension. The netcdf4-python docs do say this, though: "Fixed size variables (with no unlimited dimension) with no compression filters are contiguous by default."

DennisHeimbigner commented 4 months ago

As near as I can tell, when a variable is created, it has default chunksizes computed automatically. Then, if nc_def_var_chunking is later called, those default sizes should get overwritten.

davidhassell commented 4 months ago

Thanks for the background, @jswhit and @DennisHeimbigner - it's very useful.

So, not a bug then, but maybe a feature request! Would it be possible for netCDF4-python to write a variable that has no unlimited dimensions using the default chunking strategy? I guess that you don't want to change the existing API, so perhaps this could be controlled by a new keyword to createVariable?

Thanks, David

jswhit commented 4 months ago

@davidhassell it is already being reported - variables with no unlimited dimension are not chunked by default (they are contiguous).

davidhassell commented 4 months ago

Hi @jswhit, I see that what I wrote was ambiguous - sorry! I'll try again:

I would like to create chunked variables, chunked with the netCDF default chunk sizes, that have no unlimited dimensions. As far as I can tell this is not currently possible, but would you be open to creating this option?

jswhit commented 4 months ago

@davidhassell thanks for clarifying, I understand now. Since the python interface doesn't have access to the default chunking algorithm in the C library, I don't know how this would be done. I'm open to suggestions though.

jswhit commented 4 months ago

A potential workaround that doesn't require an unlimited dimension is to turn on compression (zlib=True, complevel=1) or the Fletcher checksum algorithm (fletcher32=True).