Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License
496 stars 261 forks source link

netCDF fails on zarr file with variable length array #2516

Open JamiePringle opened 1 year ago

JamiePringle commented 1 year ago

On netcdf 4.8.1 when I try to use ncdump to print a zarr file with a variable length (jagged) array, it crashes with a segmentation fault. On 4.9.0 on a mac, the same command hangs indefinitely. The zarr files have no compression, this is a separate issue from #2484 . The attached code creates two zarr data stores. The first, ExampleNoVlen.zarr just has a simple array, and ncdump 'file://ExampleNoVlen.zarr#mode=nczarr,zarr' produces

netcdf ExampleNoVlen {
dimensions:
    .zdim_5 = 5 ;
variables:
    double x(.zdim_5) ;
data:

 x = 0, 1, 2, 3, 4 ;
}

The second data store created, ExampleVlen.zarr, when printed with ncdump 'file://ExampleVlen.zarr#mode=nczarr,zarr'`, dies with a segmentation fault. When the file is loaded with python and zarr, printing the variable x produces

In [2]: data=zarr.open('ExampleVlen.zarr')

In [3]: data.x[:]
Out[3]: 
array([array([], dtype=float64), array([0.]), array([0., 1.]),
       array([0., 1., 2.]), array([0., 1., 2., 3.])], dtype=object)

The code to produce the two files in python with zarr is attached.

Thanks, Jamie Pringle

break_ZarrNetCDF.zip

WardF commented 1 year ago

I am working on a feedstock issue right now for conda/libnetcdf, but will take a look at this as soon as I can. Tagging @DennisHeimbigner in case an immediate solution leaps out.

JamiePringle commented 1 year ago

thank you

DennisHeimbigner commented 1 year ago

In looking at your python, I realized that I do no know what your ragged array looks like. Netcdf-4 and hence NCZarr does not support true ragged arrays except using the HDF5 vlen type, which is not supported in Zarr. So how did you envision your ragged array being layed out in memory?

JamiePringle commented 1 year ago

I am using the standard mechanism in zarr -- see https://zarr.readthedocs.io/en/stable/tutorial.html#ragged-arrays in which the object array is encoded with object_codec=numcodecs.VLenArray. These codecs are discussed in https://numcodecs.readthedocs.io/en/stable/vlen.html . I do not know how the VlenArray codec works internally. It is the standard model for zarr.

Jamie

DennisHeimbigner commented 1 year ago

Unfortunately, the text you refer to is in the Zarr tutorial. It is not part of standard Zarr as defined by the Zarr V2 spec.

JamiePringle commented 1 year ago

@DennisHeimbigner and @WardF -- It may be as you say. Would you like me to file a bug-report with the zarr folks to add this to their upcoming standard revision?

I understand the frustration of dealing with incomplete standard specifications. However this raises a philosophical question. Do you wish to just do exactly what the (inevitably flawed) standard says, or do you want to embrace the standard as implemented. I understand the difficulties in a decision like this, especially when there is a discrepancy between the folkways of a standard and the focal specification.

But as it stands, for someone distributing data, there is one quasi-official way to do ragged arrays in zarr, and netCDF does not support it, despite HDF supporting ragged arrays.

Would you like me to raise this issue with Zarr?

Jamie

WardF commented 1 year ago

I think raising the question over at the Zarr repo would be a good idea; writing to the spec and clearly delineating 'undefined behavior' is a better approach for us, I think, than adopting an approach and pushing the Zarr folk into a corner (or ourselves, if the spec is updated to contradict whatever behavior we've implemented). The good news is that both @DennisHeimbigner and I are involved with the Zarr community and enhancement groups, so we can ask this question during our next conference call :).

That said, raising the issue with Zarr is probably a good idea as well, so that the question is coming from the community and not just the group of people already talking about this sort of thing on a regular basis :)

DennisHeimbigner commented 1 year ago

I agree that this issue should be raised with the Zarr community.

You note:

Do you wish to just do exactly what the (inevitably flawed) standard says,
or do you want to embrace the standard as implemented. 

The short answer is yes; I do want to do exactly what the standard says. That is the only way to ensure interoperability across implementations. What you call the standard as implemented is (I assume) the python implementation. The key point about specifications is that they are intended to be implementation independent. With respect to the numcodecs Vlen, that package is completely dependent on Python and its pickling mechanism, and there is no easy correspondence in, say, the C language. used for NCZarr. In order to create an interoperable version of Vlen, it is necessary to define a language independent specification for it and for pickling.

JamiePringle commented 1 year ago

Ok, I will raise it in zarr's GitHub issue page. Jamie

JamiePringle commented 1 year ago

see discussion in https://github.com/zarr-developers/zarr-specs/issues/160