Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

NCZarr does not support reading of many zarr files #2474

Open dzenanz opened 2 years ago

dzenanz commented 2 years ago

I am trying to read sample zarr files provided here: https://github.com/ome/ome-ngff-prototypes#data-availability

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tczyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/zyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h https://s3.embl.de/i2k-2020/ngff-example-data/v0.4#mode=zarr,file
ncdump: https://s3.embl.de/i2k-2020/ngff-example-data/v0.4#mode=zarr,file: NetCDF: Malformed URL
dzenan@corista:~/tester$ ncdump -h https://s3.embl.de/i2k-2020/ngff-example-data/v0.4
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <?xml^ version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>ngff-example-data/v0.4.dds</Key><BucketName>i2k-2020</BucketName><Resource>/i2k-2020/ngff-example-data/v0.4.dds</Resource><RequestId>17096CA8F2A3955C</RequestId><HostId>083e6bd6-5f67-4d7a-bf85-13f52b390395</HostId></Error>
ncdump: https://s3.embl.de/i2k-2020/ngff-example-data/v0.4: NetCDF: file not found
dzenan@corista:~/tester$ ncdump -h s3:https://s3.embl.de/i2k-2020/ngff-example-data/v0.4
ncdump: s3:https://s3.embl.de/i2k-2020/ngff-example-data/v0.4: No such file or directory
dzenan@corista:~/tester$ ncdump -h s3://s3.embl.de/i2k-2020/ngff-example-data/v0.4
ncdump: s3://s3.embl.de/i2k-2020/ngff-example-data/v0.4: NetCDF: Attempt to use feature that was not turned on when netCDF was built.

Am I doing something wrong? I was under the impression that NCZarr would read most Zarr files. What are the offending features in these examples that break NCZarr?

Ubuntu 22.04.1 LTS, GCC 11.2.0, libnetcdf-dev Version: 1:4.8.1-1.

dzenanz commented 2 years ago

@DennisHeimbigner do you have some insight?

DennisHeimbigner commented 2 years ago

Sorry for the delay. It appears to be related to the use of '/' as the dimension separator. I am trying to track down exactly what is happening.

DennisHeimbigner commented 2 years ago

I was wrong. The problem is this (it's complicated):

  1. array "s0" meta-data specifies this:
    • shape is [4,930,1024]
    • _ARRAY_DIMENSIONS is ["c","y","x"]
  2. array "s1" meta-data specifies this:
    • shape is [4,465,512]
    • _ARRAY_DIMENSIONS is ["c","y","x"]

Since both are in the same group (the root group), this is an inconsistency because one variable says y=930 and the other says y=465. You might contact the Xarray community to see how this contradiction should be resolved.
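The conflict can be shown with a small consistency check over the per-array metadata. This is a sketch, not netcdf-c code: the function name is illustrative, and the `s0`/`s1` values are copied from the comment above.

```python
# Sketch: detect inconsistent _ARRAY_DIMENSIONS <-> shape mappings
# within one group. Not part of netcdf-c; for illustration only.

def check_dim_consistency(arrays):
    """Map each dimension name to a size; report names bound to two sizes."""
    sizes = {}      # dimension name -> first size seen
    conflicts = []  # (name, first size, conflicting size)
    for meta in arrays.values():
        for name, size in zip(meta["_ARRAY_DIMENSIONS"], meta["shape"]):
            if name in sizes and sizes[name] != size:
                conflicts.append((name, sizes[name], size))
            else:
                sizes.setdefault(name, size)
    return conflicts

group = {
    "s0": {"shape": [4, 930, 1024], "_ARRAY_DIMENSIONS": ["c", "y", "x"]},
    "s1": {"shape": [4, 465, 512], "_ARRAY_DIMENSIONS": ["c", "y", "x"]},
}
print(check_dim_consistency(group))
# -> [('y', 930, 465), ('x', 1024, 512)]
```

Only "c" is consistent across the two arrays; "y" and "x" each bind one name to two different sizes, which is what NCZarr (and Xarray) object to.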

dzenanz commented 2 years ago

OME NGFF specifies that "Metadata about an image can be found under the "multiscales" key in the group-level metadata." As far as I can tell, multiple arrays are designed to have different dimensions to represent different levels of the image pyramid. @joshmoore Am I interpreting this correctly?

Does this conflict with NCZarr's constraints? Is there a way to reconcile the two? Or is NCZarr a bad choice as a Zarr backend for an NGFF implementation?

DennisHeimbigner commented 2 years ago

I do not know anything about the OME standard, but why do you think it generates legal Zarr version 2 files, never mind NCZarr?

joshmoore commented 2 years ago

Since both are in the same group (the root group), this is an inconsistency because one variable says y=930 and the other says y=465. You might contact the Xarray community to see how this contradiction should be resolved.

@DennisHeimbigner: but that should only be the case for mode=nczarr, no? There's no such restriction in Zarr itself.

As a side note, @dzenanz, https://github.com/ome/ngff/pull/114 is intended to play nicely with xarray and nczarr.

DennisHeimbigner commented 2 years ago

It is actually an Xarray problem, since the _ARRAY_DIMENSIONS attribute is not recognized in Zarr V2 (or has that changed?). But it should be the case that the sizes assigned to dimension names are consistent across all uses of those names, correct?

joshmoore commented 2 years ago

It is actually an XArray problem since the _ARRAY_DIMENSIONS attribute is not recognized in Zarr V2 (or has that changed?)

It's definitely true that Zarr v2 is agnostic to _ARRAY_DIMENSIONS, but why should that impact mode=zarr filesets?

But it should be the case that the names assigned to shapes should be consistent across all uses of those dimensions names, correct?

Not in Zarr.

DennisHeimbigner commented 2 years ago

I was speaking more about NCZarr and Xarray than Zarr, because Zarr has no dimension names at all. Perhaps the interesting question is: what happens when that file is read by the Xarray package? What values does Xarray assign to the dimension names?

dzenanz commented 2 years ago

I assume that the highest resolution is used by default in Xarray.

DennisHeimbigner commented 2 years ago

In any case, try changing the access url to include the following: ".......#mode=zarr,noxarray,ZZZ" where ZZZ is either file or zip or s3 depending on what storage format you are reading.
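The fragment syntax Dennis describes can be sketched with a small helper that assembles the URL. The function name is hypothetical and not part of any netcdf-c API; it only encodes the rule stated above (mode terms joined by commas, with the storage term being file, zip, or s3).

```python
# Sketch: build a netCDF access URL with a "#mode=..." fragment.
# Illustrative helper only; netcdf-c parses these fragments itself.

def zarr_access_url(path, storage, noxarray=True):
    if storage not in ("file", "zip", "s3"):
        raise ValueError("storage must be file, zip, or s3")
    terms = ["zarr"]
    if noxarray:
        terms.append("noxarray")  # ignore _ARRAY_DIMENSIONS attributes
    terms.append(storage)
    return f"{path}#mode={','.join(terms)}"

print(zarr_access_url("file:///data/example.zarr", "file"))
# -> file:///data/example.zarr#mode=zarr,noxarray,file
```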

DennisHeimbigner commented 2 years ago

I assume that the highest resolution is used by default in Xarray.

I would be surprised if that was so. It would be seen as violating shape constraints, I would think.

dzenanz commented 2 years ago
dzenan@corista:~/tester$ /usr/local/bin/ncdump -h file:///media/xeonatorC/Misc/Tester/CBCT_crop1ms.zarr#mode=zarr,noxarray,file
netcdf CBCT_crop1ms {
dimensions:
    _zdim_56 = 56 ;
    _zdim_63 = 63 ;
    _zdim_67 = 67 ;
    _zdim_28 = 28 ;
    _zdim_31 = 31 ;
    _zdim_33 = 33 ;
    _zdim_14 = 14 ;
    _zdim_15 = 15 ;
    _zdim_16 = 16 ;

group: scale0 {
NetCDF: internal library error; Please contact Unidata support
Location: file ; line 1661
dzenanz commented 2 years ago

@DennisHeimbigner you are right. xarray.open_zarr() yields:

<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
DennisHeimbigner commented 2 years ago

I would suggest two changes to the OME format:

  1. make the dimension-name <-> shape mapping be consistent
  2. Change the "direction" attribute from this: "direction": [ [ 1.0, 0.0, 0.0 ], [ 0.0, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ] to something like this: "direction": {"x": [1.0,0.0,0.0], "y": [0.0, 1.0, 0.0], "z": [0.0, 0.0, 1.0]}

There may be other attributes that also should be changed.
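The second suggestion above amounts to re-keying the direction matrix by axis name so each row is self-describing. A minimal sketch, assuming the ["x","y","z"] axis order used in the suggestion (nothing here is prescribed by OME or netcdf-c):

```python
# Sketch: convert the "direction" array-of-arrays into a dictionary
# keyed by axis name, as suggested above. Axis order is an assumption.

axes = ["x", "y", "z"]
direction = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

direction_by_axis = dict(zip(axes, direction))
print(direction_by_axis)
# -> {'x': [1.0, 0.0, 0.0], 'y': [0.0, 1.0, 0.0], 'z': [0.0, 0.0, 1.0]}
```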

joshmoore commented 2 years ago

I would suggest two changes to the OME format:

Suggestion noted, thanks, @DennisHeimbigner, but I don't think we should open that conversation here.

Leaving nczarr, s3, and http out of the picture for a moment, I'm still struggling to understand whether you think these file-based pure-zarr examples from the original description should be covered by netcdf-c, @DennisHeimbigner:

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tczyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/zyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

In my mind, they should be, especially if we are pointing the community to this as the C implementation.

DennisHeimbigner commented 2 years ago

....these file-based pure-zarr examples from the original description should be covered by netcdf-c,

Speaking for myself, I say no. The fact that they cannot be read by Xarray makes it clear that having an inconsistent _ARRAY_DIMENSIONS <-> shape mapping is just wrong, and needs to be fixed.

As for the case of

"direction": [ [ 1.0, 0.0, 0.0 ], [ 0.0, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ]

that is more a matter of opinion, and I really think that using a dictionary instead of an array of arrays provides more meaningful metadata. GDAL (https://gdal.org/drivers/raster/zarr.html), for instance, has non-standard attributes whose value is a dictionary. I think that should be the convention for non-standard attributes.

DennisHeimbigner commented 2 years ago

Partly fixed by https://github.com/Unidata/netcdf-c/pull/2492

joshmoore commented 2 years ago

Speaking for myself, I say no. The fact that they cannot be read by Xarray makes it clear that having an inconsistent _ARRAY_DIMENSIONS <-> shape mapping is just wrong, and needs to be fixed.

@DennisHeimbigner, to clarify: is the intent for netcdf-c to be considered a Zarr library (as opposed to an nczarr library)?

Partly fixed by #2492

@dzenanz: will you have a chance to test if the partial fix works for you?

dzenanz commented 2 years ago

I am finishing up a grant proposal today and tomorrow, I plan to test #2492 later in the week.

DennisHeimbigner commented 2 years ago

...to clarify: is the intent for netcdf-c to be considered a Zarr library (as opposed to an nczarr library)?

The primary goal is to read/write netCDF-4 datasets stored in Zarr format. But since netCDF-4 does not map exactly to Zarr v2, it is necessary to add optional extensions to Zarr to support netCDF-4 concepts.

On the other hand, we do not want to cut ourselves off from pure Zarr datasets. So, a sub-goal is to support mapping a subset of netcdf-4 to pure Zarr. This sub-goal allows us to access pure Zarr datasets thru the netCDF API.

joshmoore commented 2 years ago

Thanks for the explanation, @DennisHeimbigner. I definitely understand the primary goal. But is there also a sub-goal of mapping all pure Zarr datasets to some subset of netcdf-4?

DennisHeimbigner commented 2 years ago

...is there also a sub-goal of mapping all pure Zarr datasets to some subset of netcdf-4?

Short answer is yes. For example, I have just added support for the Zarr fixed-length string type. There are a couple of other things I will be adding when their specifications are stable: .zmetadata (aggregate metadata) and sharding. If there is something you would like to see that I have missed, then let me know.

joshmoore commented 2 years ago

I guess I defer there to @dzenanz' testing since in https://github.com/Unidata/netcdf-c/issues/2474#issuecomment-1211459122 there appear to be Zarr datasets that aren't openable. If it would help to copy them somewhere for testing, happy to do so.

dzenanz commented 2 years ago

Trying to read C:\Dev\ITKIOOMEZarrNGFF\v0.4\cyx.ome.zarr with the updated version of netCDF still produces the same error: NetCDF: NCZarr error when invoking nc_open. The same happens for the other files there (yx.ome.zarr and multi-image.ome.zarr). I am using a slightly modified netCDF: https://github.com/dzenanz/netcdf-c/tree/itkSpecific.

DennisHeimbigner commented 2 years ago

That is because the OME use of _ARRAY_DIMENSIONS is still wrong, and I cannot fix that particular problem.

dzenanz commented 2 years ago

Either removing the _ARRAY_DIMENSIONS attributes or leaving just one scale gets me further: NetCDF: Filter error: undefined filter encountered. That is understandable, since the array has:

"compressor": {
    "blocksize": 0,
    "clevel": 5,
    "cname": "lz4",
    "id": "blosc",
    "shuffle": 1
}
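This kind of failure can be anticipated by inspecting the compressor entry in the array's .zarray metadata before opening. A sketch using only the standard library; the "supported" set below is hypothetical, standing in for whatever filters a particular netcdf-c build was compiled with:

```python
import json

# Sketch: flag a compressor id the reader may not support. The
# "supported" set is an assumption for illustration; the real answer
# depends on which filters were enabled when netcdf-c was built.

zarray_fragment = """
{
  "compressor": {
    "blocksize": 0,
    "clevel": 5,
    "cname": "lz4",
    "id": "blosc",
    "shuffle": 1
  }
}
"""

supported = {"zlib", "shuffle"}  # hypothetical reader capabilities
compressor = json.loads(zarray_fragment)["compressor"]
if compressor and compressor["id"] not in supported:
    print(f"unsupported compressor: {compressor['id']}")
# -> unsupported compressor: blosc
```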
DennisHeimbigner commented 2 years ago

How did you install netcdf-c? Are you building it yourself?

dzenanz commented 2 years ago

To have the latest version, I have to build it myself. My build setup is here: https://github.com/InsightSoftwareConsortium/ITKIOOMEZarrNGFF/pull/6. Here I am trying to add zip support to the build: https://github.com/dzenanz/ITKIOOMEZarrNGFF/commit/0bbea2563204e8bb204387852ec4d41b465ccd01. The current stumbling block is here: https://github.com/dzenanz/ITKIOOMEZarrNGFF/commit/0bbea2563204e8bb204387852ec4d41b465ccd01#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR135-R137

DennisHeimbigner commented 2 years ago

So it looks like you are building with your own CMakeLists.txt and not using the one provided by netcdf, correct?

dzenanz commented 2 years ago

I modified netCDF's CMakeLists.txt in order to point to the HDF5 bundled with ITK, and to avoid some duplicate targets with other libraries I use alongside it.