fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Add support for H5Z_FILTER_SHUFFLE for GZIP/ZLIB compressed data #11

Closed: pbranson closed this issue 3 years ago

pbranson commented 3 years ago

I have started testing out this approach on some of the Australian Integrated Marine Observing System datasets that are stored in NetCDF format on S3 (e.g. s3://imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2014/08/A.P1D.20140801T053000Z.aust.K_490.nc).

This dataset utilises gzip compression with the H5Z_FILTER_SHUFFLE filter applied. I have looked over the upstream code dependencies and would appreciate some advice on the pathway forward.

Currently the numcodecs gzip and zlib codecs do not support a shuffle option. The code for the shuffle itself is straightforward, but it will likely be very slow in Python if not compiled: https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43 (note that numcodecs uses Cython for compiled code rather than numba).
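Roughly, the transform looks like this (a minimal NumPy sketch that is equivalent to, but not, the HSDS code linked above):

```python
import numpy as np

def shuffle(data: bytes, typesize: int) -> bytes:
    # view as (n_elements, typesize); emitting byte columns groups the i-th
    # byte of every element together, which tends to compress better
    arr = np.frombuffer(data, dtype=np.uint8).reshape(-1, typesize)
    return arr.T.tobytes()

def unshuffle(data: bytes, typesize: int) -> bytes:
    # inverse: view as (typesize, n_elements) and transpose back
    arr = np.frombuffer(data, dtype=np.uint8).reshape(typesize, -1)
    return arr.T.tobytes()

assert unshuffle(shuffle(b"12345678", typesize=4), typesize=4) == b"12345678"
```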

I am keen to help get this sorted and suggest one possible way forward could be:

I'd appreciate advice from some of the more experienced devs here (@martindurant, @ajelenak and @rabernat): do you think this is a reasonable way forward?

Thanks for everyone's effort on this!

martindurant commented 3 years ago

That sounds like a totally reasonable plan. I don't think that numcodecs requires you to use cython, but it may be convenient to keep to the same coding style. Whether this needs to be an option to the existing codec or to be a separate one is an open question.

As discussed in https://github.com/intake/fsspec-reference-maker/issues/9 , SZIP from libaec would be worth adding, as well as perhaps fletcher32 checksum (although I don't know how the checksums are stored)

PS: I had no idea that HDF (or HSDS) used numba internally. Probably numcodecs is not interested in taking numba as a dependency, but the question could be asked (I personally am in the zarr team, but have not contributed to numcodecs).

rabernat commented 3 years ago

Sounds like a great plan Paul! I'll be happy to help out on the numcodecs side.

ajelenak commented 3 years ago

The Blosc library does have the C code for byte and bit shuffling so that could be used as a starting point. I thought about doing this when working on the HDF5-Zarr translator but did not have enough time on that project to actually do it.

Another thing: the Zarr spec supports filters, so @pbranson's first option is probably the best approach. The shuffle would be declared as a filter and would be available to any compression method.
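For example, the array metadata would compose the two stages like this (a sketch only; Shuffle stands for the codec being proposed in this thread, with an assumed elementsize parameter, since numcodecs does not ship one yet):

```python
import zarr
import numcodecs

# the shuffle runs as a filter before compression (and after, on read),
# so any compressor can follow it
z = zarr.zeros(
    (1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    filters=[numcodecs.Shuffle(elementsize=4)],  # proposed shuffle codec
    compressor=numcodecs.GZip(level=5),
)
```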

pbranson commented 3 years ago

Thanks for the advice and the suggestions about the filters and reusing the blosc shuffle - there is plenty of hardware optimisation there that is worth leveraging.

I have dug through the blosc code and found where the byte shuffle is implemented: blosc_internal_shuffle(typesize, blocksize, src, tmp); https://github.com/Blosc/c-blosc/blob/9fae1c9acb659159321aca69aefcdbce663e2374/blosc/shuffle.c#L374

The shuffle.h header isn't exposed in numcodecs yet. I haven't done any Cython development work, but it seems the headers can be exposed via a cdef extern declaration like this: https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/blosc.pyx#L21

And the zarr filters are just calls to numcodecs codecs, so I could make a Shuffle(Codec) that uses the blosc shuffle in a similar way to blosc.pyx, and then we don't need to change the gzip etc. codecs.
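Something like this, as a hypothetical pure-Python sketch (the real codec would call the compiled blosc shuffle through Cython, as blosc.pyx does):

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray

class Shuffle(Codec):
    """Byte shuffle as a standalone filter, independent of any compressor."""

    codec_id = "shuffle"

    def __init__(self, elementsize=4):
        self.elementsize = elementsize

    def encode(self, buf):
        arr = ensure_ndarray(buf).reshape(-1).view("u1")
        return arr.reshape(-1, self.elementsize).T.tobytes()

    def decode(self, buf, out=None):
        # a full implementation would also fill `out` in place when provided
        arr = ensure_ndarray(buf).reshape(-1).view("u1")
        return arr.reshape(self.elementsize, -1).T.tobytes()
```

Registering it with numcodecs.register_codec(Shuffle) would then let zarr resolve it from array metadata by its codec_id.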

Have updated the checklist with a revised plan and will see about raising the issue at numcodecs.

martindurant commented 3 years ago

I think the suggestion may have been that you can already use the blosc compressor, as wrapped by numcodecs, to do gzip/shuffle without any extra code: numcodecs.blosc.Blosc("gzip", compression_level, shuffle, blocksize), where shuffle is one of:

  • 0: NOSHUFFLE
  • 1: SHUFFLE
  • 2: BITSHUFFLE
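In keyword form that is roughly (a usage sketch; note blosc names its deflate-based codec "zlib", which is what HDF5's gzip filter actually produces):

```python
from numcodecs import Blosc

# Blosc.NOSHUFFLE == 0, Blosc.SHUFFLE == 1, Blosc.BITSHUFFLE == 2
codec = Blosc(cname="zlib", clevel=5, shuffle=Blosc.SHUFFLE, blocksize=0)
enc = codec.encode(b"\x00" * 4096)  # note: blosc prepends its own chunk header
dec = codec.decode(enc)
```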

pbranson commented 3 years ago

Oh! I didn't realise that Blosc also provided a number of different compressors. I think I will continue with the addition of the shuffle filter; it probably won't perform as well as maintaining the context for the shuffle operation within C, but it adds some flexibility for the other numcodecs codecs.

Will test the Blosc gzip approach in this library.

martindurant commented 3 years ago

Will test the Blosc gzip approach in this library.

OK, let us know! Note that, if you are running any tests with Dask and multiple threads, you may want to set the number of blosc threads to 1, e.g., blosc.set_nthreads(1).

Looking through the blosc documentation now, I don't see a way to turn OFF the blosc-specific header data in the comp/decompression functions currently available to python.

ajelenak commented 3 years ago

Decompressing HDF5 dataset chunks using Blosc gzip and shuffle did not work for me exactly because of that 16-byte Blosc header. That's why I suggested shuffle as a Zarr filter.

martindurant commented 3 years ago

Annoying, right? We could propose that blosc expose deeper functionality to Python...

martindurant commented 3 years ago

@pbranson , any luck with this?

@ajelenak : it does occur to me that it probably is very easy to create the blosc header for each chunk to be decompressed. That could be a super-simple numcodecs filter, or a custom part of a zarr store class. Even fsspec could do this, but it's a less obvious fit there.
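The 16-byte blosc v1 header would be synthesised roughly like this (a sketch; the field layout and constants are my reading of the c-blosc source comments and should be verified, and a real blosc chunk can also carry per-block start offsets after the header, so prepending a header alone may not be enough):

```python
import struct

def blosc1_header(typesize: int, nbytes: int, blocksize: int, cbytes: int,
                  shuffle: bool = True) -> bytes:
    # flags byte: bit 0 = byte shuffle, bits 5-7 = compressor format code
    # (zlib is assumed to be code 3 here)
    flags = (0x01 if shuffle else 0x00) | (3 << 5)
    return struct.pack(
        "<BBBBIII",
        2,            # blosc format version (assumption)
        1,            # compressor sub-format version (assumption)
        flags,
        typesize,     # element size in bytes
        nbytes,       # uncompressed size of the chunk
        blocksize,    # blosc-internal block size
        16 + cbytes,  # total compressed size, including this header
    )
```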

ajelenak commented 3 years ago

I am not familiar with the Blosc header, but it should be possible to do something like that. I think a proper shuffle filter is still the better long-term approach.

pbranson commented 3 years ago

After understanding that blosc adds a header, I stopped pursuing that approach.

I have made a start on setting up the numcodecs Cython and classes, but I'm getting swamped with a pre-Christmas work crunch.

I need to test whether the full blosc context/init is required to use the shuffle functions; I am not sure that it is, which would be convenient.

I can see how adding a blosc header filter could also be an option, but would that still require a memory copy operation to make the data contiguous before sending it to c-blosc? In that case it is possibly faster to do the shuffle directly.

Hopefully get back to it tomorrow.

martindurant commented 3 years ago

would that still require a memory copy operation

Yes, I suppose so, unless we somehow read the bytes data into a buffer with an initial offset of two bytes, and then fill in the header after reading. Of course, the point is that the compressed data should be much smaller than the final data buffer, so maybe a copy isn't so bad.

pbranson commented 3 years ago

OK I have provided an update of my progress here: https://github.com/zarr-developers/numcodecs/issues/260

Having trouble with incompatible pointer types (I think)

it does occur to me that it probably is very easy to create the blosc header for each chunk to be decompressed

Possibly that is an easier way to go. Is it the header format documented here: https://blosc-doc.readthedocs.io/en/latest/frame_format.html?highlight=chunk%20format#description-for-different-fields-in-header ?

martindurant commented 3 years ago

Blosc 2 is not in use anywhere, but its documentation has completely taken over from Blosc v1! I'm pretty sure it's two bytes, giving the compression type, shuffle, and size. I swear I saw the spec recently.

pbranson commented 3 years ago

Oh and I thought it was just me struggling to find it...!

pbranson commented 3 years ago

OK, I have finally got back to this: I have implemented the shuffle with numba and raised a PR to get some feedback from the numcodecs devs.
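The core of the decode direction looks roughly like this (a sketch in the spirit of that PR, not the PR code itself):

```python
import numba
import numpy as np

@numba.njit
def _unshuffle(src, dst, typesize):
    # shuffled layout: src[i * n + j] holds byte i of element j
    n = dst.size // typesize
    for i in range(typesize):
        for j in range(n):
            dst[j * typesize + i] = src[i * n + j]

def unshuffle(data: bytes, typesize: int) -> bytes:
    src = np.frombuffer(data, dtype=np.uint8)
    dst = np.empty_like(src)
    _unshuffle(src, dst, typesize)
    return dst.tobytes()
```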

I have updated my fork of fsspec_reference_maker to add support for shuffle and tested it out in this gist: https://gist.github.com/pbranson/6edbe1e3c4214e6b826f4e82e5df55fb It's not terribly scientific, but I saw a 2x speed-up over using OPeNDAP from my laptop on the uni network (there appeared to be some caching going on when I used %%timeit).
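The test boils down to opening the generated reference set with fsspec and xarray, roughly like this (a sketch; the file name and storage options are illustrative, not copied from the gist):

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="references.json",           # JSON produced by fsspec_reference_maker
    remote_protocol="s3",
    remote_options={"anon": True},  # assumes imos-data is a public bucket
)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
```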

Also it seems that these lines in the main branch have been commented out by mistake? https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/hdf.py#L101 This prevents writing the zarr-related dictionary entries (.zgroup etc.) to the JSON.

martindurant commented 3 years ago

Also it seems that these lines in the main branch have been commented out by mistake?

To be checked, but I don't think we need those; everything is in chunkstore.

martindurant commented 3 years ago

Does numcodecs support numba? Or are you suggesting this as a filter step somewhere else in the pipeline?

pbranson commented 3 years ago

Does numcodecs support numba?

Not at present, but I have asked about it in https://github.com/zarr-developers/numcodecs/pull/273

It seems, according to https://github.com/zarr-developers/numcodecs/pull/274, that they are moving towards a more pure-Python approach (for blosc at least), so maybe they would consider adding numba? Should I ping someone specifically in that PR?

Or are you suggesting this as a filter step somewhere else in the pipeline?

Probably not; it seems logical for the shuffle to live in numcodecs, but maybe it should be Cython rather than numba.

martindurant commented 3 years ago

I suppose the usual maintainers will notice, but I doubt that numba as a dependency is a great fit. We'll see. Cython would fit better for the case where existing Python bindings to some compiled code don't exist.