loichuder opened this issue 4 years ago
As you can imagine, HSDS doesn't support pluggable filters (I imagine security people wouldn't be happy with clients injecting code onto the server for one thing...).
So any supported filter needs to be implemented in HSDS. The nice aspect of this is that clients can utilize any supported filter without change on their part.
I'd be open to a PR to support bitshuffle in HSDS. The shuffle filter is implemented here: https://github.com/HDFGroup/hsds/blob/master/hsds/util/storUtil.py:L43. You can see it's not much code; I imagine it wouldn't be too hard to do something similar for bitshuffle.
You'll note the shuffle filter uses numba to speed up the shuffle/unshuffle. If I remember correctly, it's about 100x faster than the non-jitted code. Ideally it would be nice to have a Cython implementation for the filters. That should be even faster.
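For illustration, a minimal sketch of what a numba-jitted byte shuffle can look like (this is just the general idea of grouping the i-th byte of every element together, not the actual storUtil.py code):

import numpy as np
from numba import njit

@njit
def shuffle_bytes(buf, itemsize):
    # Byte transposition: gather the j-th byte of every element together,
    # which typically makes the buffer more compressible downstream.
    nelem = buf.size // itemsize
    out = np.empty_like(buf)
    for i in range(nelem):
        for j in range(itemsize):
            out[j * nelem + i] = buf[i * itemsize + j]
    return out

# Example: shuffle a float64 array viewed as raw bytes.
data = np.arange(16, dtype=np.float64)
shuffled = shuffle_bytes(data.view(np.uint8), data.dtype.itemsize)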
Hi,
The HDF5 bitshuffle filter is actually bitshuffle+lz4 compression in a single step, and it is seen as a compression filter from an HDF5 point of view. So, from a look at the code, supporting it would look more like the existing support for blosc and zlib (https://github.com/HDFGroup/hsds/blob/2d959f3f787663735fef1d0f7514e5c3e62178b5/hsds/util/storUtil.py#L223).
Hey, @ajelenak has updated the code to use the numcodecs package for unshuffling (vs. the original code that was using Python with numba): https://github.com/HDFGroup/hsds/commit/f44f0718648be4559faeb85e8d4e0c07d348d858. If nothing else, this has reduced the size of the docker image by a factor of 4. Let us know if this is working with your HDF5 shuffled datasets!
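For reference, a quick sketch of using numcodecs' Shuffle codec for shuffling/unshuffling (assuming the numcodecs package is installed; this is only an illustration, not the exact storUtil.py code):

import numpy as np
from numcodecs import Shuffle

data = np.arange(100, dtype=np.float64)
codec = Shuffle(elementsize=data.dtype.itemsize)
shuffled = codec.encode(data)   # byte-shuffled buffer
restored = np.frombuffer(codec.decode(shuffled), dtype=data.dtype)
assert np.array_equal(data, restored)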
Hi,
@loichuder tested with the main branch and it does not solve this issue.
The issue is that the dataset does not use the shuffle filter built into HDF5, but the bitshuffle+LZ4 compression HDF5 plugin.
Still, shrinking the docker image by a factor of 4 is good!
Ah, sorry, I was thinking regular shuffle, not bitshuffle. We'll look into adding bitshuffle support. Do you have a sample file we can use for testing?
Hi,
Since it is used by Dectris detectors, there are some datasets available on Zenodo, e.g.: https://zenodo.org/record/2529979/files/thaumatin_9_1_000003.h5?download=1
That would definitely be nice to have, but don't add it especially for us: it will be complicated for us to use HSDS in production in the near future, since we can have HDF5 files with millions of entries and I expect it will be an issue to store that structure in a POSIX file system.
Best,
I don't know exactly what fixed it (perhaps #90) but serving bitshuffle compressed datasets now works :slightly_smiling_face:
Nevermind, I went too fast: I looked at uncompressed datasets believing that they were compressed :neutral_face: ... Sorry for the useless ping
Supporting the Bitshuffle filter in HSDS will require some effort because both the bitshuffle and hdf5plugin packages are geared towards h5py and the HDF5 library. My understanding is Bitshuffle consists of two independent operations: bit shuffling, and then, LZ4 compression. Hopefully there is a way to implement those some other way, and use the HDF5 filter cd_values[] information to correctly decompress the data in HSDS.
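For illustration, here is a hedged sketch of how a dataset's filter pipeline (including cd_values) can be inspected with h5py's low-level API; the file and dataset names below are just placeholders:

import h5py

with h5py.File("some_file.h5", "r") as f:
    plist = f["some_dataset"].id.get_create_plist()
    for i in range(plist.get_nfilters()):
        # Each entry is (filter_id, flags, cd_values, name);
        # the HDF5 bitshuffle plugin uses filter id 32008.
        filter_id, flags, cd_values, name = plist.get_filter(i)
        print(filter_id, name, cd_values)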
My understanding is Bitshuffle consists of two independent operations: bit shuffling, and then, LZ4 compression.
BTW, the LZ4 compression is optional.
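To illustrate that point, a small sketch using the bitshuffle package's standalone functions, i.e. the bit transposition without any LZ4 step (assuming the bitshuffle Python package is available):

import numpy as np
import bitshuffle

data = np.arange(100, dtype=np.float64)
shuffled = bitshuffle.bitshuffle(data)      # bit transposition only, no LZ4
restored = bitshuffle.bitunshuffle(shuffled)
assert data.tobytes() == restored.tobytes()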
Here is a function I use to decompress bitshuffle+LZ4 compressed chunks when using HDF5 direct chunk read. It uses the bitshuffle Python package:
import struct
import numpy
import bitshuffle

def decompress_bslz4_chunk(payload, dtype, chunk_shape):
    total_nbytes, block_nbytes = struct.unpack(">QI", payload[:12])
    block_size = block_nbytes // dtype.itemsize
    arr = numpy.frombuffer(payload, dtype=numpy.uint8, offset=12)  # No copy here
    chunk_data = bitshuffle.decompress_lz4(arr, chunk_shape, dtype, block_size)
    return chunk_data
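A possible usage sketch with h5py's direct chunk read (file and dataset names are placeholders; assumes h5py >= 2.10 for read_direct_chunk):

import h5py

with h5py.File("some_file.h5", "r") as f:
    dset = f["some_dataset"]
    # read_direct_chunk returns the raw, still-compressed chunk bytes,
    # bypassing the HDF5 filter pipeline.
    filter_mask, payload = dset.id.read_direct_chunk((0, 0))
    chunk = decompress_bslz4_chunk(payload, dset.dtype, dset.chunks)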
Do you know if the bitshuffle package can be installed or built without h5py and HDF5 library? I did not notice that option.
I don't think so; the HDF5 compression filter should be optional, but not the bitshuffle.h5 module.
That's my understanding, too. Thanks!
If someone would like to create a package that could do bitshuffle without the HDF library dependency, that would be much appreciated!
Bitshuffle should now be working in HSDS! I've put in updates to use the bitshuffle package and apply bitshuffle+lz4 compression (so in the filter pipeline, it's sufficient to use just the bitshuffle filter and no compressor). Chunks using bitshuffle compression will have a 12-byte header with an 8-byte chunk size followed by the sub-block size in bytes (this is the same scheme that HDF5 files with bitshuffle seem to use).
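To make the header layout concrete, a small sketch of packing and parsing such a 12-byte header with struct (the sizes are just illustrative values; the ">QI" layout matches the decode function shown earlier in this thread):

import struct

total_nbytes = 31752  # uncompressed chunk size in bytes (example value)
block_nbytes = 8192   # sub-block size in bytes (example value)
header = struct.pack(">QI", total_nbytes, block_nbytes)  # 8-byte size + 4-byte block size
assert len(header) == 12
size, block = struct.unpack(">QI", header)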
In h5pyd (version 0.17.0 or later), you can use the --link option so the HSDS domain will read from the bitshuffled chunks in the HDF5 file as needed. E.g.: hsload --link s3://hdf5.sample/data/hdf5test/bitshuffle.h5 /myhsdsfolder/
If anyone has time to try this out with their HDF5 bitshuffled data, it would be most appreciated! Please let me know if you have questions or comments.
I'll leave this issue open for a bit in case any problems come up.
Gave it a try but I get an error when running hsload on a file with bitshuffled data inside:
ERROR 2023-12-01 09:38:33,615 POST error: 400
ERROR 2023-12-01 09:38:33,615 POST error - status_code: 400, reason: filter H5Z_FILTER_BITSHUFFLE not recognized
ERROR 2023-12-01 09:38:33,615 ERROR: failed to create dataset: filter H5Z_FILTER_BITSHUFFLE not recognized
Traceback (most recent call last):
File ".../venv/bin/hsload", line 11, in <module>
load_entry_point('h5pyd', 'console_scripts', 'hsload')()
File ".../h5pyd/h5pyd/_apps/hsload.py", line 290, in main
load_file(fin, fout, **kwargs)
File ".../h5pyd/h5pyd/_apps/utillib.py", line 1619, in load_file
fin.visititems(object_create_helper)
File ".../venv/lib/python3.8/site-packages/h5py/_hl/group.py", line 636, in visititems
return h5o.visit(self.id, proxy)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 355, in h5py.h5o.visit
File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
File ".../venv/lib/python3.8/site-packages/h5py/_hl/group.py", line 635, in proxy
return func(name, self[name])
File ".../h5pyd/h5pyd/_apps/utillib.py", line 1582, in object_create_helper
create_dataset(obj, ctx)
File ".../h5pyd/h5pyd/_apps/utillib.py", line 1197, in create_dataset
dset = fout.create_dataset(dobj.name, **kwargs)
File ".../h5pyd/h5pyd/_hl/group.py", line 381, in create_dataset
dsid = dataset.make_new_dset(self, shape=shape, dtype=dtype, data=data, **kwds)
File ".../h5pyd/h5pyd/_hl/dataset.py", line 299, in make_new_dset
rsp = parent.POST(req, body=body)
File ".../h5pyd/h5pyd/_hl/base.py", line 1043, in POST
raise IOError(rsp.reason)
OSError: filter H5Z_FILTER_BITSHUFFLE not recognized
I used the latest h5pyd version (https://github.com/HDFGroup/h5pyd/commit/38278c8399564e2eeafaa5f44ecabba23df99ecd) to run hsload --link on a file that contains several compressed datasets:
import h5py
import numpy as np
import hdf5plugin

with h5py.File("compressed.h5", "w") as h5file:
    data = np.random.random((1000, 1000))
    h5file.create_dataset("gzip", data=data, compression="gzip")
    h5file.create_dataset("szip", data=data, compression="szip")
    h5file.create_dataset("scaleoffset", data=data, scaleoffset=4)
    h5file.create_dataset(
        "gzip_shuffle", data=data, compression="gzip", shuffle=True
    )
    h5file.create_dataset("bitshuffled", data=data, **hdf5plugin.Bitshuffle())
Thanks for trying it! Do you have the latest HSDS (version 0.8.5)? hsinfo will show the server version.
hsinfo shows server version: 0.7.0beta. But I pulled the latest version from this repo and used runall.sh to launch the service, so it should use the latest version, no?
No, runall.sh will just run the last build. You need to do a build.sh first. :)
Ah yes. Sorry, it has been a while :sweat_smile:
All right, I could do the hsload --link with the updated version.
However, the data node seems to crash when trying to uncompress the data:
1701424543.784 DEBUG> _uncompress(compressor=None, shuffle=2)
1701424543.784 DEBUG> got bitshuffle header - total_nbytes: 31752, block_nbytes: 8192, block_size: 1024
/entrypoint.sh: line 27: 7 Illegal instruction (core dumped) hsds-datanode
The nbytes and block_nbytes seem reasonable at least. Strange that you got an illegal instruction error. Is HSDS running on x86 or ARM? I tried loading the zenodo file you had above - that worked for me. Is the file you used available for downloading?
Running on x86_64.
I guess it is because I tried to specify a slice when getting the data (via select)? Can you try that on your end?
EDIT: No, it is unrelated, sorry. Here is the file I used: https://cloud.esrf.fr/s/btA8C4aB8C9YMLH
@loichuder - I'm able to download the file you used, but seems like it's not an HDF5 file:
$ wget https://cloud.esrf.fr/s/btA8C4aB8C9YMLH
--2023-12-03 16:29:09-- https://cloud.esrf.fr/s/btA8C4aB8C9YMLH
Resolving cloud.esrf.fr (cloud.esrf.fr)... 193.49.43.142
Connecting to cloud.esrf.fr (cloud.esrf.fr)|193.49.43.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27860 (27K) [text/html]
Saving to: ‘btA8C4aB8C9YMLH’
btA8C4aB8C9YMLH 100%[===========================================>] 27.21K 105KB/s in 0.3s
2023-12-03 16:29:11 (105 KB/s) - ‘btA8C4aB8C9YMLH’ saved [27860/27860]
$ h5ls -r btA8C4aB8C9YMLH
btA8C4aB8C9YMLH: unable to open file
Yeah, it seems wget https://cloud.esrf.fr/s/btA8C4aB8C9YMLH gets you an HTML file. You can do wget https://cloud.esrf.fr/s/btA8C4aB8C9YMLH/download/compressed.h5 instead.
ok - I got that and it seems to load ok.
Here's what I did:
Copied the file to s3: aws s3 cp compressed.h5 s3://hdf5.sample/data/hdf5test/bitshuffle3.h5
Ran hsload on it: hsload --link s3://hdf5.sample/data/hdf5test/bitshuffle3.h5 /home/test_user1/
Checked out the loaded file:
$ python
>>> import h5pyd
>>> f = h5pyd.File("/home/test_user1/bitshuffle3.h5")
>>> list(f)
['bitshuffled']
>>> dset = f["bitshuffled"]
>>> dset
<HDF5 dataset "bitshuffled": shape (1000, 1000), type "<f8">
>>> dset._filters
[{'class': 'H5Z_FILTER_BITSHUFFLE', 'id': 32008, 'name': 'bitshuffle'}]
>>> data = dset[:,:]
>>> data[0,:10]
array([0.64140836, 0.49246921, 0.0550563 , 0.7508278 , 0.24011797,
0.55795149, 0.70128625, 0.50743519, 0.98187454, 0.28679516])
>>> import hashlib
>>> hash = hashlib.md5(data.tobytes())
>>> hash.hexdigest()
'1e833ef8a30ecad36ddd94f24d0cbe31'
Do you see the same content when you open the file with h5py?
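For example, a quick way to compare on your side (assuming hdf5plugin is installed so h5py can read the bitshuffled dataset) could be:

import hashlib
import h5py
import hdf5plugin  # registers the bitshuffle filter for h5py

with h5py.File("compressed.h5", "r") as f:
    local_data = f["bitshuffled"][:, :]
print(hashlib.md5(local_data.tobytes()).hexdigest())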
I get the same thing as you, up to the part data = dset[:,:]. This hangs and finally raises a TimeoutError:
>>> data = dset[:,:]
Traceback (most recent call last):
File ".../python3.8/site-packages/urllib3/response.py", line 443, in _error_catcher
yield
File ".../python3.8/site-packages/urllib3/response.py", line 566, in read
data = self._fp_read(amt) if not fp_closed else b""
File ".../python3.8/site-packages/urllib3/response.py", line 532, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/usr/lib/python3.8/http/client.py", line 459, in read
n = self.readinto(b)
File "/usr/lib/python3.8/http/client.py", line 503, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".../python3.8/site-packages/requests/models.py", line 816, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File ".../python3.8/site-packages/urllib3/response.py", line 627, in stream
data = self.read(amt=amt, decode_content=decode_content)
File ".../python3.8/site-packages/urllib3/response.py", line 592, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File ".../python3.8/site-packages/urllib3/response.py", line 448, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='...', port=8000): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".../h5pyd/h5pyd/_hl/dataset.py", line 1160, in __getitem__
rsp = self.GET(req, params=params, format="binary")
File ".../h5pyd/h5pyd/_hl/base.py", line 975, in GET
for http_chunk in rsp.iter_content(chunk_size=HTTP_CHUNK_SIZE):
File ".../python3.8/site-packages/requests/models.py", line 822, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='...', port=8000): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../h5pyd/h5pyd/_hl/dataset.py", line 1169, in __getitem__
raise IOError(f"Error retrieving data: {ioe.errno}")
OSError: Error retrieving data: None
I guess it's progress that you at least got to the dset[:,:].
I wonder if the timeout is unrelated to bitshuffle, and just an effect of trying to fetch a large block of data in one request.
Could you change the "data = dset[:,:]" to use a chunk iterator instead?
Like this:
data = np.zeros(dset.shape, dtype=dset.dtype)
for s in dset.iter_chunks():
    data[s] = dset[s]
It takes quite some time but I get another error:
Traceback (most recent call last):
File ".../h5pyd/h5pyd/_hl/dataset.py", line 1160, in __getitem__
rsp = self.GET(req, params=params, format="binary")
File ".../h5pyd/h5pyd/_hl/base.py", line 981, in GET
raise IOError("no data returned")
OSError: no data returned
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File ".../h5pyd/h5pyd/_hl/dataset.py", line 1169, in __getitem__
raise IOError(f"Error retrieving data: {ioe.errno}")
OSError: Error retrieving data: None
Not sure if this is related, but import h5pyd also takes a while to resolve in the Python console.
Could you take a look at the docker logs and see if anything obvious shows up?
Good call, the service node indeed has an error:
1702469460.055 WARN> 503 error for http_get_Json http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
...
1702469460.056 ERROR> got <class 'aiohttp.web_exceptions.HTTPInternalServerError'> exception doing getSelectionData: Internal Server Error
...
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 332, in data_received
messages, upgraded, tail = self._request_parser.feed_data(data)
File "aiohttp/_http_parser.pyx", line 557, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadStatusLine: 400, message:
Bad status line "Invalid method encountered:\n\n b''\n ^"
See below for a more complete picture:
1702469460.050 INFO> read_chunk_hyperslab, chunk_id: c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0, bucket: files
1702469460.050 DEBUG> using chunk_map entry for c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0: {'s3path': 'files/compressed.h5', 's3offset': 4016, 's3size': 27699, 'chunk_sel': [slice(0, 63, 1), slice(0, 63, 1)], 'data_sel': (slice(0, 63, 1), slice(0, 63, 1))}
1702469460.050 DEBUG> read_chunk_hyperslab - chunk_sel: [slice(0, 63, 1), slice(0, 63, 1)]
1702469460.050 DEBUG> read_chunk_hyperslab - data_sel: (slice(0, 63, 1), slice(0, 63, 1))
1702469460.050 DEBUG> hyperslab selection - chunk_shape: [63, 63]
1702469460.051 DEBUG> getNodeCount for dn_urls: ['http://172.22.0.4:6101', 'http://172.22.0.5:6101', 'http://172.22.0.6:6101', 'http://172.22.0.7:6101']
1702469460.051 DEBUG> got dn_url: http://172.22.0.4:6101 for obj_id: c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.051 DEBUG> read_chunk_hyperslab - GET chunk req: http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.051 DEBUG> params: {'s3path': 'files/compressed.h5', 's3offset': 4016, 's3size': 27699, 'bucket': 'files', 'select': '[0:63,0:63]'}
1702469460.051 INFO> http_get('http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0')
1702469460.055 INFO> http_get status: 503 for req: http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.055 WARN> 503 error for http_get_Json http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.055 WARN> HTTPServiceUnavailable for read_chunk_hyperslab(c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0): Service Unavailable
1702469460.056 ERROR> ChunkCrawler action: read_chunk_hyperslab failed after: 7 retries
1702469460.056 INFO> ChunkCrawler - worker status for chunk c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0: 503
1702469460.056 DEBUG> ChunkCrawler - task c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0 start: 1702469445.763 elapsed: 14.293
1702469460.056 INFO> ChunkCrawler - join complete - count: 1
1702469460.056 DEBUG> ChunkCrawler - workers canceled
1702469460.056 INFO> returning chunk_status: 503 for chunk: c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.056 INFO> doReadSelection complete - status: 503
1702469460.056 INFO> doReadSelection raising HTTPInternalServerError for status: 503
1702469460.056 ERROR> gotexception doing getSelectionData: Internal Server Error
1702469460.056 INFO> streaming data for 1 pages complete, 0 bytes written
1702469460.056 DEBUG> ChunkCrawler - worker has been cancelled
Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 332, in data_received
messages, upgraded, tail = self._request_parser.feed_data(data)
File "aiohttp/_http_parser.pyx", line 557, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadStatusLine: 400, message:
Bad status line "Invalid method encountered:\n\n b''\n ^"
I suspect the real error is occurring in one of the DN containers. Could you take a look at the DN logs as well?
To make life easier when debugging this kind of issue, I will usually start the server with ./runall.sh 1, so there's just one DN container. That way there's only one DN log file to look at.
Hey @loichuder - have you run into problems using bitshuffle with packages that require numpy >= 2.0? It seems that the bitshuffle repo hasn't been updated in quite a while and this is causing problems moving to the new numpy release. See: https://github.com/HDFGroup/hsds/issues/378.
I am trying to serve POSIX files that contain datasets compressed with the bitshuffle filter.
hsload --link works without any trouble, as do requests to metadata and uncompressed datasets. However, requests to the compressed datasets fail with errors in the datanode.
How should I proceed to be able to request such datasets? Given that I do hsload --link, should I look into HSDS rather than h5pyd?