Open ExpandingMan opened 5 years ago
Can you be more specific? E.g., do you want to do something like this?
s3_bucket_with_hdf5_file_in_it = S3.openbucket()
h5 = HDF5.h5open(s3_bucket_with_hdf5_file_in_it, "rw")
h5["a"]=4
x = h5["b"]
In that case it would involve writing to an IOBuffer object, taking the Vector{UInt8} buffer and sending it to AWS via HTTP. Something like
io = IOBuffer()
get_data_from_s3!(s3, io)
h5 = h5open(io)
I suppose.
Hi @ExpandingMan. Did you ever manage to find a solution to being able to read HDF5 files from an S3 object store? The HDF5/S3 connector may be useful here, but I'm not sure if this has since been solved in a different way. Thanks!
The canonical way would be to use HDF5 virtual file drivers. https://docs.hdfgroup.org/hdf5/v1_12/_v_f_l.html
@mkitti - that’s what I suspected. Any idea if this is available/integrated with HDF5.jl? My understanding is virtual file drivers need to be selected at HDF5 compile-time. I presume we’ll need some changes from HDF5_jll to get this support?
We started adding support for drivers here: https://github.com/JuliaIO/HDF5.jl/blob/master/src/drivers/drivers.jl
We may be able to use the Core driver to read an I/O stream completely into memory and use that.
I haven't had many occasions to use HDF5, but when I did I was resorting to temp files, which is certainly not ideal. On Linux it's very easy to do this all in memory (you can use /dev/shm or another in-memory directory), but there is probably still a lot of overhead to that, so it's in no way an ideal solution.
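The temp-file workaround described above can be sketched as follows. This is a minimal sketch, not part of HDF5.jl: the helper name `read_dataset_via_tempfile` and its `dir` keyword are hypothetical, and it assumes the bytes of a complete HDF5 file have already been fetched (for example from S3) into a `Vector{UInt8}`.

```julia
using HDF5

# Hypothetical helper: write the bytes to a temporary file, preferably on an
# in-memory filesystem such as /dev/shm, then read one dataset with HDF5.jl.
function read_dataset_via_tempfile(bytes::Vector{UInt8}, dataset::AbstractString;
                                   dir::AbstractString = isdir("/dev/shm") ? "/dev/shm" : tempdir())
    path, io = mktemp(dir)
    try
        write(io, bytes)
        close(io)                      # make sure everything is on "disk" before HDF5 opens it
        return h5open(path, "r") do file
            read(file, dataset)        # reads the whole dataset into memory
        end
    finally
        isopen(io) && close(io)
        rm(path; force=true)           # always remove the temporary file
    end
end
```

The data is still copied once into the temporary file and once more when the dataset is read, which is the overhead mentioned above.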
At the very least, could this library support the ROS3 driver written by the HDF Group? Perhaps following this Python-equivalent PR: https://github.com/h5py/h5py/pull/1755. I recommend aiming for the following interface:
h5open(s3path; driver=Drivers.ROS3()) do file
    file
end
The ROS3 driver seems quite distinct from the rest of the issue. Could you create a new issue, please?
Hello, I am trying to send and receive some HDF5 files via network without writing to a file. I think the only way to do this would be to read them from a generic IO or from a byte array. Is there any update on this? Can this be done at this point?
I think we might be able to do this with H5FD_CORE, via HDF5.Drivers.Core and HDF5.API.h5p_set_file_image.
Basically, I think we have exposed the underlying low-level C API to do this in Julia, but have not created a high level API for this.
Thanks. Unfortunately I am not familiar with the low-level API at all, but I'll see if I can get this to work somehow. I found that h5py supports this already; maybe one day this could work in Julia as well by just passing an IO object instead of a filename? From https://docs.h5py.org/en/stable/high/file.html?highlight=driver#h5py.File.driver:
"""Create an HDF5 file in memory and retrieve the raw bytes
This could be used, for instance, in a server producing small HDF5
files on demand.
"""
import io
import h5py
bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f['dataset'] = range(10)
data = bio.getvalue() # data is a regular Python bytes object.
print("Total size:", len(data))
print("First bytes:", data[:10])
I looked into how they implemented that. They implemented a virtual file driver: https://github.com/h5py/h5py/blob/2e95e93b1331fd6b9c43dea38c863642624d319c/h5py/h5fd.pyx#L87-L101
In [77]: with h5py.File(bio, "w") as f:
    ...:     print(f._id.get_access_plist().get_driver())
    ...:
576460752303423496
In [78]: h5py.h5fd.fileobj_driver
Out[78]: 576460752303423496
This is a bit overkill if all you need to do is read it into memory though.
Would implementing something like in h5py make it to the roadmap of this package for the near future? We have a project that would require such data over network in the future and HDF5 was used previously. An alternative might be to send the data flattened as vectors in the Arrow format though.
> We have a project that would require such data over network in the future and HDF5 was used previously
Have you considered the ROS3 (read only S3) driver? Do you need write capability over the network as well?
Another approach is detailed here: https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935
It might be good to fully understand what you mean by network here and what your requirements are for access. Are you using chunked datasets with compression? Is read-only OK, or do you need read-write? To what degree does this have to scale?
> Would implementing something like in h5py make it to the roadmap of this package for the near future?
The custom file driver approach does not seem very hard to do. It's actually significantly easier to do from Julia, so it's mainly about time and priorities.
We're gearing up for a 0.17 breaking release, so that's where my focus is at the moment.
A simple alternative would be to write to a RAM disk, then copy it over.
Our application is an R&D project that involves a line scanner (basically laser + high-def camera), and reading from the network would be enough as a first step. The data is collected using a C++ program and then distributed to consumers in a batch approximately every second over MQTT (local network only). Some analysis of the collected batch is then to be done in Julia and the result forwarded using MQTT again. The data would be 4 matrices of around 300x4000 Float32 values every second; we wanted to use HDF5 files with blosc-lz4 compression.
The C++ part is still to be adapted anyway; I am currently working on the specification where I describe this data exchange. I am leaning towards Arrow tbh, but I will try a bit more with HDF5 as this was used in a similar project. Maybe PyCall and h5py would also be a solution.
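For scale, the uncompressed payload described above can be worked out directly. This is only back-of-the-envelope arithmetic using the figures quoted in the comment (4 matrices, about 300x4000, Float32, one batch per second); the variable names are illustrative and the compressed size would depend on the data.

```julia
# Uncompressed size of one batch, using the figures quoted above.
n_matrices = 4
rows, cols = 300, 4000
bytes_per_batch = n_matrices * rows * cols * sizeof(Float32)

println(bytes_per_batch)        # 19200000
println(bytes_per_batch / 1e6)  # 19.2 (MB per batch, i.e. per second)
```

At roughly 19 MB per second before compression, holding a whole batch in memory as a single byte vector is not a concern.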
OK, you nerd sniped me. Here is a demonstration of the Core driver:
julia> using HDF5, H5Zblosc, CRC32c
julia> checksum(dataset) = crc32c(copy(reinterpret(UInt8, dataset[:])))
checksum (generic function with 1 method)
julia> function create_file_inmemory(dataset = rand(1:10, 256, 256))
           @info "Dataset Checksum" checksum(dataset)
           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)
           # Create file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_create(name, HDF5.API.H5F_ACC_EXCL, HDF5.API.H5P_DEFAULT, fapl)
           h5f = HDF5.File(fid, name)
           write_dataset(h5f, "lz4_comp_dataset", dataset, chunk=(16,16), filters=BloscFilter())
           HDF5.API.h5f_flush(h5f, 1)
           # Get file image
           buf_len = HDF5.API.h5f_get_file_image(h5f, C_NULL, 0)
           inmemfile = Vector{UInt8}(undef, buf_len)
           HDF5.API.h5f_get_file_image(h5f, inmemfile, length(inmemfile))
           # Finish
           close(h5f)
           return inmemfile
       end
create_file_inmemory (generic function with 2 methods)
julia> function read_file_inmemory(inmemfile::Vector{UInt8})
           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)
           HDF5.API.h5p_set_file_image(fapl, inmemfile, length(inmemfile))
           # Open the file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_open(name, HDF5.API.H5F_ACC_RDONLY, fapl)
           h5f = HDF5.File(fid, name)
           display(h5f)
           dataset = h5f["lz4_comp_dataset"][]
           # Finish
           close(h5f)
           @info "Dataset Checksum" checksum(dataset)
           return dataset
       end
read_file_inmemory (generic function with 1 method)
julia> inmemfile = create_file_inmemory();
┌ Info: Dataset Checksum
└ checksum(dataset) = 0x3ecfcea2
julia> read_file_inmemory(inmemfile);
🗂️ HDF5.File: (read-only) inmemtest
└─ 🔢 lz4_comp_dataset
┌ Info: Dataset Checksum
└ checksum(dataset) = 0x3ecfcea2
julia> write("ondisk.h5", inmemfile)
117208
julia> run(`h5ls -v ondisk.h5`)
Opened "ondisk.h5" with sec2 driver.
lz4_comp_dataset Dataset {256/256, 256/256}
Location: 1:800
Links: 1
Chunks: {16, 16} 2048 bytes
Storage: 524288 logical bytes, 100352 allocated bytes, 522.45% utilization
Filter-0: blosc-32001 OPT {2, 2, 8, 2048, 5, 1, 0}
Type: native long
Process(`h5ls -v ondisk.h5`, ProcessExited(0))
https://github.com/JuliaIO/HDF5.jl/pull/1077 should make reading and writing files from memory easier.
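With a release of HDF5.jl that includes that change, the round trip might look roughly like the sketch below. It assumes, based on my reading of the package documentation, that `h5open` accepts a `driver` keyword, that `Vector{UInt8}(file)` returns the file image, and that `h5open` accepts a byte vector plus a `name` keyword; check these against the installed version. The helper names `hdf5_bytes` and `read_hdf5_bytes` are hypothetical.

```julia
using HDF5

# Hypothetical helper: build an HDF5 file purely in memory and return its bytes.
function hdf5_bytes(data::AbstractArray; dataset::AbstractString = "data")
    driver = HDF5.Drivers.Core(; backing_store=false)   # nothing is written to disk
    return h5open("in_memory_write.h5", "w"; driver=driver) do file
        file[dataset] = data
        flush(file)              # make sure the image is complete before copying it
        Vector{UInt8}(file)      # assumed API: file image as a byte vector
    end
end

# Hypothetical helper: open a byte vector as a read-only HDF5 file and read one dataset.
function read_hdf5_bytes(bytes::Vector{UInt8}; dataset::AbstractString = "data")
    return h5open(bytes, "r"; name="in_memory_read.h5") do file   # assumed API
        read(file, dataset)
    end
end

A = rand(Float32, 4, 5)
bytes = hdf5_bytes(A)            # these bytes could be sent over HTTP, MQTT, etc.
B = read_hdf5_bytes(bytes)
```

The file names passed here are only labels for the in-memory file; with `backing_store=false` they are not expected to appear on disk.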
Wow, thanks so much, this would have taken me quite long to figure out, if at all! You are the best @mkitti! It is exactly what we need, and I think this should also be what @ExpandingMan was looking for.
It is important to be able to read and write from general IO objects rather than just files. This is really important in case you need to stash files over a network, for example with AWS S3, rather than on the file system. I don't know how cooperative the HDF5 library is going to be with this. Skimming through the code, it does not look like it will be easy.