JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License

reading and writing from generic `IO` objects #552

ExpandingMan opened this issue 5 years ago

ExpandingMan commented 5 years ago

It is important to be able to read and write from generic `IO` objects rather than just files, for example when you need to stash files over a network (such as to AWS S3) rather than to the file system.

I don't know how cooperative the HDF5 library is going to be with this. Skimming through the code, it does not look like it will be easy.

ggggggggg commented 5 years ago

Can you be more specific? E.g., do you want to do something like this?

s3_bucket_with_hdf5_file_in_it = S3.openbucket()
h5 = HDF5.h5open(s3_bucket_with_hdf5_file_in_it, "rw")
h5["a"]=4
x = h5["b"]
ExpandingMan commented 5 years ago

In that case it would involve writing to an `IOBuffer` object, taking the `Vector{UInt8}` buffer, and sending it to AWS via HTTP. Something like

io = IOBuffer()
get_data_from_s3!(s3, io)
h5 = h5open(io)

I suppose.

baumgold commented 2 years ago

Hi @ExpandingMan. Did you ever find a solution for reading HDF5 files from an S3 object store? The HDF5/S3 connector may be useful here, but I'm not sure if this has since been solved in a different way. Thanks!

mkitti commented 2 years ago

The canonical way would be to use HDF5 virtual file drivers. https://docs.hdfgroup.org/hdf5/v1_12/_v_f_l.html

baumgold commented 2 years ago

@mkitti - that’s what I suspected. Any idea if this is available/integrated with HDF5.jl? My understanding is that virtual file drivers need to be selected at HDF5 compile time. I presume we’ll need some changes from HDF5_jll to get this support?

mkitti commented 2 years ago

We started adding support for drivers here: https://github.com/JuliaIO/HDF5.jl/blob/master/src/drivers/drivers.jl

We may be able to use the Core driver to read an I/O stream completely into memory and use that.
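For reference, selecting the Core driver on a file access property list looks roughly like this (a minimal sketch based on the driver support linked above):

using HDF5

# File access property list whose driver keeps the file purely in memory;
# backing_store=false means nothing is ever written to disk.
fapl = HDF5.FileAccessProperties()
fapl.driver = HDF5.Drivers.Core(; backing_store=false)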

ExpandingMan commented 2 years ago

I haven't had many occasions to use HDF5, but when I did I was certainly resorting to temp files, which is not ideal. On Linux it's very easy to do this all in-memory (you can use /dev/shm or another in-memory directory), but there is probably still a lot of overhead to that, so it's in no way an ideal solution.

gszep commented 2 years ago

At the very least, could this lib support the ROS3 driver written by The HDF Group? Perhaps following this Python-equivalent PR: https://github.com/h5py/h5py/pull/1755. I recommend aiming for the following solution:

h5open(s3path; driver=Drivers.ROS3()) do file
    file
end
mkitti commented 2 years ago

The ROS3 driver seems quite distinct from the rest of the issue. Could you create a new issue, please?

denglerchr commented 1 year ago

Hello, I am trying to send and receive some HDF5 files over the network without writing to a file. I think the only way to do this would be to read them from a generic IO object or from a byte array. Is there any update on this; can this be done at this point?

mkitti commented 1 year ago

I think we might be able to do this with H5FD_CORE, via HDF5.Drivers.Core and HDF5.API.h5p_set_file_image.

mkitti commented 1 year ago

See also https://portal.hdfgroup.org/display/HDF5/HDF5+File+Image+Operations#HDF5FileImageOperations-1.IntroductiontoHDF5FileImageOperations

Basically, I think we have exposed the underlying low-level C API needed to do this in Julia, but have not created a high-level API for it.
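
Roughly, the low-level sequence would look something like this (a sketch assembled from the calls named above; a complete worked example appears further down in this thread):

using HDF5

buf = read(io)  # assumption: `io` is a generic IO holding a complete HDF5 file image

# Point a Core-driver file access property list at the in-memory image
fapl = HDF5.FileAccessProperties()
fapl.driver = HDF5.Drivers.Core(; backing_store=false)
HDF5.API.h5p_set_file_image(fapl, buf, length(buf))

# Open the image read-only; the name is arbitrary since nothing exists on disk
fid = HDF5.API.h5f_open("inmem.h5", HDF5.API.H5F_ACC_RDONLY, fapl)
h5f = HDF5.File(fid, "inmem.h5")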

denglerchr commented 1 year ago

Thanks, unfortunately I am not familiar with the low-level API at all, but I'll see if I can get this to work somehow. I found that h5py supports this already; maybe one day this could work in Julia as well by just passing an IO object instead of a filename? From https://docs.h5py.org/en/stable/high/file.html?highlight=driver#h5py.File.driver

"""Create an HDF5 file in memory and retrieve the raw bytes

This could be used, for instance, in a server producing small HDF5
files on demand.
"""
import io
import h5py

bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f['dataset'] = range(10)

data = bio.getvalue() # data is a regular Python bytes object.
print("Total size:", len(data))
print("First bytes:", data[:10])
mkitti commented 1 year ago

I looked into how h5py does that: they implemented a virtual file driver: https://github.com/h5py/h5py/blob/2e95e93b1331fd6b9c43dea38c863642624d319c/h5py/h5fd.pyx#L87-L101

In [77]: with h5py.File(bio, "w") as f:
    ...:     print(f._id.get_access_plist().get_driver())
    ...: 
576460752303423496

In [78]: h5py.h5fd.fileobj_driver
Out[78]: 576460752303423496

This is a bit overkill if all you need to do is read it into memory, though.

denglerchr commented 1 year ago

Would implementing something like h5py's approach make it onto this package's roadmap for the near future? We have a project that will require such data over the network, and HDF5 was used previously. An alternative might be to send the data flattened as vectors in the Arrow format, though.

mkitti commented 1 year ago

> We have a project that will require such data over the network, and HDF5 was used previously.

Have you considered the ROS3 (read only S3) driver? Do you need write capability over the network as well?

Another approach is detailed here: https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935

It might be good to fully understand what you mean by "network" here and what your requirements are for access. Are you using chunked datasets with compression? Is read-only OK, or do you need read-write? To what degree does this have to scale?

> Would implementing something like h5py's approach make it onto this package's roadmap for the near future?

The custom file driver approach does not seem very hard to do. It's actually significantly easier to do from Julia, so it's mainly about time and priorities.

We're gearing up for a 0.17 breaking release, so that's where my focus is at the moment.

simonbyrne commented 1 year ago

A simple alternative would be to write to a RAM disk, then copy it over.
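
For example (a minimal sketch; assumes a Linux system where /dev/shm is a tmpfs mount):

using HDF5

# Write to a tmpfs-backed temporary directory, then read the raw bytes back
path = joinpath(mktempdir("/dev/shm"), "data.h5")
h5open(path, "w") do f
    f["dataset"] = rand(Float32, 300, 4000)
end
bytes = read(path)  # the file image as a Vector{UInt8}, ready to send over the network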

denglerchr commented 1 year ago

Our application is an R&D project that involves a line scanner (basically a laser plus a high-definition camera), and reading from the network would be enough as a first step. The data is collected by a C++ program and then distributed to consumers in a batch, approximately every second, over MQTT (local network only). Some analysis of the collected batch is then to be done in Julia and the result forwarded using MQTT again. The data would be 4 matrices of around 300x4000 Float32 values every second; we wanted to use HDF5 files with blosc-lz4 compression.

denglerchr commented 1 year ago

The C++ part still has to be adapted anyway; I am currently working on a specification where I describe this data exchange. I am leaning towards Arrow, tbh, but I will try a bit more with HDF5 as it was used in a similar project. Maybe PyCall and h5py would also be a solution.

mkitti commented 1 year ago

OK, you nerd-sniped me. Here is a demonstration of the Core driver:

julia> using HDF5, H5Zblosc, CRC32c

julia> checksum(dataset) = crc32c(copy(reinterpret(UInt8, dataset[:])))
checksum (generic function with 1 method)

julia> function create_file_inmemory(dataset = rand(1:10, 256, 256))
           @info "Dataset Checksum" checksum(dataset)

           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)

           # Create file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_create(name, HDF5.API.H5F_ACC_EXCL, HDF5.API.H5P_DEFAULT, fapl)
           h5f = HDF5.File(fid, name)
           write_dataset(h5f, "lz4_comp_dataset", dataset, chunk=(16,16), filters=BloscFilter())
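           # Scope 1 is H5F_SCOPE_GLOBAL: flush everything so the in-memory image is complete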
           HDF5.API.h5f_flush(h5f, 1)

           # Get file image
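           # Passing C_NULL makes h5f_get_file_image return just the required buffer size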
           buf_len = HDF5.API.h5f_get_file_image(h5f, C_NULL, 0)
           inmemfile = Vector{UInt8}(undef, buf_len)
           HDF5.API.h5f_get_file_image(h5f, inmemfile, length(inmemfile))

           # Finish
           close(h5f)
           return inmemfile
       end
create_file_inmemory (generic function with 2 methods)

julia> function read_file_inmemory(inmemfile::Vector{UInt8})
           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)
           HDF5.API.h5p_set_file_image(fapl, inmemfile, length(inmemfile))

           # Open the file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_open(name, HDF5.API.H5F_ACC_RDONLY, fapl)
           h5f = HDF5.File(fid, name)
           display(h5f)
           dataset = h5f["lz4_comp_dataset"][]

           # Finish
           close(h5f)
           @info "Dataset Checksum" checksum(dataset)
           return dataset
       end
read_file_inmemory (generic function with 1 method)

julia> inmemfile = create_file_inmemory();
┌ Info: Dataset Checksum
└   checksum(dataset) = 0x3ecfcea2

julia> read_file_inmemory(inmemfile);
🗂️ HDF5.File: (read-only) inmemtest
└─ 🔢 lz4_comp_dataset
┌ Info: Dataset Checksum
└   checksum(dataset) = 0x3ecfcea2

julia> write("ondisk.h5", inmemfile)
117208

julia> run(`h5ls -v ondisk.h5`)
Opened "ondisk.h5" with sec2 driver.
lz4_comp_dataset         Dataset {256/256, 256/256}
    Location:  1:800
    Links:     1
    Chunks:    {16, 16} 2048 bytes
    Storage:   524288 logical bytes, 100352 allocated bytes, 522.45% utilization
    Filter-0:  blosc-32001 OPT {2, 2, 8, 2048, 5, 1, 0}
    Type:      native long
Process(`h5ls -v ondisk.h5`, ProcessExited(0))
mkitti commented 1 year ago

https://github.com/JuliaIO/HDF5.jl/pull/1077 should make reading and writing files from memory easier.
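
Once that lands, usage could look something like this (purely hypothetical sketch; the exact API will depend on what is merged):

using HDF5

buf = read("ondisk.h5")  # raw bytes of an HDF5 file image
h5f = h5open(buf)        # hypothetical method: open a file image directly from a byte buffer
close(h5f)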

denglerchr commented 1 year ago

Wow, thanks so much, this would have taken me quite long to figure out, if at all! You are the best @mkitti! It is exactly what we need, and I think this should also be what @ExpandingMan was looking for?