HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

How to read HDF5 file in Vaex data frame #123

Closed nagarajmmu closed 1 year ago

nagarajmmu commented 1 year ago

Hi John

Using Vaex, I am trying to read an HDF5 file from Azure Blob Storage with the code below, but I get "FileNotFoundError: /blob_name/home/testFile_fromPython.h5":

df = vaex.open("/blob_name/home/testFile_fromPython.h5", fs=fs)

With the same code, I can read Parquet and CSV files into a Vaex DataFrame.

Please help me read an HDF5 file from Azure Blob Storage into a Vaex DataFrame.

Thanks in advance.

jreadey commented 1 year ago

Hey,

I don't think Vaex supports reading from Azure Blob Storage, only AWS S3 and other S3-compatible cloud storage.
See: https://vaex.readthedocs.io/en/latest/guides/io.html.

I think it would be very interesting to have Vaex support HSDS. Then you could have HSDS running on Azure, and HSDS would deal with the storage API issues. Since Vaex uses h5py, it should just be a matter of substituting h5pyd appropriately. We did something similar to this to support HSDS in the h5netcdf package.
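
A minimal sketch of that kind of substitution, assuming the calling code only uses the parts of the h5py API that h5pyd mirrors. The helper names `backend_name` and `open_hdf5` are hypothetical, and the prefix check (`http://`, `https://`, `hdf5://`) is an assumption loosely modeled on the h5netcdf approach; none of this is Vaex code.

```python
import importlib
import re

# Paths that look like HSDS domains or endpoints go to h5pyd; anything
# else is treated as a local file for h5py. The prefix list is an assumption.
_REMOTE = re.compile(r"^(http|https|hdf5)://")


def backend_name(path):
    """Return the name of the module that should open `path`."""
    return "h5pyd" if _REMOTE.match(path) else "h5py"


def open_hdf5(path, mode="r", **kwargs):
    """Open `path` with h5py or h5pyd, importing the backend lazily."""
    backend = importlib.import_module(backend_name(path))
    return backend.File(path, mode, **kwargs)
```

With this in place, a call such as `open_hdf5("hdf5://home/file.hdf5")` would be routed to h5pyd, provided the library is installed and an HSDS endpoint is configured.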

nagarajmmu commented 1 year ago

Hi John

Thanks for the confirmation.

I am reading an HDF5 file (300 MB) that is stored in Azure Blob Storage. Reading the file and converting it to a pandas DataFrame with the code below takes 428 seconds.

import time
import h5pyd
import pandas as pd

time1 = time.time()
f = h5pyd.File('/home/file.hdf5', 'r')
data = f['someData'][:]  # read the whole dataset from HSDS
df = pd.DataFrame(data)
print("Total time in reading........... ", (time.time() - time1))

Do you have any suggestions on this? For comparison, writing takes 182 seconds.
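
To see where the time goes, it may help to time the HSDS read and the pandas conversion separately, and to read the dataset in row blocks rather than in one request. A rough sketch: `timed` and `row_blocks` are hypothetical helpers, and the block size of 100,000 rows in the usage comment is an arbitrary assumption, not a tuned value.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent in the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def row_blocks(n_rows, block_rows):
    """Yield slices that cover rows 0..n_rows in blocks of `block_rows`."""
    for start in range(0, n_rows, block_rows):
        yield slice(start, min(start + block_rows, n_rows))


# Usage with h5pyd (not executed here; needs a running HSDS endpoint):
#   timings = {}
#   with h5pyd.File('/home/file.hdf5', 'r') as f:
#       dset = f['someData']
#       with timed("read", timings):
#           parts = [dset[s] for s in row_blocks(dset.shape[0], 100_000)]
#   with timed("to_dataframe", timings):
#       df = pd.DataFrame(np.concatenate(parts))
#   print(timings)
```

If almost all of the time lands under "read", the bottleneck is the transfer from HSDS rather than the DataFrame conversion.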

jreadey commented 1 year ago

It does seem strange that the read would take longer than the write. Are you reading and writing from an Azure VM that is in the same region as the blob store?

If you can provide a link to the file, I can try it myself.

nagarajmmu commented 1 year ago

Hi John

The Azure blob container and the VM on which HSDS is installed are in the same region. I am reading from my local machine. I tried the read twice, and 428 seconds is the shorter of the two times. The setup is in a development environment that is not allowed to be accessed from outside.

One question: the VM is a Standard D2s v3 (2 vCPUs, 8 GiB memory). Could the VM size be the issue?

nagarajmmu commented 1 year ago

Hi John

If the pandas DataFrame is what takes the extra time, which data frame library do you prefer, or which do people typically use, to read data from HSDS?

In general, when an application reads data from HSDS, which data frame library is used to load the HDF5 file?
Please let me know.

jreadey commented 1 year ago

Hey, I've made some changes to HSDS and h5pyd that should speed things up. Please give it a try when you get a chance!

You can get the HSDS image from: hdfgroup/hsds:sha-5b17ed1 and the latest h5pyd with pip install h5pyd --upgrade. It should be h5pyd version 0.10.3. Please let me know if you see any improvements for the Vaex loading.
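
One way to confirm that the upgrade took effect is to query the installed package metadata from Python. This is only a sketch: `installed_version` and `version_tuple` are hypothetical helpers, and the simple dotted-number comparison assumes a plain release version such as 0.10.3.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of `dist_name`, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Turn '0.10.3' into (0, 10, 3); non-numeric parts are skipped."""
    return tuple(int(p) for p in text.split(".") if p.isdigit())


found = installed_version("h5pyd")
if found is None:
    print("h5pyd is not installed")
elif version_tuple(found) >= (0, 10, 3):
    print("h5pyd", found, "is new enough")
else:
    print("h5pyd", found, "is older than 0.10.3")
```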

jreadey commented 1 year ago

Closing this issue now - please re-open if you would like to discuss further.