HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Problem reading data back from h5serv #104

Closed: hickey closed this issue 2 years ago

hickey commented 2 years ago

I threw a quick script together to validate that my installation of h5serv is working, but I am having trouble reading data back from h5serv. I am new to working with HDF5 files and the h5py/h5pyd modules, so it could just be something that I am not understanding.

The script that I am using is as follows:

#!/usr/bin/env python3

import h5py_switch
import numpy as np
import argparse
import logging
from pprint import pprint

def open_hdf5_file(args, mode):
    opts = {}

    if args.endpoint:
        opts['endpoint'] = args.endpoint
    if args.username:
        opts['username'] = args.username
        opts['password'] = args.password

    return h5py_switch.File(args.hdf5, mode, **opts)

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('--write', '-w', action='store_true',
        help='Write data to an HDF5 file')
    parser.add_argument('--read', '-r', action='store_true',
        help='Read data from an HDF5 file')
    parser.add_argument('--input', '--in', '-i', metavar='FILE',
        help='Data input for writes')
    parser.add_argument('--output', '--out', '-o', metavar='FILE',
        help='Data output for reads')
    parser.add_argument('--endpoint', '--ep', '-e', default=None,
        help='HTTP endpoint for h5serv')
    parser.add_argument('--username', '--user', '-u', default=None,
        help='Username to authenticate with')
    parser.add_argument('--password', '--pass', '--passwd', '-p', default=None,
        help='Password for authenticating to HDF5 store')
    parser.add_argument('--token', '-t', default=None,
        help='API authentication token')
    parser.add_argument('--debug', '-d', action='store_true',
        help='Generate a log of actions taken')
    parser.add_argument('hdf5', metavar='HDF5_FILE')

    args = parser.parse_args()

    if args.debug:
        loglevel = logging.DEBUG
        logging.basicConfig(format='%(asctime)s %(message)s', level=loglevel)

    if args.write:

        f = open_hdf5_file(args, 'w')

        # dset = f.create_dataset('dset', [8, 8],  dtype='int8')
        # logging.info("name: %s" % dset.name)
        # logging.info("shape: %s" % dset.shape)
        # logging.info("chunks: %s" % dset.chunks)
        # logging.info("dset.type: %s" % dset.dtype)
        # logging.info("dset.maxshape: %s" % dset.maxshape)

        rng = np.random.default_rng(seed=42)
        data = rng.random((8,8))
        print(data)
        f.create_dataset('dset2', data=data, dtype='float32')

        f.close()

    if args.read:
        f = open_hdf5_file(args, 'r')

        print("keys = %s" % list(f.keys()))
        for k in f.keys():
            #print("=== %s  shape: %s  id: %s ===" % (k, f[k].shape, f[k].id))
            print("=== %s id: %s ===" % (k, f[k].id))
            arr = np.array(f["/"+k])
            pprint(arr)
            #pprint(f["/"+k].values()[0])

I execute the script and write to h5serv as follows:

root@8b23dea41fe5:/code# ./hdf5-test.py --username xxxx --password xxxx --endpoint https://h5.wt0f.com -w test6.wt0f.com
[[0.77395605 0.43887844 0.85859792 0.69736803 0.09417735 0.97562235
  0.7611397  0.78606431]
 [0.12811363 0.45038594 0.37079802 0.92676499 0.64386512 0.82276161
  0.4434142  0.22723872]
 [0.55458479 0.06381726 0.82763117 0.6316644  0.75808774 0.35452597
  0.97069802 0.89312112]
 [0.7783835  0.19463871 0.466721   0.04380377 0.15428949 0.68304895
  0.74476216 0.96750973]
 [0.32582536 0.37045971 0.46955581 0.18947136 0.12992151 0.47570493
  0.22690935 0.66981399]
 [0.43715192 0.8326782  0.7002651  0.31236664 0.8322598  0.80476436
  0.38747838 0.2883281 ]
 [0.6824955  0.13975248 0.1999082  0.00736227 0.78692438 0.66485086
  0.70516538 0.78072903]
 [0.45891578 0.5687412  0.139797   0.11453007 0.66840296 0.47109621
  0.56523611 0.76499886]]

I then turn around and try to read the data back in:

root@8b23dea41fe5:/code# ./hdf5-test.py --username xxxx --password xxxx --endpoint https://h5.wt0f.com -r test6.wt0f.com
keys = ['dset2']
=== dset2 id: <h5pyd._hl.objectid.DatasetID object at 0x7f4a8e6e2b20> ===
array([[1.1301152e-29, 3.0915447e-41, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00]],
      dtype=float32)

I have also copied the file from the server running h5serv to the local directory and tried to read the contents of the file:

root@8b23dea41fe5:/code# ./hdf5-test.py -r test6.h5
keys = ['__db__', 'dset2']
=== __db__ id: <h5py.h5g.GroupID object at 0x7f6bdb2d2f90> ===
array(['{addr}', '{ctime}', '{datasets}', '{datatypes}', '{groups}',
       '{mtime}'], dtype='<U11')
=== dset2 id: <h5py.h5d.DatasetID object at 0x7f6bdb2d2f90> ===
array([[0.77395606, 0.43887845, 0.85859793, 0.697368  , 0.09417735,
        0.97562236, 0.7611397 , 0.7860643 ],
       [0.12811363, 0.45038593, 0.37079802, 0.92676497, 0.6438651 ,
        0.8227616 , 0.4434142 , 0.22723871],
       [0.5545848 , 0.06381726, 0.8276312 , 0.6316644 , 0.75808775,
        0.35452595, 0.970698  , 0.8931211 ],
       [0.7783835 , 0.19463871, 0.466721  , 0.04380377, 0.1542895 ,
        0.68304896, 0.7447622 , 0.96750975],
       [0.32582536, 0.3704597 , 0.46955582, 0.18947136, 0.12992151,
        0.47570494, 0.22690935, 0.669814  ],
       [0.4371519 , 0.8326782 , 0.7002651 , 0.31236663, 0.8322598 ,
        0.80476433, 0.38747838, 0.2883281 ],
       [0.6824955 , 0.13975248, 0.1999082 , 0.00736227, 0.78692436,
        0.66485083, 0.7051654 , 0.78072906],
       [0.45891577, 0.5687412 , 0.139797  , 0.11453007, 0.66840297,
        0.47109622, 0.5652361 , 0.76499885]], dtype=float32)

As you can see, the file itself seems to be just fine.

Note: I am using the h5py_switch module, which returns an h5pyd.File object when I am accessing the file on the h5serv server and an h5py.File object when I read the file locally. Not sure if this is really significant or not, but I figured I would call it out.
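Roughly, my understanding is that the module does something like this (a sketch of the dispatch, not its actual source):

# Illustrative sketch of the kind of dispatch h5py_switch does --
# not the actual module source.
import h5py
import h5pyd

def File(path, mode, **kwargs):
    # With an endpoint supplied, hand the open off to the REST client;
    # otherwise fall back to local h5py.
    if kwargs.get('endpoint'):
        return h5pyd.File(path, mode, **kwargs)
    # h5py.File does not understand the server-only options.
    for key in ('endpoint', 'username', 'password'):
        kwargs.pop(key, None)
    return h5py.File(path, mode, **kwargs)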

Wondering if there is anything obvious that I am doing wrong in my test script or if there is an easy way to determine if the problem is with h5serv or the h5pyd module.

hickey commented 2 years ago

I should also note (although it really ought to be a separate issue) that the endpoint parameter behaves differently if there is a slash at the end of the value.

>>> f = h5pyd.File('test4', 'r', endpoint='https://h5.wt0f.com/', username='xxxx', password='xxxxx')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/h5pyd/_hl/files.py", line 236, in __init__
    raise IOError(rsp.status_code, rsp.reason)
OSError: [Errno 400] Bad Request
>>> f = h5pyd.File('test4', 'r', endpoint='https://h5.wt0f.com', username='xxxx', password='xxxx')
>>>

That just seems wrong to me. If there is a reason a final slash is not acceptable, shouldn't the code strip the final slash when one is supplied?
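Something like this near the top of the File constructor would make both forms work (just a sketch, not the actual h5pyd code):

# Strip any trailing slash so request URLs do not end up with "//" in the path.
if endpoint is not None:
    endpoint = endpoint.rstrip('/')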

hickey commented 2 years ago

OK, part of this is my limited knowledge of numpy and of working with HDF5 datasets. I found that if I use index notation, [...] or [:], I get the correct data back. I was using some of the Jupyter notebooks in the examples directory as a guide for accessing the h5serv server, so I am not sure where the examples led me astray.
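For anyone who hits the same thing, changing the read loop to slice the dataset explicitly returns the real data (the rest of the script is unchanged):

for k in f.keys():
    # Slicing the Dataset fetches the values from the server;
    # np.array(f["/"+k]) on its own gave the uninitialized-looking output above.
    arr = f["/" + k][...]   # [:] works as well
    pprint(arr)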

jreadey commented 2 years ago

Hey, @hickey - glad you sorted this out! Sorry for my lack of response; I haven't had time to help out with h5serv recently. Our (The HDF Group's) main focus is HSDS (https://github.com/HDFGroup/hsds), which is more of a "new generation" HDF service.

Are there specific reasons you have for using h5serv rather than HSDS? It would be nice to have everyone move over to HSDS.

hickey commented 2 years ago

Mostly that I did not know about HSDS. Looking now.

hickey commented 2 years ago

Just a thought, you may want to update the README to start directing people over to the HSDS project.

hickey commented 2 years ago

I have to amend my statement above about directing people over to the HSDS project. I finally got around to bringing up the HSDS docker container. While I can see where the HSDS project is going, it operates at a much bigger scale than I need, and I suspect bigger than what others may need. So I would note in the README that if one is just sharing a couple of HDF5 files and doesn't need to scale out to support hundreds or thousands of clients, it is probably just as well to stay with this project.

jreadey commented 2 years ago

Interesting - is it that HSDS seems more complicated to spin up compared to h5serv (I would have thought they were fairly equivalent)?

BTW - probably the easiest way to share some files is to just put them in a public S3 bucket. Users can either download the files or read them directly with s3fs (Python) or the ros3 VFD (for C/C++).
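With s3fs it's only a few lines (the bucket and key here are made up):

# Read an HDF5 file directly out of a public S3 bucket via s3fs.
# "mybucket/data/example.h5" is a placeholder object key.
import h5py
import s3fs

s3 = s3fs.S3FileSystem(anon=True)   # anonymous access to a public bucket
with s3.open("mybucket/data/example.h5", "rb") as s3file:
    with h5py.File(s3file, "r") as f:
        print(list(f.keys()))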

hickey commented 2 years ago

Well, if I were to use HSDS I would need to set up a service node and a data node. In addition, I would need an S3 bucket to connect to the data node. A whole lot more infrastructure than I need.

Using an S3 bucket directly is less desirable, as I would have to pull the HDF5 file down, use it, and then upload it again. That is a whole lot more operations (and failure modes) than just slurping the data in through an HDF5 client, processing it, and saving it again. I can do much better error handling in code than by trying to interpret why one of the S3 transfer utilities exited.

jreadey commented 2 years ago

Right - there are more containers, but it's all managed for you by the runall.sh script. And rather than an S3 bucket, you can just have a directory on your server for data storage. See: https://github.com/HDFGroup/hsds/blob/master/docs/docker_install_posix.md.

But I do think h5serv has the edge in terms of hosting a set of existing HDF5 files. With HSDS you either need to convert them into the HSDS sharded format (using hsload) or extract the file metadata (with hsload --link).

I've been thinking it would be nice to have HSDS just use an existing set of HDF5 files as is - but I will need to think about that a bit.