HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API
Other
109 stars 39 forks source link

Convenience functions and behavior modification of Folder object #75

Open JacobDuckett opened 4 years ago

JacobDuckett commented 4 years ago

After using h5pyd for a while, I'd like to propose a few new functions and changes to the Folder object. I believe these changes allow for more fluid file management programs and match the h5py syntax better.

New functions/properties:

  1. Folder.create_subfolder(subfolder_name, **kwargs):

    • creates subfolder in the current folder. Returns new folder object. Raises error if current folder permissions do not permit writing.
  2. Folder.create_subdomain(subdomain_name, **kwargs):

    • creates subdomain in the current folder. Returns new domain object. Raises error if current folder permissions do not permit writing.
  3. Folder.load(local_filepath/filepaths, **kwargs):

    • loads local file to current domain. Basically calls h5pyd._apps.utillib.load_file to the current domain.
  4. Folder.subfolders: => property

    • returns a list of child folder objects
  5. Folder.subdomains: => property

    • returns a list of child domain/h5file objects
  6. Folder.subfolders: => property

    • returns a list of child folder objects
  7. Folder.info => property

    • returns what Folder.__repr__() is currently returning, i.e. dict of object information

Changes to existing functions:

  1. Folder.__getitem__(path):

    • return object of item at the given path, e.g. a user with some folder object x could index into x['path/to/my/data.h5']
  2. Folder.__str__():

    • return either Folder.domain or Folder.info, not sure about this one.
  3. Folder.__repr__():

    • return <HSDS folder {full_url} (mode {mode})> to match standard h5py/p5pyd convention
  4. Folder.parent: => property

    • return parent object instead of parent domain

I can build a branch in my fork incorporating these changes along with general SOH updates like PEP8 compliance and standardized docstrings. Is there a specific docstring style that is preferred?

As for testing, should I continue to add lines into the three existing tests or would you prefer a more granular approach, i.e. a test case for each individual function? The current tests don't necessarily need to be removed, but I would think a more granular approach would highlight errors more clearly.

jreadey commented 4 years ago

These all sound like useful additions excepting the load method. I'm a bit hesitant to have circular dependencies between the _hl and _apps code.

I'd suggest adding tests to the existing test_folder for now. We can refactor if it gets unwieldy large.

I'm afraid the current code base is not very consistent as far as docstrings go. Anyway, let's follow the format that h5py uses.

JacobDuckett commented 4 years ago

Ok, I can skip the load method for now until we figure out a better strategy. It might be that _apps can be refactored at some point to move the functional capabilities over to _hl so that _apps largely consists of business logic for the specific scripts.

JacobDuckett commented 4 years ago

For the .info property I'm just calling self._http_conn.GET('/') and that seems to be working fine, but I'm unsure what to do in the cases that the Folder is the root directory or that the user actually gave an endpoint for a file. Should the Folder object even initialize if the given path is a File object?

JacobDuckett commented 4 years ago

I've also seen multiple cases where a File object is referred to as a domain. I had made a mistake in my original post and would like to add methods to return all subfolders via a .subfolders property, all subdomains (files and folders) via a .subdomains property, and all subfiles (h5 domains) via some property. Any recommendations on naming or should I just go with a .subfiles property?

jreadey commented 4 years ago

Can't the info returned by the GET in the constructor be used for info?
For now, lets only allow initialization where the argument is a string pointing to a domain.

I'm thinking about enabling h5pyd to support direct access to Files, but we can deal with that later. I do think it would be useful to provide feedback that indicates the folder is not pointing to a file. In h5pyd File, we support using the prefix "hdf5://" to distinguish posix files from HSDS domains (see line 134 in files.py). Having the equivalent support in Folder would make sense. I'd suggest that info then return "hdf5://".

I've attempted (with lots of omissions) to use the word "Domain" rather then "File" so that HSDS domains are not confused with HDF5 posix files. If we have a subFile method in Folder that might cause more confusion. Rather than a new method, how about an optional parameter on subdomains (e.g. "include_folders=True")?

Re: the root Folder - it's a bit of a special case. It has no settable properties other than it's subfolders (there's no JSON object that corresponds to the root Folder). So I think the following will need to be handled in code:

JacobDuckett commented 4 years ago

Ok, I've got the constructor storing the relevant information from the GET into hidden attributes that get called from .info now. It may be that the lastModified attribute needs to get modified down the road to handle real-time monitoring, but for now that can be handled by just creating a new Folder instance.

It seems that there are two competing syntax issues - one between HDF5 posix files and HSDS domains/files and another between H5PYD domains (Files) and HSDS domains (Files and Folders). I have the .subfiles and .subdomains properties working but I could modify it to just all fall under a subdomains function that includes arguments to include files and folders. I have found that returning a list of the subfolders/subdomains can be somewhat slow due to each constructor making a separate request. It might be worth it to look at initializing Folders & Domains asynchronously or providing an alternate constructor for instances where the relevant information from the GET_Domain information is already known (e.g. calling `http_conn.GET('/domains') from the parent folder).

I'm thinking about enabling h5pyd to support direct access...

I'm not sure I understand this paragraph. I was under the assumption that h5py is used for direct POSIX access and h5pyd is used for HTTP. Is the thought that the interfaces/packages may merge in the future?

I'm still unsure of whether a Folder should instantiate if the object at the given domain is not of class folder (lines 166-168 in folders.py). I'd have to update some of what I've made to handle this scenario, but I'm unsure of whether this is intended behavior or the use case.

jreadey commented 4 years ago

About h5pyd direct access...

Currently h5pyd get data by making requests to an HSDS (or h5serv) server. In turn the HSDS server is reading data in the object storage schema (https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.md). There are a lot of advantages to this type of client-server architecture, but there are some challenges to. E.g. you need to scale the server to match the load from the clients.

Direct access would add a capability to h5pyd so that it could directly read from the storage medium. You will (hopefully) get the same result but without having to make any server requests.

If you are running a job with 100's or 1000's of workers, it may be more practical to run these using the direct access model vs. accessing via the server.

All this is a bit orthogonal from how the schema objects are actually stored (POSIX Files vs Object Storage objects). HSDS currently requires an object storage system, but it wouldn't be much work to enable the schema to be stored in a POSIX filesystem as well.

ghost commented 4 years ago

Maybe too late for this but would prefer dropping "sub" from "subfolders" and "subdomains".

JacobDuckett commented 4 years ago

I'm fine with that considering it's hitting the /domains endpoint, domains would be more transparent.

JacobDuckett commented 4 years ago

Considering the naming conventions I'm leaning towards just making a property for folders and domains that return a generator for each object class. Users can convert them to lists if needed or chain them together if needed.