bluesky / suitcase-tiff

http://nsls-ii.github.io/suitcase

Accept file handles as well as file names. #2

Closed danielballan closed 5 years ago

danielballan commented 5 years ago

Expanding on a suggestion I first pitched to @awalter-bnl over DM.

Our current API:

def export(filepath):
    ...

expects filepath to be a string. If we want to serve the exported data through a socket or serialize it in memory with StringIO, is that possible? We, the caller, don't know how many files we'll need -- generally there's one per stream plus a 'header'-like file. We could pass in a callable like this:

def export(filepath):
    ...
    for stream_name in stream_names:
        if callable(filepath):
            file = filepath()  # expected to return a writeable handle
        else:
            file = open(f'{filepath}_{stream_name}', 'w')

That would enable both the current "easy mode" usage export('path/to/data') and "flexible mode" export(StringIO). It would return a tuple (dict?) of the type corresponding to the input -- filepaths in the first case, handles in the second.
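To make the two modes concrete, here is a hypothetical usage sketch (the return types are the open question above):

from io import StringIO

# "easy mode": a path prefix; export opens a real file per stream
paths = export('path/to/data')

# "flexible mode": any zero-argument callable returning a writeable handle.
# Passing the StringIO class itself works, since StringIO() is a new buffer.
buffers = export(StringIO)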

But there is a kind of asymmetry in the example above. What if I pass a callable but still want to incorporate the stream name into whatever it produces? This API would be more flexible and consistent:

def export(filepath):
    ...
    if callable(filepath):
        file_handle_maker = filepath
    else:
        file_handle_maker = lambda stream_name: open(f'{filepath}_{stream_name}', 'w')
    for stream_name in stream_names:
        file = file_handle_maker(stream_name)
        ...

The usage for the "flexible mode" would now be export(lambda stream_name: StringIO()) for cases like StringIO where we don't want to do anything with stream_name and need to employ the lambda to throw it away. But I think the symmetry between the two cases is better.

I believe this addresses all the problems in suitcase-csv. But for suitcase-tiff, we are writing a mixture of text (JSON header) and binary (TIFF). We would need something like export(lambda stream_name: (StringIO(), BytesIO())). Is this too overloaded? Do we want to implement two separate functions in every suitcase, rather than doing this "clever" argument type inspection -- suitcase.tiff.export('path/to/data') as a thin wrapper around suitcase.tiff.dump(lambda stream_name: StringIO(), lambda stream_name: BytesIO())?
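Sketching that split (all names hypothetical): export would be a thin wrapper over a dump that takes one handle factory for the text parts and one for the binary parts.

from io import StringIO, BytesIO

def dump(header_factory, image_factory):
    ...  # write JSON headers via header_factory, TIFF frames via image_factory

def export(filepath):
    # easy mode: factories that open real files, text and binary respectively
    return dump(lambda stream_name: open(f'{filepath}_{stream_name}.json', 'w'),
                lambda stream_name: open(f'{filepath}_{stream_name}.tiff', 'wb'))

# fully in-memory variant:
dump(lambda stream_name: StringIO(), lambda stream_name: BytesIO())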

awalter-bnl commented 5 years ago

The more we get into this, the more I am becoming convinced that we want to use two separate functions; in fact, I think this is a necessary change to accommodate both the average user and the flexible use case. My argument here is that the average user, who has very little idea of Python, would be very quickly confused seeing lambda stream_name: StringIO() ... in the docstrings and is likely to tune out or decide that this is just too complicated for them to use.

awalter-bnl commented 5 years ago

FYI, the current behaviour of both suitcase-csv and suitcase-tiff is to automatically suffix the 'file_path' with _stream_name, but this is not done for, say, suitcase-json, where we can export everything as one file. Anyway, personally I think we do not want to make this optional. I am a little confused, however, why:

def export(filepath):
    ...
    if callable(filepath):
        file_handle_maker = filepath
    else:
        file_handle_maker = lambda stream_name: open(f'{filepath}_{stream_name}', 'w')
    for stream_name in stream_names:
        file = file_handle_maker(stream_name)
        ...

would lead to an API like suitcase.tiff.dump(lambda stream_name: StringIO(), lambda stream_name: BytesIO()) instead of just suitcase.tiff.dump(StringIO, BytesIO). The conversion from the filepath string, including the reference to stream_name, can be taken care of prior to passing everything into suitcase.tiff.dump, can it not? I.e. I am thinking of the two functions looking like:

def dump(file_handler, meta_handler):  # suitcase.tiff.dump
    ...
    for stream in streams:
        file = file_handler()
        ...
    ...
    meta = meta_handler()
    ...
    return ...  # tuple or dict mapping stream_names to output file-like objects

and

def export(filepath):  # suitcase.tiff.export
    ...
        file_tiff = open(f'{filepath}_{stream_name}.tiff', 'wb')  # TIFF data is binary
        file_json = open(f'{filepath}.json', 'w')

        return dump(file_tiff, file_json)

or

def dump(file_handler):  # suitcase.json.dump
    ...
    for stream in streams:
        file = file_handler()
        ...
    ...
    return ...  # tuple or dict mapping stream_names to output file-like objects

def export(filepath):  # suitcase.json.export
    ...
        file_json = open(f'{filepath}.json', 'w')

        return dump(file_json)
danielballan commented 5 years ago

The issue is that the high-level, inflexible one should be a thin wrapper around the low-level one, and it should be possible to do anything with the low-level one that you can do with the high-level one. This requires the file handle factory function to be passed the stream name.
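To illustrate (a minimal sketch, names hypothetical): if the factory is not passed the stream name, the thin wrapper has no way to build per-stream filenames.

def dump(file_handle_maker):
    for stream_name in stream_names:
        file = file_handle_maker(stream_name)  # the factory needs the stream name
        ...

def export(filepath):
    # the thin wrapper bakes the stream name into the path
    return dump(lambda stream_name: open(f'{filepath}_{stream_name}', 'w'))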

danielballan commented 5 years ago

In order to accommodate the variation between different formats and their needs, I think we should establish multiple functions with fixed, well-defined inputs. Any given suitcase.* may implement the relevant subset of those functions. So, if the user (or some GUI code) finds suitcase.*.export is defined, they can have clear expectations about how it will work, generic across formats. Likewise for the others.

That set of functions might also encompass serializers for non-file-based formats, such as suitcase.mongo, which might have a method like suitcase.mongo.insert(gen, mongo_uri).
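A sketch of that vocabulary (function names illustrative, drawn from the discussion below):

suitcase.csv.export(gen, 'path/to/data')           # high-level: filepath in, filepaths out
suitcase.csv.dump_multi(gen, file_handle_factory)  # low-level: one handle per stream
suitcase.json.dump(gen, file_handle)               # low-level: single-file formats
suitcase.mongo.insert(gen, mongo_uri)              # non-file-based serializer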

awalter-bnl commented 5 years ago

I am happy with @danielballan's suggestions above. Essentially, I think each suitcase.* will have either dump or dump_multi, and that export will be a wrapper around this. I suspect that all of these will take gen, plus one of filepath/file handle/file handle factory, plus some kwargs that are passed down to the low-level function used to do the export/dump?

It is also now clear to me that dump/dump_multi will need to accept a filepath or a file handle/file handle factory, as stream_name is not known in the higher-level function. I think this is OK. A sketch of those signatures follows.
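Sketching the signatures that pattern implies (hypothetical, per the comment above):

def export(gen, filepath, **kwargs): ...                 # high-level wrapper
def dump(gen, file_handle, **kwargs): ...                # single-file formats
def dump_multi(gen, file_handle_factory, **kwargs): ...  # one file per stream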

tacaswell commented 5 years ago

It is probably worth looking at how python deals with other "file-of-files" cases: https://docs.python.org/3.7/library/tarfile.html , https://docs.python.org/3.7/library/zipfile.html?highlight=zip#zipfile-objects (which have slightly different APIs :stuck_out_tongue_winking_eye: )
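For comparison (a minimal sketch, not suitcase code), zipfile's answer to the same "file-of-files" problem looks roughly like this:

import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zf:
    # open() hands back a writeable handle into the archive
    with zf.open('primary.csv', mode='w') as f:
        f.write(b'time,det\n0,1\n')
print(zipfile.ZipFile(buf).namelist())  # ['primary.csv']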

Internally maybe we use something like

class MultiFileWrapper:
    def __init__(self, prefix, ...):
        ...
    def open(self, label, postfix, *, mode='wb') -> FileLikeContextManager:
        # this should be the main workhorse
        # do we require label to be unique?
...
    def reserve_name(self, label, postfix) -> str:
        # just in case something _really_ wants a filename (like h5py < 2.9)
        ...
    def make_dir(self, label, postfix) -> MultiFileWrapper:
        # maybe?
        ...
    def summary(self) -> Dict[str, Union[str, FileLike]]:
        # returns a mapping between label and {the abs path to file or the FileLike}
        # if we have `make_dir` do we flatten or recurse?
        ...

or something like that. The wrapper would be responsible for appending the postfix from the methods onto the prefix from the init to get the absolute path and/or generating the right StringIO or BytesIO objects, keeping track of what files/buffers it has created, making sure it never over-writes its own files, enforcing whatever bookkeeping we want to put on label, etc.
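Hypothetical use of such a wrapper inside an exporter, following the sketch above:

manager = MultiFileWrapper(prefix='scan42_')
tiff_bytes = b'...'  # stand-in for encoded TIFF data
with manager.open('stream_data', 'primary.tiff', mode='wb') as f:
    f.write(tiff_bytes)
# for libraries that insist on a real filename rather than a handle:
path = manager.reserve_name('run_metadata', 'meta.json')
print(manager.summary())  # maps labels to paths and/or file-like objects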

I imagine that the main use of the suitcases will not be end-users typing at the prompt, but automated tools more akin to the file-management system at APS. One of the things I was thinking about just now was how we would want to produce a web page to view the artifacts generated by export. For a single file it is just a link to the file + some header information. For a multi-file artifact we probably want to make each of the sub-files available independently + a tarball of everything + some header information. For that we probably want some semantic labels (hence 'label') in addition to the postfix. (I also assume the exporter may template the postfix heavily, so it is nice to have some stability in the return type between calls to the same suitcase, even if there is little between suitcases.)

I imagine we will have at least 2 of these (all-files-all-the-time and all-buffers-all-the-time), but variants that automatically generate tar or zip files would be interesting! But the win is that we do not need to have a bunch of different functions that take slightly different inputs (the to-database ones are still a bit weird, but you probably would not want to drop-in-replace those with the file-based ones). We could go with the heuristic that if export gets a string, we use that as the prefix to build an all-files-all-the-time wrapper; otherwise we assume that the input is one of these Wrapper classes.

I don't see any way to get the file-like input to work with exporters that want to write multiple files, as a) there is the text-vs-binary mode issue and b) the "if I just shove all the bytes in one stream, where are the edges?" issue (which, if we solved it, would effectively be re-inventing {tar, zip, rar}), and I think a well-defined wrapper class like the above is preferable to many (optionally implemented) export functions.

danielballan commented 5 years ago

It had not occurred to me that we could look to zipfile and tarfile libraries for examples of how Python wants us to do this. Good call.

I think a well-defined wrapper class like the above is preferable to many (optionally implemented) export functions.

Agreed. As I had expected and hoped, you have made a suggestion that I like better than mine. :-)

I agree that:

the main use of the suitcases will not be end-users typing at the prompt, but automated tools more akin to the file-management system at APS.

with the qualification that I think it's still important to find a way to implement suitcase.*.export(filename) to have a good story around "easy user-facing export capability" in 2019. But:

We could go with the heuristic that if export gets a string, we use that as the prefix to build an all-files-all-the-time wrapper; otherwise we assume that the input is one of these Wrapper classes.

works. @awalter-bnl, do you feel up to refactoring the file management logic into a class with the methods outlined above?

awalter-bnl commented 5 years ago

I imagine that the main use of the suitcases will not be end-users typing at the prompt, but automated tools more akin to the file-management system at APS.

I have the opposite opinion on this, and think this is the main use case for which these functions have been requested by users. We ignore, or complicate, this use-case at our peril.

We could go with the heuristic that if export gets a string, we use that as the prefix to build an all-files-all-the-time wrapper; otherwise we assume that the input is one of these Wrapper classes.

My above statement aside, I am not against the approach above, because it gives us the same easy use case, and we can make the docstring simpler by referencing a link to the MultiFileWrapper documentation.

Just so I am clear, essentially we are suggesting introducing a series of MultiFileWrappers around each file handler factory (like StringIO). We would need 2 such instances of this class per file handler factory, one to replace @danielballan's dump and one to replace the dump_multi?

If this is what is being suggested then I am happy to have a go at it; it will probably take some time, however, as it is not a small amount of work.

danielballan commented 5 years ago

What @tacaswell is suggesting:

from pathlib import Path
import collections

# SuitcaseValueError and SuitcaseTypeError are assumed to be defined elsewhere,
# e.g. as subclasses of ValueError/TypeError in a shared suitcase utility module.

class MultiFileWrapper:
    """
    A class that manages multiple files.

    This design is inspired by Python's zipfile and tarfile libraries.
    """
    def __init__(self, directory):
        self._directory = Path(directory)
        self._reserved_names = set()
        self._artifacts = collections.defaultdict(list)

    @property
    def artifacts(self):
        return dict(self._artifacts)

    def reserve_name(self, label, postfix):
        """
        Ask the wrapper for a filepath.

        An external library that needs a filepath (not a handle)
        may use this instead of the ``open`` method.

        Parameters
        ----------
        label : string
            partial file name (i.e. stream name)
        postfix : string
            relative file path and filename

        Returns
        -------
        name : Path
        """
        if Path(postfix).is_absolute():
            raise SuitcaseValueError(f"{postfix} must be structured like a relative file path.")
        name = (self._directory / Path(postfix)).expanduser().resolve()
        if name in self._reserved_names:
            raise SuitcaseValueError(f"The postfix {postfix!r} has already been used.")
        self._reserved_names.add(name)
        self._artifacts[label].append(name)
        return name

    def open(self, label, postfix, mode, encoding=None, errors=None):
        """
        Request a file handle.

        Like the built-in open function, this may be used as a context manager.

        Parameters
        ----------
        label : string
            partial file name (i.e. stream name)
        postfix : string
            relative file path and filename
        mode : {'x', 'xt', 'xb'}
            'x' or 'xt' for text, 'xb' for binary
        encoding : string or None
            Passed through to open. See Python open documentation for allowed values. Only applicable to text mode.
        errors : string or None
            Passed through to open. See Python open documentation for allowed values.

        Returns
        -------
        file : handle
        """
        filepath = self.reserve_name(label, postfix)
        # TODO Check that mode is one of the modes we allow (per docstring).
        return open(filepath, mode, encoding=encoding, errors=errors)

def export(gen, directory, ...):
    """
    ...

    directory : string, Path, or Wrapper
        ...
    """

    if isinstance(directory, (str, Path)):
        wrapper = MultiFileWrapper(directory)
    else:
        wrapper = directory  # assume the caller passed in a Wrapper instance
    meta_file = wrapper.open('run metadata', 'meta.csv', 'x')
    for stream_name in stream_names:
        file = wrapper.open('stream_data', f'{stream_name}.csv', 'x')
        ...
    ...
    return wrapper.artifacts

Note that files opened for the TIFF exporter will need the mode 'xb' instead of 'x'. We can write and pass in our own wrappers later as applications demand. Here's what a StringIO/BytesIO one would look like:

import io

class MemoryBuffersWrapper:
    """
    A class that manages multiple StringIO and/or BytesIO instances.

    This design is inspired by Python's zipfile and tarfile libraries.

    This has a special buffers attribute which can be used to retrieve
    buffers created.
    """
    def __init__(self):
        self._reserved_names = set()
        self._artifacts = collections.defaultdict(list)
        self.buffers = {}  # maps postfixes to buffer objects

    @property
    def artifacts(self):
        return dict(self._artifacts)

    def reserve_name(self, label, postfix):
        """
        Ask the wrapper for a filepath.

        In-memory buffers have no filepath, so this always raises.
        An exporter that requires an explicit filename (not a handle)
        cannot use this wrapper.
        """
        raise SuitcaseTypeError(
            "MemoryBuffersWrapper is incompatible with exporters that require explicit filenames.")

    def open(self, label, postfix, mode, encoding=None, errors=None):
        """
        Request a file handle.

        Like the built-in open function, this may be used as a context manager.

        Parameters
        ----------
        label : string
            partial file name (i.e. stream name)
        postfix : string
            relative file path and filename
        mode : {'x', 'xt', 'xb'}
            'x' or 'xt' for text, 'xb' for binary
        encoding : string or None
            Not used. Accepted for compatibility with built-in open().
        errors : string or None
            Not used. Accepted for compatibility with built-in open().

        Returns
        -------
        file : handle
        """
        # Of course, in-memory buffers have no filepath, but we still expect postfix to be
        # a thing that looks like a relative filepath, and we use it as a unique identifier for a given buffer.
        if Path(postfix).is_absolute():
            raise SuitcaseValueError(f"{postfix} must be structured like a relative file path.")
        name = Path(postfix).expanduser().resolve()
        if name in self._reserved_names:
            raise SuitcaseValueError(f"The postfix {postfix!r} has already been used.")
        self._reserved_names.add(name)
        self._artifacts[label].append(name)
        if mode in ('x', 'xt'):
            buffer = io.StringIO()
        elif mode == 'xb':
            buffer = io.BytesIO()
        else:
            raise SuitcaseValueError("mode must be 'x', 'xt' or 'xb'")
        self.buffers[postfix] = buffer
        return buffer

and the usage with export would be simply:

wrapper = MemoryBuffersWrapper()
suitcase.*.export(gen, wrapper)
# Use wrapper.artifacts and wrapper.buffers to sort through the buffers and use them.
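For example, pulling a buffer back out afterwards (assuming a stream was written under the postfix 'primary.csv'):

text = wrapper.buffers['primary.csv'].getvalue()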

Perhaps these wrappers, which are fully generic across different exporters, should live in a new repo called suitcase.utils, on which all suitcases depend. It seems a bit too export-specific to live in event-model.

To summarize some features of these wrappers:

- They track every name/buffer they hand out, exposed via the artifacts property (a dict mapping labels to outputs).
- They refuse to reuse a postfix, so an exporter can never overwrite its own output.
- reserve_name supports external libraries that need a real filepath rather than a handle.
- The same exporter code works unchanged for files on disk and for in-memory buffers.

awalter-bnl commented 5 years ago

This has been resolved with PR#1

danielballan commented 5 years ago

Merged in https://github.com/NSLS-II/suitcase-utils/pull/2