asdf-format / asdf

ASDF (Advanced Scientific Data Format) is a next generation interchange format for scientific data
http://asdf.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Customize paths of external blocks #152

Open rossant opened 9 years ago

rossant commented 9 years ago

Is it possible to customize the filenames and subdirectories where the exploded block files are saved?

mdboom commented 9 years ago

Not presently with the explode command, but there's nothing about the file format that would prevent it.

Can you describe in more detail what you'd like to do?

embray commented 9 years ago

I wonder whether, for the sake of consistency/sanity, the ASDF standard shouldn't specify a default naming scheme for the files produced by "exploding" a file, while giving libraries the option to use a different scheme (left up to the implementation) if requested by the user.

rossant commented 9 years ago

> while giving libraries the option to use a different scheme (left up to the implementation) if requested by the user.

do you intend to do that in pyasdf?

mdboom commented 9 years ago

Would the specification of a destination pattern be enough? For example:

some_directory/{source}_{block_no}.asdf

where {source} is replaced with the original root filename, and {block_no} is replaced with the block number?

By this convention, the current behavior would be defined as {source}{block_no}.asdf.
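A minimal sketch of how such a destination pattern could be expanded, assuming the placeholders are handled by Python's `str.format`. The helper name `exploded_block_path` is hypothetical and is not part of pyasdf.

```python
import os

def exploded_block_path(pattern, source_path, block_no):
    """Expand a destination pattern for one external block (hypothetical helper)."""
    # {source} is the original root filename, without directory or extension.
    source = os.path.splitext(os.path.basename(source_path))[0]
    return pattern.format(source=source, block_no=block_no)

# A user-specified pattern that includes a directory.
print(exploded_block_path("some_directory/{source}_{block_no}.asdf", "data/test.asdf", 0))

# The pattern described above as the current behavior.
print(exploded_block_path("{source}{block_no}.asdf", "data/test.asdf", 3))
```

Because `str.format` accepts format specs, a pattern such as `{source}{block_no:04d}.asdf` would give zero-padded block numbers without any extra code.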

embray commented 9 years ago

That's sort of what I was thinking too. If just a directory destination is given it could use the default pattern. But allowing a user-specified pattern (including the directory) would work too.

rossant commented 9 years ago

actually in our case it would be more complicated, since we'd want to use a subdirectory structure based on the hierarchy in the Tree

mdboom commented 9 years ago

Can you describe your use case in more detail? I think that may break down if data in a block is shared between multiple arrays in the tree.
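To illustrate the concern, here is a sketch that uses plain Python dictionaries rather than the pyasdf API. The tree layout, the names `spikes` and `raw`, and the helper `array_paths` are all made up for the example. It maps each array to a path derived from the tree hierarchy and then lists blocks that end up with more than one candidate path.

```python
def array_paths(tree, prefix=()):
    """Yield (path, array) for every leaf of a nested dict (hypothetical layout)."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from array_paths(value, prefix + (key,))
        else:
            yield "/".join(prefix + (key,)), value

shared = bytearray(16)  # stand-in for one binary block
tree = {
    "spikes": {"spike_times": shared},
    "raw": {"channel_0": shared},  # a second array backed by the same block
}

# Group tree paths by the block object they refer to.
paths_by_block = {}
for path, block in array_paths(tree):
    paths_by_block.setdefault(id(block), []).append(path)

# A block reachable from more than one path has no single natural location.
ambiguous = [paths for paths in paths_by_block.values() if len(paths) > 1]
print(ambiguous)
```

A hierarchy-based scheme would need a rule for such blocks, for example writing the block once and pointing the other paths at it.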

embray commented 9 years ago

I think writing out individual child-objects in a hierarchical data structure is a different use case than what exploded form is for.

embray commented 9 years ago

To make a FITS analogy, exploded form is (somewhat) like writing the FITS header and the binary data to separate files. Whereas I think what @rossant is asking is more akin to writing each HDU to a separate file (albeit with a directory structure representing hierarchy that doesn't exist in FITS, but may in ASDF). That may be a little too application specific, but sounds worth talking about.

rossant commented 9 years ago

Long story short, we're looking for a format for neurophysiology data that enables easy discovery of key data arrays. For a given dataset, we have a hierarchy of data arrays, but only 1 or 2 are used by 95% of our users. Having explicit names for the files would let a typical user find these important arrays easily.

Here's an example. You're a typical user, you have a dataset, and you don't know anything about the format. You see a subdirectory named spike_times containing a binary array and a metadata JSON file with the array's information (dtype, shape, etc.). Then you should be able to open that array with no difficulty in any programming language (typically MATLAB, which is still one of the dominant languages in the community...)
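As a rough sketch of how little code that scenario needs, the following uses only the Python standard library. The file names `data.bin` and `meta.json`, the metadata keys, and the helper `read_flat_array` are assumptions made for the example, not an existing format. It also assumes the data is stored in the machine's native byte order; real metadata would need to record the byte order.

```python
import array
import json
import os
import tempfile

# Map the dtype names used in the (assumed) metadata to array typecodes.
DTYPE_TO_TYPECODE = {"float64": "d", "float32": "f"}

def read_flat_array(directory):
    """Read `data.bin` using the dtype and shape stored in `meta.json`."""
    with open(os.path.join(directory, "meta.json")) as f:
        meta = json.load(f)
    values = array.array(DTYPE_TO_TYPECODE[meta["dtype"]])
    with open(os.path.join(directory, "data.bin"), "rb") as f:
        values.frombytes(f.read())
    return meta["shape"], values.tolist()

# Write a tiny example dataset, then read it back.
directory = os.path.join(tempfile.mkdtemp(), "spike_times")
os.makedirs(directory)
with open(os.path.join(directory, "data.bin"), "wb") as f:
    array.array("d", [0.01, 0.25, 1.5]).tofile(f)
with open(os.path.join(directory, "meta.json"), "w") as f:
    json.dump({"dtype": "float64", "shape": [3]}, f)

shape, values = read_flat_array(directory)
print(shape, values)
```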

So far we've been using HDF5, but we're having way too many problems. Accessibility is bad; you need an HDF5 library in order to see what's in a file, whereas a text metadata file can be viewed by anyone, and a flat binary file can be opened easily in any language.

We were about to create our own custom format, but then we discovered ASDF, which is pretty close to what we need. The two main differences are the directory structure and YAML, which seems basically unsupported in MATLAB.

embray commented 9 years ago

I had a quick look around and found at least a couple of YAML interfaces for MATLAB that wrap LibYAML in a MEX binary. But I'm guessing your point is that MATLAB has JSON support out of the box (I don't know)?

That said, I think that with a YAML interface, a rudimentary ASDF reader in MATLAB could be achieved pretty easily. We also have plans for a C implementation of ASDF on the horizon, which could be added to MATLAB via the same approach.

Getting back to your specific use case though, it does make a lot of sense. However, even in the "exploded" form the individual binary blocks have a block header of, I think, about 40 bytes, so your user would still have to know at least enough to skip past that header to reach the array data.
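For reference, here is a sketch of what skipping that header could look like with Python's `struct` module. The field layout is my reading of the block header in the ASDF standard (magic token, a 16-bit header size, then flags, compression, three 64-bit sizes and an MD5 checksum, all big-endian) and should be checked against the specification. The function `read_block_payload` is hypothetical. It assumes the file object is already positioned at the start of a block, so finding the block inside a real exploded file is outside the sketch, and it does not handle compressed blocks.

```python
import io
import struct

BLOCK_MAGIC = b"\xd3BLK"
# Fields after header_size, all big-endian: flags, compression,
# allocated_size, used_size, data_size, MD5 checksum.
HEADER_FIELDS = struct.Struct(">I4sQQQ16s")

def read_block_payload(f):
    """Return the raw payload of the block at the current file position."""
    if f.read(4) != BLOCK_MAGIC:
        raise ValueError("not positioned at an ASDF block")
    (header_size,) = struct.unpack(">H", f.read(2))
    header = f.read(header_size)  # may be longer than the known fields
    flags, compression, allocated, used, data_size, checksum = \
        HEADER_FIELDS.unpack(header[:HEADER_FIELDS.size])
    if compression != b"\0\0\0\0":
        raise ValueError("compressed blocks are not handled in this sketch")
    return f.read(used)

# Build a synthetic, uncompressed block in memory and read it back.
payload = struct.pack(">3d", 0.01, 0.25, 1.5)
header = HEADER_FIELDS.pack(0, b"\0\0\0\0", len(payload), len(payload),
                            len(payload), b"\0" * 16)
block = BLOCK_MAGIC + struct.pack(">H", len(header)) + header + payload

data = read_block_payload(io.BytesIO(block))
print(struct.unpack(">3d", data))
```

With this layout the payload starts `6 + header_size` bytes after the start of the magic token, so a reader in another language only needs to read the 16-bit size at offset 4 and seek past the header.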

The "exploded form" was not really meant for this case--I think (and @mdboom can expand) it is more of a performance trick. For example, if an application has to stream some data to the end of a table that's embedded in an ASDF file, it can first "explode" the file so that the binary block containing the table is in a file by itself, and can be streamed to directly without having to shift around the rest of the file. But once the writing is done the full file can then be reassembled. There is also a kind of "streaming" block for this use case, which carries with it the restriction that no other blocks can follow it in the file.

That said, there might be a case for including simple instructions somewhere for manually reading the array data in an ASDF header, and translating that to reading the array in from the binary block. What do you think? It would be great to get the neuroscience community using ASDF--we have them to thank for matplotlib too by way of John Hunter :)