cortner closed this issue 2 years ago
So the route that @albapa and I have taken is to write a "fat" file with everything in it, even the original training data, i.e. enough to actually rerun the training with a future version of the code, but structured in such a way that the file can be transformed (simply by removing lines) into a "thin" format that is just enough to evaluate the potential, possibly using restricted versions of the code (in our case, any version of the code newer than the one that wrote the file, but you could be even thinner than that).
We write the fat version by default, because users often don't mind large files, and it helps debugging. If a utility is provided to transform fat files into thin files, then users don't need to carry around large files when size is a problem. Developers who might be creating a huge number of potential files in a short space of time during development will know how to switch on the thin writer.
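The fat-to-thin transformation described above can be sketched as follows. This is a minimal illustration, not the actual QUIP/ACE schema: the key names (`potential`, `fit_options`, `training_data`) are hypothetical stand-ins for "enough to evaluate" versus "enough to refit".

```python
import json

# Hypothetical "fat" potential file: fitted parameters bundled with the
# training data and fit options, so the fit could be rerun later.
fat = {
    "potential": {"basis": "ACE", "order": 3, "coefficients": [0.1, -0.2, 0.05]},
    "fit_options": {"solver": "lsqr", "regularisation": 1e-6},
    "training_data": [  # the bulky part: configurations with energies/forces
        {"positions": [[0.0, 0.0, 0.0]], "energy": -1.23},
    ],
}

def thin(fat_doc):
    """Drop the bulky sections, keeping just enough to evaluate the potential."""
    keep = {"potential"}
    return {k: v for k, v in fat_doc.items() if k in keep}

thin_doc = thin(fat)
print(json.dumps(thin_doc))
```

Because the thin document is a strict subset of the fat one, the same reader can load either; the structured layout is what makes "removing lines" a valid transformation.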
Ok so that sounds like some form of mixed thin/fat format would be ideal.
Or just a structured fat file, so that it is easy to remove the fat…
-- Gábor
We used to use a binary format, which was very fast but a pain in all other respects. Then we went for XML, with CDATA blocks for the meta-data, i.e. training configurations and command-line options. I think this was a very good choice, for the reasons above. We also have the option of companion files to store large numbers of reals, which are slow and cumbersome to read through XML - these are read in C. In the training code there is an option to omit the training data, which is useful for explorations and quick tests; for distribution we use the full version.
I guess today we would use a json file.
So far I've stored huge amounts of reals in a separate HDF5 file. So similar to your approach.
What's your view on JSON (or XML) compressed as zip where needed?
I was going to say go with BSON, but then I saw your message on Slack... Maybe read/write is still faster though.
I think zipping the JSON would be perfect - although I don't know how the parsing performance (of the default libraries) compares to XML.
I think that's where I'm going then. Julia has very nice zip format integration via ZipFile.jl
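The zipped-JSON approach settled on above can be sketched with Python's standard library, as an analogue of what ZipFile.jl provides in Julia (this is an illustration of the idea, not the actual Julia implementation; the entry name `potential.json` is hypothetical):

```python
import io
import json
import zipfile

doc = {"potential": {"coefficients": [0.1, -0.2, 0.05]}}

buf = io.BytesIO()  # in-memory buffer standing in for a file on disk

# Write: a single JSON entry inside a DEFLATE-compressed zip archive.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("potential.json", json.dumps(doc))

# Read it back and parse.
with zipfile.ZipFile(buf, "r") as zf:
    loaded = json.loads(zf.read("potential.json"))
```

One nice property of this layout is that the archive can later hold several entries (e.g. parameters in one, training data in another), so the fat/thin split maps onto simply omitting entries.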
I'm going to close this - Zipped JSON files turn out to be easy to manage in Julia and exactly the level of flexibility we need.
I am trying to settle on a file format for v1.x and would appreciate feedback on some thoughts:
There are essentially two perspectives:
There are benefits to both, I think. E.g., (1) will lead to MUCH smaller files, but on the other hand it can only be read and understood with the code that wrote it; (2), on the other hand, will contain lots of "meta-data"-type information that is not needed to reconstruct the types, but it will make it easy to write a parser at some point that can read the file even if the original code is lost or cannot be made to run for whatever reason...
As I'm writing this I wonder whether there is a third way:
@gabor1 your perspective would be particularly appreciated here.