cortner closed this issue 2 years ago
So the route that @albapa and I have taken is to write a "fat" file with everything in it, even the original training data, i.e. enough to actually rerun the training with a future version of the code, but structured in such a way that the file can be transformed (simply by removing lines) into a "thin" format that is just enough to evaluate the potential, possibly using restricted versions of the code (in our case, any version of the code newer than the one that wrote the file, but you could be even thinner than that).
We write the fat version by default, because users often don't mind large files, and it helps debugging. If a utility is provided to transform fat files into thin files, then users don't need to carry around large files when size is a problem. Developers who might be creating a huge number of potential files in a short space of time during development will know how to switch on the thin writer.
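The fat-to-thin transformation described above can be sketched as follows. This is a minimal illustration, not the actual QUIP/ACE schema: the key names (`potential`, `fit_options`, `training_data`) are hypothetical stand-ins for "enough to evaluate" versus "enough to refit".

```python
import json

# Hypothetical "fat" potential file: fitted parameters bundled with the
# training data and fit options, so the fit could be rerun later.
fat = {
    "potential": {"basis": "ACE", "order": 3, "coefficients": [0.1, -0.2, 0.05]},
    "fit_options": {"solver": "lsqr", "regularisation": 1e-6},
    "training_data": [  # the bulky part: configurations with energies/forces
        {"positions": [[0.0, 0.0, 0.0]], "energy": -1.23},
    ],
}

def thin(fat_doc):
    """Drop the bulky sections, keeping just enough to evaluate the potential."""
    keep = {"potential"}
    return {k: v for k, v in fat_doc.items() if k in keep}

thin_doc = thin(fat)
print(json.dumps(thin_doc))
```

Because the thin document is a strict subset of the fat one, the same reader can load either; the structured layout is what makes "removing lines" a valid transformation.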
Ok so that sounds like some form of mixed thin/fat format would be ideal.
Or just a structured fat file, so that it is easy to remove the fat…
-- Gábor
We used to use a binary format, which was very fast but a pain in all other respects. Then we went for XML, with CDATA blocks for the meta-data, i.e. training configurations and command-line options. I think this was a very good choice, for the reasons above. We also have the option of companion files to store large numbers of reals, which are slow and cumbersome to read through XML - these are read in C. In the training code there is an option to omit the training data, which is useful for explorations and quick tests; for distribution we use the full version.
I guess today we would use a json file.
So far I've stored huge amounts of reals in a separate HDF5 file. So similar to your approach.
What's your view on JSON (or XML) compressed as zip where needed?
I was going to say go with BSON, but then I saw your message on Slack... Maybe read/write is still faster though.
I think zipping the JSON would be perfect - although I don't know how the parsing performance (of the default libraries) compares to XML.
I think that's where I'm going then. Julia has very nice zip format integration via ZipFile.jl
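The zipped-JSON approach settled on above can be sketched with Python's standard library, as an analogue of what ZipFile.jl provides in Julia (this is an illustration of the idea, not the actual Julia implementation; the entry name `potential.json` is hypothetical):

```python
import io
import json
import zipfile

doc = {"potential": {"coefficients": [0.1, -0.2, 0.05]}}

buf = io.BytesIO()  # in-memory buffer standing in for a file on disk

# Write: a single JSON entry inside a DEFLATE-compressed zip archive.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("potential.json", json.dumps(doc))

# Read it back and parse.
with zipfile.ZipFile(buf, "r") as zf:
    loaded = json.loads(zf.read("potential.json"))
```

One nice property of this layout is that the archive can later hold several entries (e.g. parameters in one, training data in another), so the fat/thin split maps onto simply omitting entries.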
I'm going to close this - Zipped JSON files turn out to be easy to manage in Julia and exactly the level of flexibility we need.
I am trying to settle on a file format for v1.x and would appreciate feedback on some thoughts:
There are essentially two perspectives:
There are benefits to both, I think. E.g., (1) will lead to MUCH smaller files, but on the other hand it can only be read and understood with the code that wrote it; (2), on the other hand, will contain lots of "meta-data"-type information that is not needed to reconstruct the types, but it will make it easy to write a parser at some point that can read the file even if the original code is lost or cannot be made to run for whatever reason...
As I'm writing this I wonder whether there is a third way:
@gabor1 your perspective would be particularly appreciated here.