choderalab / TrustButVerify

Unit Tests for Protein Force Fields
GNU General Public License v2.0

Discussing storage formats #9

Open jchodera opened 10 years ago

jchodera commented 10 years ago

I just wanted to start what will probably be a longer term discussion about storage models.

We should start out by sketching out the data we want to store, as well as the desiderata for our storage model.

Data we need to store:

Metadata

kyleabeauchamp commented 10 years ago

IMHO we also need the app-layer ffXML file. The serialized XML file + Mol2 is nearly a "lossy" compression, in that we've essentially lost topology information at that point.

If we're going through some flavor of antechamber, we might also want those intermediate files as a paper trail, as there's some ambiguity about how one gets charges from the mol2 file.

There's also the box PDB file or equivalent.

I propose that, for the short term, we simply switch to a directory-based model with one "simulation + metadata" per directory. I find this a more natural way to interface with existing tools (OpenMM + VMD + MDTraj + Amber + Antechamber) than dumping everything into a single NetCDF file, which requires programming to access.

With the directories, we will need some way to generate "metadata summaries" in some human + machine readable format, perhaps XML or JSON (as in the FAHMunge suggestion of Robert).
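
A minimal sketch of how such a metadata summary could be generated, assuming JSON as the format. The `metadata.json` file name inside each simulation directory, the `summary.json` output name, and both function names are hypothetical and not part of any existing tool:

```python
import json
from pathlib import Path


def summarize_simulation_dirs(root):
    """Collect one metadata record per simulation directory under root.

    Assumes (hypothetically) that each simulation directory holds a
    metadata.json file next to its trajectory and topology files.
    """
    summary = {}
    for sim_dir in sorted(Path(root).iterdir()):
        if not sim_dir.is_dir():
            continue
        meta_path = sim_dir / "metadata.json"
        record = json.loads(meta_path.read_text()) if meta_path.exists() else {}
        # List the files present so a reader can see what each directory holds.
        record["files"] = sorted(p.name for p in sim_dir.iterdir() if p.is_file())
        summary[sim_dir.name] = record
    return summary


def write_summary(root, out_name="summary.json"):
    """Write the combined summary as indented JSON at the top of root."""
    summary = summarize_simulation_dirs(root)
    out_path = Path(root) / out_name
    out_path.write_text(json.dumps(summary, indent=2, sort_keys=True))
    return out_path
```

The per-directory files stay usable by the existing tools, and the summary is the only piece that needs code to regenerate.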

kyleabeauchamp commented 10 years ago

I also think the directory structure will be beneficial for publication purposes. I think readers will be happier to find a directory of "familiar filetypes" than to receive a database file.

jchodera commented 10 years ago

I also think the directory structure will be beneficial for publication purposes. I think readers will be happier to find a directory of "familiar filetypes" than to receive a database file.

Provided everything is machine-readable and self-documenting, it should be trivial to convert back and forth, including preparing such files for postpublication release.

One thing I realize we should not be using is DCD files. They are not platform-portable, because they must be read on an architecture with the same endianness as the one they were written on.
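
To make the byte-order issue concrete, here is a small illustrative sketch. It assumes the conventional CHARMM/NAMD-style DCD layout, in which the file starts with a 4-byte Fortran record marker holding the integer 84 followed by the ASCII tag `CORD`; the function name is hypothetical, and this is not how any particular reader is implemented:

```python
import struct

# Assumption: a CHARMM/NAMD-style DCD file begins with a 4-byte Fortran
# record marker holding the integer 84, followed by the ASCII tag "CORD".
HEADER_MARKER = 84


def detect_dcd_byte_order(first_bytes):
    """Return '<' (little-endian) or '>' (big-endian) for a DCD header.

    Raises ValueError if the bytes do not look like a DCD header.
    """
    if len(first_bytes) < 8:
        raise ValueError("need at least 8 bytes of header")
    for order in ("<", ">"):
        (marker,) = struct.unpack(order + "i", first_bytes[:4])
        if marker == HEADER_MARKER and first_bytes[4:8] == b"CORD":
            return order
    raise ValueError("not a recognized DCD header")


# The same integer is laid out differently depending on the writing machine.
little = struct.pack("<i", HEADER_MARKER) + b"CORD"
big = struct.pack(">i", HEADER_MARKER) + b"CORD"
print(little[:4].hex(), detect_dcd_byte_order(little))
print(big[:4].hex(), detect_dcd_byte_order(big))
```

Every numeric field in the file has the same ambiguity, which is why a format with a declared byte order is preferable for archival data.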

kyleabeauchamp commented 10 years ago

I think we can still use DCD in the OpenMM scripts and convert afterwards for more long-term archival purposes.
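
A rough sketch of what the "convert afterwards" step could look like with MDTraj, assuming each simulation directory has a topology file (such as the box PDB) alongside the DCD. The function names and the `.nc` naming convention are hypothetical:

```python
from pathlib import Path


def netcdf_path_for(dcd_path):
    """Return the archival NetCDF path next to a DCD file (same stem, .nc)."""
    return Path(dcd_path).with_suffix(".nc")


def convert_dcd_to_netcdf(dcd_path, topology_path):
    """Convert one DCD trajectory to NetCDF using MDTraj.

    DCD stores no topology, so a topology file (e.g. the box PDB) is needed.
    """
    import mdtraj as md  # imported lazily so the path helper works without MDTraj

    out_path = netcdf_path_for(dcd_path)
    # Loads the whole trajectory into memory; fine for pilot-sized runs.
    trajectory = md.load(str(dcd_path), top=str(topology_path))
    trajectory.save_netcdf(str(out_path))
    return out_path
```

Something like `convert_dcd_to_netcdf("mol1/traj.dcd", "mol1/box.pdb")` would be run on the machine that wrote the DCD, so endianness never becomes an issue for the archived copy.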

kyleabeauchamp commented 10 years ago

There is also a tool available for converting the endianness of DCD files: http://www.ks.uiuc.edu/Development/MDTools/flipdcd/

jchodera commented 10 years ago

Neither of these suggestions is a substitute for a platform-portable file format. Surely there is another format that is actually portable.

On Sep 18, 2014 10:27 AM, "kyleabeauchamp" notifications@github.com wrote:

There is also a tool available for converting the endianness of DCD files: http://www.ks.uiuc.edu/Development/MDTools/flipdcd/

— Reply to this email directly or view it on GitHub https://github.com/choderalab/TrustButVerify/issues/9#issuecomment-56045827 .

kyleabeauchamp commented 10 years ago

Not available via Reporters in OpenMM. There is a NetCDF reporter available in MDTraj, but that means we can't visualize trajectories without conversion.

kyleabeauchamp commented 10 years ago

IMHO DCDReporter is the best short-term solution, given what we have.

jchodera commented 10 years ago

If we want to use the format for reweighting beyond just an initial test for the grant, we will also need the ability to store multiple datasets for each molecule, where the parameters differ for each dataset.

We will also want to keep these grouped together, because for each molecule we need to cache the energies of all snapshots from all of its datasets, computed with every System object for which that molecule was simulated. We will also need to cache the free energies from MBAR.
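
As a sketch of what "grouped together" could mean in practice, here is one possible in-memory layout. All class and field names are hypothetical, and the energies are held in plain lists only to keep the example dependency-free:

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One simulation of a molecule under one parameter set (hypothetical)."""
    parameter_set: str
    n_snapshots: int


@dataclass
class MoleculeRecord:
    """Groups every dataset for one molecule with its cached energies."""
    name: str
    datasets: list = field(default_factory=list)
    # energies[k][n]: energy of pooled snapshot n evaluated with the System
    # for parameter set k; None marks an entry that is not yet computed.
    energies: list = field(default_factory=list)
    # One free energy per parameter set, to be filled in after MBAR is run.
    free_energies: list = field(default_factory=list)

    def add_dataset(self, dataset):
        """Add a dataset: one new row, plus n_snapshots new columns per row."""
        self.datasets.append(dataset)
        total = sum(d.n_snapshots for d in self.datasets)
        for row in self.energies:
            row.extend([None] * dataset.n_snapshots)
        self.energies.append([None] * total)
        self.free_energies.append(None)

    def missing_entries(self):
        """Return (k, n) index pairs whose energies still need computing."""
        return [(k, n) for k, row in enumerate(self.energies)
                for n, value in enumerate(row) if value is None]
```

Adding a dataset grows the cache in both directions, which is the main reason the energies for one molecule are awkward to scatter across unrelated files.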

Perhaps we should collect some info on the wiki so we know what kinds of data will need to be stored?

Honestly, I am really leaning toward having one NetCDF per molecule. I think things are going to get pretty crazy otherwise.

If we isolate this choice behind a class that manages the data (as it sounds like you have), the underlying storage format should be simple to swap out.
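
A minimal sketch of that isolation, under the assumption that a per-molecule record can be expressed as a JSON-serializable dictionary. Every class name here is hypothetical; a NetCDF-backed class would implement the same two methods:

```python
import copy
import json
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Interface that hides the on-disk format from the rest of the code."""

    @abstractmethod
    def save(self, molecule, record):
        """Persist the record (a dict) for one molecule."""

    @abstractmethod
    def load(self, molecule):
        """Return the stored record (a dict) for one molecule."""


class DirectoryBackend(StorageBackend):
    """One directory per molecule, with the record kept as record.json."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, molecule, record):
        mol_dir = self.root / molecule
        mol_dir.mkdir(parents=True, exist_ok=True)
        (mol_dir / "record.json").write_text(json.dumps(record, indent=2))

    def load(self, molecule):
        return json.loads((self.root / molecule / "record.json").read_text())


class MemoryBackend(StorageBackend):
    """Keeps records in a dict; useful for tests."""

    def __init__(self):
        self._records = {}

    def save(self, molecule, record):
        self._records[molecule] = copy.deepcopy(record)

    def load(self, molecule):
        return copy.deepcopy(self._records[molecule])


class MoleculeStore:
    """Front end used by analysis code; the backend can be swapped freely."""

    def __init__(self, backend):
        self.backend = backend

    def put(self, molecule, record):
        self.backend.save(molecule, record)

    def get(self, molecule):
        return self.backend.load(molecule)
```

Code that only talks to `MoleculeStore` would not need to change if the directory layout were later replaced by one NetCDF file per molecule.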

jchodera commented 10 years ago

For pilot work, it's fine to continue with DCD. But we will need to move away from DCD before generating publishable data.

On Sep 18, 2014 10:42 AM, "kyleabeauchamp" notifications@github.com wrote:

IMHO DCDReporter is the best short-term solution, given what we have.

jchodera commented 10 years ago

We can revisit the reporter architecture issue as well in another week or two.

For now, what you have is great for generating preliminary grant data.

J

On Sep 18, 2014 10:41 AM, "kyleabeauchamp" notifications@github.com wrote:

Not available via Reporters in OpenMM. There is a NetCDF reporter available in MDTraj, but that means we can't visualize trajectories without conversion.

kyleabeauchamp commented 10 years ago

I agree that this will be a larger bottleneck when we begin reweighting work with multiple parameter sets. That sounds like a good time to discuss a flexible storage format.