choderalab / fahmunge

Tools for Munging Folding@Home datasets
MIT License

Initial Draft of Automated Data Munging #1

Closed kyleabeauchamp closed 10 years ago

kyleabeauchamp commented 10 years ago
  1. Convert bzipped XTC files to all-atom HDF5 files, with extra metadata recording the filenames that have already been processed
  2. Strip water from all-atom HDF5 files to create protein HDF5 files
  3. Do this periodically on all local FAH datasets
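
A minimal sketch of the bookkeeping behind step 1, so that a periodic pass only touches new files. The function name `find_unprocessed` and the `*.xtc.bz2` naming pattern are assumptions for illustration, not taken from the PR. The conversion and water-stripping steps themselves are left out.

```python
from pathlib import Path


def find_unprocessed(run_dir, processed_filenames):
    """Return compressed trajectory files in run_dir that are not yet processed.

    `processed_filenames` is whatever was recorded on earlier passes
    (here a plain set of file names; where it is stored is left open).
    The "*.xtc.bz2" pattern is an assumed naming convention.
    """
    run_dir = Path(run_dir)
    # Lexical sort for a reproducible order; real frame numbering may
    # need a numeric sort key instead.
    candidates = sorted(run_dir.glob("*.xtc.bz2"))
    return [p for p in candidates if p.name not in processed_filenames]
```

Each periodic run would call this per dataset directory, convert only the returned files, and then add their names to the processed set.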

jchodera commented 10 years ago

What if we called this fah-tools? Or do you really want a separate repo for "munging"?

kyleabeauchamp commented 10 years ago

Tools is pretty general; right now, the code is just for munging. We can change the name if the scope of the code expands in the future.

jchodera commented 10 years ago

This looks pretty good! The only thing I'd ask for is more documentation in the code about what the various "munging" steps do.

schwancr commented 10 years ago

What about periodic image issues?

kyleabeauchamp commented 10 years ago

I suppose we'll have to add that later, as AFAIK we don't have pbc whole implemented in MDTraj. I'll look into that.

Right now, the key issue is automating the bunzip, which currently takes nearly 15 seconds per WU and makes it nearly impossible to do meaningful real-time analysis / reporting...
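
For reference, the bunzip step can be scripted in-process with Python's standard `bz2` module. This is only a sketch of the automation; the helper name `bunzip` and its arguments are hypothetical, and it makes no claim about reducing the per-WU cost.

```python
import bz2
import shutil


def bunzip(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress a bzip2 file to dst_path in fixed-size chunks.

    Streaming avoids holding the whole decompressed trajectory in memory.
    """
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

Calling `bunzip("frame.xtc.bz2", "frame.xtc")` (file names made up here) writes the decompressed XTC next to the archive, after which it can be loaded as usual.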

kyleabeauchamp commented 10 years ago

@schwancr I just adjusted the stripping function to keep the unitcell information in the protein HDF5, which should allow us to perform downstream PBC changes.

schwancr commented 10 years ago

Yea that sounds like a good idea. Ideally mdtraj will be able to do this in the future, though it's not trivial to implement.

rmcgibbo commented 10 years ago

Has anyone looked at the PBC-whole code in gromacs or ambertools? It might actually not be that complex.

-Robert


schwancr commented 10 years ago

But it doesn't work that well. Their (GROMACS) recipe for doing it involves several calls to the same command-line script, and even then they admit it doesn't work in all cases.

rmcgibbo commented 10 years ago

@kyleabeauchamp: what's the appropriate forum to discuss the provenance metadata storage (e.g. processed_filenames), and the directory structure we want to encourage for FAH projects and mixtape?

I'm not sure that storing extra attributes on the HDF5 files is the best way to go -- if we really want to do that, we should consider simply adding that field to the MDTraj HDF5 format spec. We could also do something more akin to the MSMBuilder 2 design, where a separate metadata file is stored which contains the provenance info. It might be nice, also, not to irreversibly tie this data munging step to the use of HDF5 files for the output.

It would be helpful to get to some consensus on these design choices, especially as we start pushing mixtape for end users.
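
To make the separate-metadata-file option concrete, here is a rough sketch that keeps the provenance info in a JSON sidecar, independent of the trajectory output format. The function names and the `processed_filenames` JSON layout are assumptions for illustration, not an agreed design or the MSMBuilder 2 format.

```python
import json
import os
import tempfile


def load_processed(meta_path):
    """Return the set of processed filenames, or an empty set if no sidecar exists."""
    if not os.path.exists(meta_path):
        return set()
    with open(meta_path) as f:
        return set(json.load(f)["processed_filenames"])


def record_processed(meta_path, new_filenames):
    """Add filenames to the sidecar file.

    Writes to a temp file in the same directory and renames it into
    place, so an interrupted run cannot leave a truncated sidecar.
    """
    processed = load_processed(meta_path) | set(new_filenames)
    directory = os.path.dirname(os.path.abspath(meta_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"processed_filenames": sorted(processed)}, f, indent=2)
    os.replace(tmp_path, meta_path)
```

Because the sidecar is plain JSON, the munging step would not depend on HDF5 attributes, and the same record could sit alongside any output format.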

kyleabeauchamp commented 10 years ago

This is working well enough for now; we will discuss future iterations in issue #2.