
fahmunge

A tool for automated processing of Folding@home data to produce mdtraj-compatible trajectory sets.

Authors

Supported FAH cores

Installation

The easiest way to install fahmunge and its dependencies is via conda (preferably miniconda):

conda install --yes -c omnia fahmunge

Usage

Basic Usage

Basic usage simply specifies a project CSV file and an output path for the munged data:

munge-fah-data --projects projects.csv --outpath /data/choderalab/fah/munged3 --nprocesses 16 --validate

The metadata for FAH projects is stored in a CSV file located on the choderalab FAH servers at:

/data/choderalab/fah/Software/FAHMunge/projects.csv

This file specifies the project number, the location of the FAH data, a reference PDB file (or files) to be used for munging, and the MDTraj DSL topology selection to be used for extracting solute coordinates of interest.

For example:

project,location,pdb,topology_selection
"10491","/home/server.140.163.4.245/server2/data/SVR2359493877/PROJ10491/","/home/server.140.163.4.245/server2/projects/GPU/p10491/topol-renumbered-explicit.pdb","not water"
"10492","/home/server.140.163.4.245/server2/data/SVR2359493877/PROJ10492/","/home/server.140.163.4.245/server2/projects/GPU/p10492/topol-renumbered-explicit.pdb","not (water or resname NA or resname CL)"
"10495","/home/server.140.163.4.245/server2/data/SVR2359493877/PROJ10492/","/home/server.140.163.4.245/server2/projects/GPU/p10495/MTOR_HUMAN_D0/RUN%(run)d/system.pdb","not (water or resname NA or resname CL)"

The pdb field points the pipeline to a PDB file used to define the atom numbering in the munged data. The first two lines are examples of using a single PDB for all RUNs in the project. The third line shows how to use a different PDB for each RUN: %(run)d is substituted with the run number via filename % vars() in Python, which allows run numbers or other local Python variables to be inserted into the path. Substitution is performed only on a per-run basis, not per-clone.
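
To make the substitution concrete, here is a minimal Python sketch of the mechanism described above (the template path is taken from the third example line; the loop and variable names are illustrative, not FAHMunge's actual code):

# Illustrative sketch of per-RUN PDB path substitution; not the actual FAHMunge code.
pdb_template = "/home/server.140.163.4.245/server2/projects/GPU/p10495/MTOR_HUMAN_D0/RUN%(run)d/system.pdb"

for run in range(3):  # hypothetical RUN indices 0, 1, 2
    # '%(run)d' is filled from the local variable 'run' supplied by vars()
    pdb_filename = pdb_template % vars()
    print(pdb_filename)  # .../RUN0/system.pdb, .../RUN1/system.pdb, .../RUN2/system.pdb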

The projects CSV file will undergo minimal validation automatically to make sure all data and file paths can be found.
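
As a rough illustration of what this validation amounts to (a sketch only, assuming the CSV columns shown above; this is not FAHMunge's actual validation code):

# Sketch of minimal projects.csv validation; not the actual FAHMunge implementation.
# Checks that each project's data directory and reference PDB path can be found.
import csv
import os

def validate_projects(csv_path):
    with open(csv_path) as infile:
        for row in csv.DictReader(infile):
            if not os.path.isdir(row["location"]):
                raise ValueError("Cannot find FAH data directory: %s" % row["location"])
            # Per-RUN PDB templates containing '%(run)d' can only be checked after substitution,
            # so only plain paths are verified here.
            if "%(" not in row["pdb"] and not os.path.isfile(row["pdb"]):
                raise ValueError("Cannot find reference PDB file: %s" % row["pdb"])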

Advanced Usage

More advanced usage allows additional arguments to be specified, such as --time, --nprocesses, and --validate (see the server usage example below).

Usage on choderalab Folding@home servers

  1. Log in to the work server using the usual FAH login
  2. Check whether the script is already running (screen -r -d). If it is, stop here.
  3. Start a screen session
  4. Run with: munge-fah-data --projects /data/choderalab/fah/projects.csv --outpath /data/choderalab/fah/munged-data --time 600 --nprocesses 16
  5. To stop, press Ctrl-C while the script is in the "sleep" phase

How it works

Overall Pipeline (Core17/18):

  1. Extract XTC data from the bzip2 archives
  2. Append all-atom coordinates and filenames to an HDF5 file
  3. Extract protein coordinates and filenames from the all-atom HDF5 file into a second HDF5 file (see the sketch below)
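
A rough Python sketch of step 3, using MDTraj to strip the all-atom trajectory down to the atoms matched by the topology selection (file names here are hypothetical, and this is not the actual FAHMunge implementation):

# Illustrative sketch of step 3: extract solute coordinates from an all-atom HDF5
# trajectory using the MDTraj DSL topology_selection (e.g. "not water").
# Not the actual FAHMunge code; file names are hypothetical.
import mdtraj as md

allatom_traj = md.load("all-atoms/run0-clone0.h5")        # munged all-atom trajectory
atom_indices = allatom_traj.topology.select("not water")  # selection string from projects.csv
stripped_traj = allatom_traj.atom_slice(atom_indices)
stripped_traj.save_hdf5("no-solvent/run0-clone0.h5")      # second, solute-only HDF5 file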

Efficiency considerations

The rate-limiting step appears to be the bunzip.
If we can avoid having the trajectories double-bzipped by the client, this will speed things up immensely.
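
For reference, the bunzip step amounts to something like the following (a sketch with hypothetical file names, using Python's standard bz2 module rather than FAHMunge's actual code):

# Sketch of decompressing a bzip2-compressed XTC chunk; hypothetical file names.
import bz2
import shutil

with bz2.open("frame0.xtc.bz2", "rb") as compressed, open("frame0.xtc", "wb") as xtc:
    shutil.copyfileobj(compressed, xtc)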

Nightly syncing to hal.cbio.mskcc.org

Munged no-solvent data is rsynced nightly from plfah1 and plfah2 to hal.cbio.mskcc.org via the choderalab robot user account to:

/cbio/jclab/projects/fah/fah-data/munged

This is done via a crontab:

# kill any rsyncs already in progress
42 00 * * * skill rsync
# munged3
04 01 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah1.mskcc.org:/data/choderalab/fah/munged2/no-solvent /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah1-rsync3-no-solvent.log 2>&1
38 02 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah2.mskcc.org:/data/choderalab/fah/munged2/no-solvent /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah2-rsync3-no-solvent.log 2>&1
34 03 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah1.mskcc.org:/data/choderalab/fah/munged2/all-atoms /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah1-rsync3-all-atoms.log 2>&1
50 03 * * * rsync -av --append-verify --bwlimit=1000 --chmod=g-w,g+r,o-w,o+r server@plfah2.mskcc.org:/data/choderalab/fah/munged2/all-atoms /cbio/jclab/projects/fah/fah-data/munged3 >> $HOME/plfah2-rsync3-all-atoms.log 2>&1

To install this crontab as the choderalab user:

crontab ~/crontab

To list the active crontab:

crontab -l

Transfers are logged in the choderalab account:

plfah1-rsync3-all-atoms.log
plfah1-rsync3-no-solvent.log
plfah2-rsync3-all-atoms.log
plfah2-rsync3-no-solvent.log

Acknowledgements

Project skeleton based on the Computational Chemistry Python Cookiecutter