Becksteinlab / GromacsWrapper

GromacsWrapper wraps system calls to GROMACS tools into thin Python classes (GROMACS 4.6.5 - 2024 supported).
https://gromacswrapper.readthedocs.org
GNU General Public License v3.0

Split the plugins into a separate repository #82

Closed pslacerda closed 7 years ago

pslacerda commented 8 years ago

We can move the plugins to a separate repository; it will be cleaner and will make it easier to port just the main code to Python 3.

pslacerda commented 8 years ago

Based on namespace packages we can move the plugins into a separate repository while keeping them in the same gromacs namespace.

  1. Remove the gromacs/analysis/plugins/ directory
  2. Delete the line import plugins from gromacs/analysis/__init__.py
  3. Install the whole package
  4. Install the plugins package

Now we can

import gromacs.plugins

from a separate repository =). Or is it better to keep the gromacs.analysis.plugins namespace, because all plugins are for analysis?
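For concreteness, the pkgutil-style approach could look like the sketch below: the same short gromacs/__init__.py placed in both repositories, so that each installed distribution contributes its own subpackages to the shared gromacs namespace. (This is one of several namespace-package mechanisms; pkg_resources-style and PEP 420 native namespace packages are alternatives.)

# gromacs/__init__.py -- identical in BOTH repositories, so 'gromacs'
# becomes a pkgutil-style namespace package (works on Python 2 and 3)
from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)

With this in place, installing the plugins distribution makes import gromacs.plugins work even though the main package lives in a different source tree.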

pslacerda commented 8 years ago

The main importance of the plugins is to enable parallel analysis, right? I heard about some people who parallelized frame-by-frame analysis by splitting the trajectory, submitting a job for each part, and then combining the results. They used Spark to do it on multiple computers, but I won't go that far:

gmx rmsf -b    0 -e 1000 -o rmsf_0 &
gmx rmsf -b 1001 -e 2000 -o rmsf_1 &

With this very simple trick it is possible to enable parallel analysis. The operating system takes care of allocating the resources intelligently. In most cases combining the results is just a simple concatenation, as with the .xvg files above. If I remember correctly from the mailing lists, the GROMACS team is also pursuing trivial analysis parallelization like this by default.

If we do:

def figure_out_length(f):
    # Placeholder: a real implementation would read the trajectory length
    # (e.g. via gmx check); here we just return a fixed value.
    return 1000

def parallel_analysis(tool, njobs, **kwargs):
    begin = kwargs.get('b', 0)
    end = kwargs.get('e')
    if end is None:
        end = figure_out_length(kwargs['f'])

    # Partition [begin, end] into njobs contiguous, non-overlapping parts.
    step = (end - begin) // njobs + 1
    kwargs_list = []
    for count, part_begin in enumerate(range(begin, end, step)):
        part_end = min(part_begin + step - 1, end)
        part_kwargs = kwargs.copy()
        part_kwargs['b'] = part_begin
        part_kwargs['e'] = part_end
        # Expand '%d' placeholders (e.g. o='rmsf%d.xvg') with the part index.
        for key, value in kwargs.items():
            if isinstance(value, str) and '%d' in value:
                part_kwargs[key] = value % count
        kwargs_list.append(part_kwargs)
    return kwargs_list

And then:

>>> parallel_analysis('rmsf', 8, f='traj.xtc', o='rmsf%d.xvg', b=100, input=['3', '3'])
[{'b': 100, 'e': 212, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf0.xvg'},
 {'b': 213, 'e': 325, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf1.xvg'},
 {'b': 326, 'e': 438, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf2.xvg'},
 {'b': 439, 'e': 551, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf3.xvg'},
 {'b': 552, 'e': 664, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf4.xvg'},
 {'b': 665, 'e': 777, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf5.xvg'},
 {'b': 778, 'e': 890, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf6.xvg'},
 {'b': 891, 'e': 1000, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf7.xvg'}]

We get a list of argument dicts, one per partition of the full trajectory. If we run the partial analyses in parallel with each other, we keep all processors busy until the last analysis finishes. Massive trivial parallelization. =) =)
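One possible way to dispatch such a kwargs list is sketched below. The build_cmdline, run_part and run_parallel names are made up for illustration; the sketch assumes gmx is on the PATH and reserves the 'input' key for interactive selections fed via stdin, mirroring the example above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_cmdline(tool, kwargs):
    # Turn a kwargs dict into a 'gmx <tool>' command line; the 'input'
    # key is reserved for interactive responses (fed via stdin instead).
    cmd = ['gmx', tool]
    for key, value in sorted(kwargs.items()):
        if key == 'input':
            continue
        cmd.extend(['-' + key, str(value)])
    return cmd

def run_part(tool, kwargs):
    # Run one partition of the analysis as a subprocess.
    cmd = build_cmdline(tool, kwargs)
    stdin = '\n'.join(kwargs.get('input', [])) or None
    return subprocess.run(cmd, input=stdin, text=True).returncode

def run_parallel(tool, kwargs_list, max_workers=8):
    # Threads are enough here: each worker just waits on a subprocess.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda kw: run_part(tool, kw), kwargs_list))
```

Threads rather than processes suffice because the workers only block on external gmx subprocesses; the operating system still schedules the actual work.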

pslacerda commented 8 years ago

So what remains is to write an .xvg joiner. It is far from being my favorite file format, but it is just a matter of writing the first file fully and then appending the remaining ones with their headers removed.

pslacerda commented 8 years ago

Some analyses, such as RMSD and RMSF, require the same fixed reference (i.e. the -s option) to make sense. Other analyses don't need a reference.

orbeckst commented 8 years ago

I am totally happy to move the plugins into their own namespace. Namespace packages are a bit tricky (I think @dotsdl can attest from datreant). As far as I know, you'd then need to package everything else under a second namespace, e.g. gromacs.core. We might still be able to monkey-patch the tools into the top level, though. So maybe for GW, namespace packages would be useful. We could then also make the fileformats a separate package.

orbeckst commented 8 years ago

Regarding analysis and parallel analysis: We are almost exclusively using MDAnalysis nowadays. Combine it with pandas for time series analysis and plotting with matplotlib/seaborn.

Parallel analysis is still tricky. The blocked trajectory scheme is solid in principle. The main problem seems to be competing disk access – this tends to kill performance and sets a limit on how many workers you can sensibly use.

That said, I am more than happy to include anything in GW that seems to work well – so if you have a suggestion, go for it :-).

pslacerda commented 8 years ago

Probably that's why nobody has parallelized the GROMACS analysis tools. RMSF, for example, also seems I/O bound, at least here.

Maybe namespace packages are tricky because every package needs to declare itself as such? That isn't a problem here, as we will have only one or two separate packages (gromacs.analysis and gromacs.fileformats). Regarding analysis plugins, we could instead ignore metaclasses and just inspect BasePlugin.__subclasses__(). Or we could put them into a different namespace (e.g. gromacsplugins) and create a metaclass that automatically registers analysis plugins inside gromacs while excluding BasePlugin itself:

class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super(PluginRegister, cls).__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            # first class created with this metaclass (BasePlugin itself):
            # start an empty registry and leave the base class out of it
            cls.registry = set()
        else:
            cls.registry.add(cls)

class BasePlugin(object):
    __metaclass__ = PluginRegister  # Python 2 metaclass syntax
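A self-contained Python 3 rendering of the same registry idea (Python 3 replaced the __metaclass__ attribute with the metaclass= class keyword; RMSF and RMSD are hypothetical plugin names for illustration):

```python
class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super().__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            cls.registry = set()   # first class seen (the base): empty registry
        else:
            cls.registry.add(cls)  # every subclass registers itself

class BasePlugin(metaclass=PluginRegister):
    pass

class RMSF(BasePlugin):  # hypothetical plugin, for illustration
    pass

class RMSD(BasePlugin):  # hypothetical plugin, for illustration
    pass

# BasePlugin.registry now holds RMSF and RMSD, but not BasePlugin itself
```

Merely defining a subclass is enough to register it, which is exactly the property we want for drop-in plugins from a separate repository.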

I vote for namespace packages!

pslacerda commented 8 years ago

That example doesn't really seem I/O bound!

orbeckst commented 8 years ago

I definitely support some form of namespace packaging. The S/O post http://stackoverflow.com/questions/1675734/how-do-i-create-a-namespace-package-in-python/1676069#1676069 makes it look pretty straightforward, and it can be done in a way compatible with both Python 2 and 3.3+.

I'd like to hear @dotsdl 's opinion because he went through this for datreant, see datreant/datreant.core#35.

What packages would we have?

EDIT: Perhaps we shouldn't overdo it with packages that only contain a few modules. Something along those lines would work?

pslacerda commented 8 years ago

We can deprecate gromacs.analysis. Or if it's legacy then we just drop it.

recipes, fileformats, management and analysis may each deserve a separate package. But tools, config and utilities can stay in the same repository and don't need monkey patching or anything of the sort.

Like gnuplot, XVG is almost a complete language, and GROMACS usage is very specific: plotting one or two series over time. So almost every XVG reader is incomplete or tool-specific, except xmgrace.

Did you see the new_core branch? I'll do a PR. And in gmxscript there is one useful utility, MDPReader, which can extend basic MDP files on the fly:

grompp(
  f=MDP['sd.mdp', {
    'integrator': 'steep',
    'emtol': 10.0,
    'nsteps': 10000}],
  c='ions.gro',
  o='sd.tpr'
)

Or maybe even without a template file:

 MDP[{
  'integrator': 'steep',
  'emtol': 10.0,
  'nsteps': 10000
}]

Then a function mdp() becomes more elegant.
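The on-the-fly extension idea could be sketched roughly as follows (extend_mdp is a hypothetical name, not the actual MDPReader API from gmxscript): parse 'key = value' lines from a template .mdp file, apply the overrides, and emit the merged parameters.

```python
def extend_mdp(template_path, overrides):
    # Hypothetical sketch of MDP templating: read 'key = value' lines from
    # a template .mdp file, apply overrides, and return the merged text.
    params = {}
    with open(template_path) as f:
        for line in f:
            line = line.split(';')[0].strip()  # drop MDP comments
            if '=' in line:
                key, value = (s.strip() for s in line.split('=', 1))
                params[key] = value
    params.update({k: str(v) for k, v in overrides.items()})
    return '\n'.join('{} = {}'.format(k, v) for k, v in params.items())
```

The merged text could then be written to a temporary .mdp file and passed to grompp, which is essentially what the MDP[...] examples above do in one expression.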

With these changes GromacsWrapper looks more like a set of utilities than a complete library or framework, which is a gain in my opinion.

dotsdl commented 7 years ago

My vote, FWIW: keep it simple. I think cutting out complex analysis entirely is a good idea, especially if this isn't really getting any use these days. I'd rather have the library give an interface in Python to the GROMACS tools and nothing more, since we already have enough things to maintain these days that do just about everything else, but with more flexibility.

From my experience with datreant, I'm thinking of doing the same kind of cutting down to bare essentials there, too, since trying to do everything means maintaining lots of non-general-purpose code, and there are only so many hours in the day.

orbeckst commented 7 years ago

I think everyone is in agreement there. Just needs to be done...

-- Oliver Beckstein email: orbeckst@gmail.com

On Feb 18, 2017, at 12:08, David Dotson notifications@github.com wrote:

My vote, FWIW: keep it simple. I think cutting out complex analysis entirely is a good idea, especially if this isn't really getting any use these days

dotsdl commented 7 years ago

@orbeckst in that case can I focus on this for an afternoon or so? It'll be a massive PR (mostly removing an entire chunk of the library), but it will really help me finish up #44 so we can move on.

orbeckst commented 7 years ago

Yes, please do.

Can you dump what you cut out into a separate repo? It's going to look like a junk yard, but it will give us a way to go back to it if we ever need to (without digging into the history).

orbeckst commented 6 years ago

Updated https://github.com/Becksteinlab/GromacsWrapper/wiki/Analysis-plugins