Closed pslacerda closed 7 years ago
Based on namespace packages, we can move the plugins into a separate repository while keeping them in the same gromacs namespace: take the gromacs/analysis/plugins/ directory out and drop the `import plugins` from gromacs/analysis/__init__.py. Now we can `import gromacs.plugins` from a separate repository =). Or is it better to keep the gromacs.analysis.plugins namespace, because all plugins are for analysis?
The main point of the plugins is to enable parallel analysis, right? I heard of people who parallelized frame-by-frame analysis by splitting the trajectories, submitting a job for each part, and then combining the results. They used Spark to do it on multiple computers, but I wouldn't go that far:
gmx rmsf -b 0 -e 1000 -o rmsf_0 &
gmx rmsf -b 1001 -e 2000 -o rmsf_1 &
With this very simple trick it is possible to run analyses in parallel; the operating system takes care of allocating the resources intelligently. In most cases combining the results is just a simple concatenation, as with the .xvg files above. If I remember correctly from the mailing lists, the Gromacs team is also pursuing trivial analysis parallelization like this by default.
If we do:
def figure_out_length(f):
    # stub: would inspect the trajectory file to find its last frame time
    return 1000

def parallel_analysis(tool, njobs, **kwargs):
    begin = kwargs.get('b', 0)
    end = kwargs.get('e', None)
    if end is None:
        end = figure_out_length(kwargs['f'])
    kwargs_list = []
    count = 0
    for part_begin in range(begin, end, (end - begin) // njobs + 1):
        part_end = part_begin + ((end - begin) // njobs) - 1
        if part_end > end:
            part_end = end
        part_kwargs = kwargs.copy()
        part_kwargs['b'] = part_begin
        part_kwargs['e'] = part_end
        # expand '%d' templates (e.g. o='rmsf%d.xvg') with the part index
        for key, value in kwargs.items():
            if isinstance(value, str) and '%d' in value:
                part_kwargs[key] = value % count
        kwargs_list.append(part_kwargs)
        count = count + 1
    return kwargs_list
And then:
>>> parallel_analysis('rmsf', 8, f='traj.xtc', o='rmsf%d.xvg', b=100, input=['3', '3'])
[{'b': 100, 'e': 211, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf0.xvg'},
{'b': 213, 'e': 324, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf1.xvg'},
{'b': 326, 'e': 437, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf2.xvg'},
{'b': 439, 'e': 550, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf3.xvg'},
{'b': 552, 'e': 663, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf4.xvg'},
{'b': 665, 'e': 776, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf5.xvg'},
{'b': 778, 'e': 889, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf6.xvg'},
{'b': 891, 'e': 1000, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf7.xvg'}]
We get a list of argument dicts, each one covering a partition of the full trajectory. If we run the partitioned analyses in parallel with each other, we guarantee that the machine keeps all processors busy until the end of the last analysis. Massive trivial parallelization. =) =)
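To actually execute the partitions concurrently, something like the following sketch could work. Everything here is an assumption for illustration: build_command is a made-up helper that turns one kwargs dict from parallel_analysis() into a gmx command line, and the runner callable is injected (it would be subprocess.call in real use) so the sketch can be tried without Gromacs installed:

```python
from concurrent.futures import ThreadPoolExecutor

def build_command(tool, part_kwargs):
    # Hypothetical: turn one kwargs dict into a "gmx <tool> -b ... -e ..."
    # command line. The interactive 'input' selections would be piped to
    # stdin rather than passed as a flag, so they are skipped here.
    cmd = ["gmx", tool]
    for key in sorted(part_kwargs):
        if key == "input":
            continue
        cmd += ["-" + key, str(part_kwargs[key])]
    return cmd

def run_all(tool, kwargs_list, runner, njobs=8):
    # runner would be subprocess.call in practice; a thread pool is enough
    # because the work happens in external gmx processes, not in Python.
    with ThreadPoolExecutor(max_workers=njobs) as pool:
        return list(pool.map(lambda kw: runner(build_command(tool, kw)),
                             kwargs_list))
```

With a real runner, this is essentially the shell `&` trick above with a bound on the number of simultaneous jobs.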
So what remains is an .xvg joiner. It is far from being my favorite file format, but it is just a matter of writing the first file in full and then writing the remaining ones with the header removed.
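A minimal sketch of such a joiner, assuming the usual .xvg conventions that '#' lines are comments and '@' lines are xmgrace directives:

```python
def join_xvg(parts, output):
    # Write the first file in full; for the rest, skip the '#' comment and
    # '@' xmgrace-directive header lines and append only the data rows.
    with open(output, "w") as out:
        for i, path in enumerate(parts):
            with open(path) as part:
                for line in part:
                    if i > 0 and line.startswith(("#", "@")):
                        continue
                    out.write(line)
```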
Some analyses, such as RMSD and RMSF, require the same fixed reference (i.e. the -s option) to make sense. Other analyses don't need a reference.
I am totally happy to move the plugins into their own namespace. Namespace packages are a bit tricky (I think @dotsdl can attest from datreant). As far as I know, you'd then need to package everything else under a second namespace, e.g. gromacs.core. We might still be able to monkey-patch the tools into the top level, though. So maybe for GW, namespace packages would be useful. We could then also make the fileformats a separate package.
Regarding analysis and parallel analysis: We are almost exclusively using MDAnalysis nowadays. Combine it with pandas for time series analysis and plotting with matplotlib/seaborn.
Parallel analysis is still tricky. The blocked trajectory scheme is solid in principle. The main problem seems to be competing disk access – this tends to kill performance and sets a limit on how many workers you can sensibly use.
That said, I am more than happy to include anything in GW that seems to work well – so if you have a suggestion, go for it :-).
Probably because of this, nobody has parallelized the Gromacs analysis tools. RMSF, for example, also seems I/O bound, at least here.
Maybe namespace packages are tricky because every package needs to declare itself as such? That isn't a problem here, as we will have only one or two separate packages (gromacs.analysis and gromacs.fileformats). Regarding analysis plugins, we can instead ignore metaclasses and just inspect BasePlugin.__subclasses__(). Or we could put them into a different namespace (e.g. gromacsplugins) and create a metaclass that automatically monkey-patches analysis plugins into an object inside gromacs but leaves out BasePlugin itself:
class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super(PluginRegister, cls).__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            # the first class created (BasePlugin itself) sets up the
            # registry and is thereby excluded from it
            cls.registry = set()
        else:
            cls.registry.add(cls)

class BasePlugin(object):
    __metaclass__ = PluginRegister  # Python 2 style metaclass declaration
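A self-contained Python 3 version of this idea (Python 3 passes the metaclass as a keyword argument instead of the __metaclass__ attribute; RMSFPlugin and RMSDPlugin are made-up example plugins):

```python
class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super().__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            cls.registry = set()   # base class creates the shared registry
        else:
            cls.registry.add(cls)  # every subclass registers itself

class BasePlugin(metaclass=PluginRegister):
    pass

class RMSFPlugin(BasePlugin):
    pass

class RMSDPlugin(BasePlugin):
    pass
```

Because the base class takes the `registry` branch and the subclasses take the `add` branch, BasePlugin.registry ends up holding exactly the plugin subclasses with no special-casing of BasePlugin needed.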
I vote for namespace packages!
That example doesn't really seem I/O bound!
I definitely support some form of namespace packaging. The S/O post http://stackoverflow.com/questions/1675734/how-do-i-create-a-namespace-package-in-python/1676069#1676069 makes it look pretty straightforward and it can be done in a Python 2 and 3.3+ compatible way.
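The pkgutil-style approach from that S/O answer can be seen in action with a quick sketch (the dist_core/dist_plugins layout and the NAME attributes are made up for illustration): two independent directories both ship a gromacs package with the same one-line __init__.py, and Python merges their subpackages.

```python
import os
import sys
import tempfile

# Each distribution's gromacs/__init__.py contains only this line:
NS_INIT = "__path__ = __import__('pkgutil').extend_path(__path__, __name__)\n"

root = tempfile.mkdtemp()
for dist, sub in [("dist_core", "core"), ("dist_plugins", "plugins")]:
    pkg = os.path.join(root, dist, "gromacs")
    os.makedirs(os.path.join(pkg, sub))
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write(NS_INIT)
    with open(os.path.join(pkg, sub, "__init__.py"), "w") as f:
        f.write("NAME = %r\n" % sub)
    sys.path.insert(0, os.path.join(root, dist))

# Both subpackages import even though they live in different directories:
import gromacs.core
import gromacs.plugins
```

extend_path() scans sys.path for other directories that also contain a gromacs package and appends them to __path__, which is what makes the split possible in a Python 2 and 3 compatible way.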
I'd like to hear @dotsdl 's opinion because he went through this for datreant, see datreant/datreant.core#35.
What packages would we have?
- gromacs.core: main functionality (tools, config, utilities, ...), creates gromacs.grompp etc. by monkey patching
- gromacs.fileformats: file format readers like the XVG reader which are independent from the rest; they might have dependencies on utilities, so it might not be easily feasible to make this independent
- gromacs.recipes: need to find a better name, but basically things like setup, cbook, scaling, ... anything else that can be considered building blocks for workflows but which might not be used by everyone (e.g., many power users like to write their own system setup code); most of the gw-* scripts would be installed by this package
- gromacs.management: need a better name... but basically manager and qsub: these are lightweight attempts at workflow management. qsub is being used e.g. in MDPOW and needs to remain available.
- gromacs.plugins: the legacy analysis plugins; I am not sure if this is even used by anyone so I would not spend too much time making it nice... as long as it works it will be fine. (I use gw-fit_strip_trajectories.py, which uses one of the plugins, but for pretty much everything else I have been using MDAnalysis.)

EDIT: Perhaps we shouldn't overdo it with packages that only contain a few modules. Something along

- gromacs.core (including fileformats)
- gromacs.toolbox (including setup, cbook, manager, qsub)
- gromacs.analysis (or plugins?)

would work?
We can deprecate gromacs.analysis. Or if it's legacy, then we just drop it.
recipes, fileformats, management and analysis may each deserve a separate package. But tools, config and utilities can live in the same repository and don't need monkey patching or anything like that.
Like gnuplot, XVG is an almost complete language, while Gromacs uses it very narrowly to plot one or two series over time. So almost every XVG reader is incomplete or tool-specific, except xmgrace itself.
Did you see the new_core branch? I'll do a PR. And in gmxscript there is a useful utility, MDPReader; it can extend basic MDP files on the fly:
grompp(
    f=MDP['sd.mdp', {
        'integrator': 'steep',
        'emtol': 10.0,
        'nsteps': 10000}],
    c='ions.gro',
    o='sd.tpr'
)
Or maybe even without a template file:
MDP[{
    'integrator': 'steep',
    'emtol': 10.0,
    'nsteps': 10000
}]
Then a function mdp() becomes more elegant.
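I don't know MDPReader's exact API, but the idea can be sketched like this (hypothetical mdp() helper, not the real gmxscript code: read an optional template .mdp, override parameters from keyword arguments, write the result to a fresh file and return its path):

```python
import tempfile

def mdp(template=None, **overrides):
    # Hypothetical sketch: parse "key = value" lines (';' starts a comment
    # in .mdp files), apply the overrides, write a new .mdp file.
    params = {}
    if template is not None:
        with open(template) as f:
            for line in f:
                line = line.split(";", 1)[0]  # strip trailing comments
                if "=" in line:
                    key, value = line.split("=", 1)
                    params[key.strip()] = value.strip()
    params.update(overrides)
    out = tempfile.NamedTemporaryFile("w", suffix=".mdp", delete=False)
    with out:
        for key, value in params.items():
            out.write("%s = %s\n" % (key, value))
    return out.name
```

The returned path could then be passed straight to grompp(f=...), which is essentially what the MDP[...] syntax above would do behind the scenes.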
With these changes GromacsWrapper looks more like utilities than a complete library or framework, which is a gain in my opinion.
My vote, FWIW: keep it simple. I think cutting out complex analysis entirely is a good idea, especially if this isn't really getting any use these days. I'd rather have the library give an interface in Python to the GROMACS tools and nothing more, since we already have enough things to maintain these days that do just about everything else, but with more flexibility.
From my experience with datreant
, I'm thinking of doing the same kind of cutting down to bare essentials there, too, since trying to do everything means maintaining lots of non-general-purpose code, and there are only so many hours in the day.
I think everyone is in agreement there. Just needs to be done...
-- Oliver Beckstein email: orbeckst@gmail.com
On Feb 18, 2017, at 12:08, David Dotson notifications@github.com wrote:
> My vote, FWIW: keep it simple. I think cutting out complex analysis entirely is a good idea, especially if this isn't really getting any use these days
@orbeckst in that case can I focus on this for an afternoon or so? It'll be a massive PR (mostly removing an entire chunk of the library), but it will really help me finish up #44 so we can move on.
Yes, please do.
Can you dump what you cut out into a separate repo? It's going to look like a junk yard, but it will give us a way to go back to it if we ever need to (without digging through the history).
We can move the plugins to a separate repository; it will be cleaner and will make it easier to move just the main code to Python 3.