alchemistry / alchemlyb

the simple alchemistry library

https://alchemlyb.readthedocs.io

BSD 3-Clause "New" or "Revised" License

197 stars 51 forks source link

API proposal #6

Closed dotsdl closed 7 years ago

dotsdl commented 7 years ago

The following is an API proposal for the library. This proposal has been prototyped, with some of the components described already implemented at a basic level. This functionality is demoed in this gist.

`alchemlyb`

alchemlyb is a library that seeks to make doing alchemical free energy calculations easier and less error prone. It will include functions for parsing data from formats common to existing MD engines, subsampling these data, and fitting these data with an estimator to obtain free energies. These functions will be simple in usage and pure in scope, and can be chained together to build customized analyses of data.

alchemlyb seeks to be as boring and simple as possible to enable more complex work. Its components allow work at all scales, from use on small systems using a single workstation to larger datasets that require distributed computing using libraries such as dask.

Core philosophy

Use functions when possible, classes only when necessary (or for estimators, see (2)).
For estimators, mimic the scikit-learn API as much as possible.
Aim for a consistent interface throughout, e.g. all parsers take similar inputs and yield a common set of outputs.

API components

The library is structured as follows, following a similar style to scikit-learn:

alchemlyb
|
 -- parsing
 |  |
 |   -- gmx
 |  |
 |   -- amber
 |  |
 |   -- openmm
 |  |
 |   -- namd
 |
  -- preprocessing
 |  |
 |   -- subsampling
 |
  -- estimators
    |
     -- mbar_
    |
     -- ti_

The parsing submodule contains parsers for individual MD engines, since the output files needed to perform alchemical free energy calculations vary widely and are not standardized. Each module at the very least provides an extract_u_nk function for extracting reduced potentials (needed for MBAR), as well as an extract_DHdl function for extracting derivatives required for thermodynamic integration. Other helper functions may be exposed for additional processing, such as generating an XVG file from an EDR file in the case of GROMACS. All extract_* functions take similar arguments (a file path, parameters such as temperature), and produce standard outputs (pandas.DataFrames for reduced potentials, pandas.Series for derivatives).

The preprocessing submodule features functions for subsampling timeseries, as may be desired before feeding them to an estimator. So far, these are limited to slicing, statistical_inefficiency, and equilibrium_detection functions, many of which make use of subsampling schemes available from pymbar. These functions are written in such a way that they can be easily composed as parts of complex processing pipelines.

The estimators module features classes a la scikit-learn that can be initialized with parameters that determine their behavior and then "trained" on a fit method. So far, MBAR has been partially implemented, and because the numerical heavy-lifting is already well-implemented in pymbar.MBAR, this class serves to give an interface that will be familiar and consistent with the others. Thermodynamic integration is not yet implemented.

The convergence submodule will feature convenience functions/classes for doing convergence analysis using a given dataset and a chosen estimator, though the form of this is not yet thought-out. However, the gist shows an example for how this can be done already in practice.

All of these components lend themselves well to writing clear and flexible pipelines for processing data needed for alchemical free energy calculations, and furthermore allow for scaling up via libraries like dask or joblib.

Development model

This is an open-source project, the hope of which is to produce a library with which the community is happy. To enable this, the library will be a community effort. Development is done in the open on GitHub, with a Gitter channel for discussion among developers for fast turnaround on ideas. Software engineering best-practices will be used throughout, including continuous integration testing via Travis CI, up-to-date documentation, and regular releases.

David Dotson (@dotsdl) is employed as a software engineer by Oliver Beckstein (@orbeckst), and this project is a primary point of focus for him in this position. Ian Kenney (@ianmkenney) and Hannes Loeffler (@halx) have also expressed interest in direct development.

Following discussion, refinement, and consensus on this proposal, issues for each need will be posted and work will begin on filling out the rest of the library. In particular, parsers will be crowdsourced from the existing community and refined into the consistent form described above. Expertise in ensuring theoretical correctness of each component, in particular estimators, will be needed from David Mobley (@davidmobley), John Chodera (@jchodera), and Michael Shirts (@mrshirts).

davidlmobley commented 7 years ago

I'll look at this in more detail next week after I finish recovering from being sick. However, @limn1 may also have an interest in this topic.

dotsdl commented 7 years ago

@davidlmobley no worries, I think we'll probably give this at least a week's worth of discussion before moving forward. I think it's important we like the direction we're heading before we strike out on one. Let me know if there's anything I should clarify further, since I'm sure there are gaps in my description that I'm not aware of.

halx commented 7 years ago

Many thanks for all you work!

I would, however, too need some time to look into this in detail. Just a few first notes.

When you say "doing alchemical free energy calculations easier" I suppose you are referring to analysis? We need to be clear on what the parsers pass back to the main code. We need to be sure what assumption we can and cannot make. The current code is obviously made with Gromacs in mind but other codes possibly work (very) differently. E.g. Gromacs write gradients and BAR/MBAR data on the same time step but AMBER allows doing that independently. Whether that flexibility makes much sense or not is not really relevant when a user has created their output in that way. We can also not assume that MD codes provide both TI and FEP data at the same time. There are codes where this is an either-or decision (NAMD), either variant can be be switched off (I think also in Gromacs?) or may not provide some information at all. What is the point of each parser having helper functions? Why not return the data structures directly? And if helper functions that would point to using classes but that's what you want to avoid? What comes after estimator? There should be some minimal form of post-processing. I would want to see the raw data in all possible forms (some data is MD code specific so parser should have parser specific options, e.g. see my AMBER parser). Maybe also basic plotting as it is done now but generally I would advocate against sophisticated plotting functionality. Costs a lot of time and full flexibility can only be achieved with one's own scripting and one's own choice of plotting libraries, etc. We also need a reference implementation of a user program. Do we have a NAMD parser? I just recently started to work on this (for TI only for the time) but I then I remembered that @davidlmobley said that his group would have/work on one?

dotsdl commented 7 years ago

Hey @halx, great questions! I'll try to answer each of them:

When you say "doing alchemical free energy calculations easier" I suppose you are referring to analysis?

Yes, we're only talking about taking already-existing raw data obtained from an MD engine, processing these data, and getting free energy estimates from them. Setting up or running simulations to produce these data is outside the scope of alchemlyb.

We need to be clear on what the parsers pass back to the main code. We need to be sure what assumption we can and cannot make. The current code is obviously made with Gromacs in mind but other codes possibly work (very) differently. E.g. Gromacs write gradients and BAR/MBAR data on the same time step but AMBER allows doing that independently. Whether that flexibility makes much sense or not is not really relevant when a user has created their output in that way.

By main code, do you mean the estimators? The parsers don't pass anything to the estimators themselves; they only take data files as input (which as you note, will vary widely across MD packages) and yield a specific quantity (e.g. reduced potentials, gradients) in a common data structure. The idea here is that we decouple the source of data (MD-engine-specific data files) from analysis of that data. Things like estimators and preprocessing functions don't need to know anything about what generated the data; MD-engine-specific concerns are confined to parser code.

We can also not assume that MD codes provide both TI and FEP data at the same time. There are codes where this is an either-or decision (NAMD), either variant can be be switched off (I think also in Gromacs?) or may not provide some information at all.

This is fine under the current scheme. alchemlyb is a library, so if you didn't generate e.g. what you need for TI, then you wouldn't use anything related to TI from the library. All parsers should definitely have good exceptions with informative messages telling a user whether or not their data is in the right form to be handled upfront, though.

What is the point of each parser having helper functions? Why not return the data structures directly?

The alchemlyb.parsers.gmx module has a generate_xvg function that simply takes an EDR file and writes out an XVG file, which the extract_* functions need. This is purely a convenience function, and we could remove it. The point, I think, is that these parsing modules should provide to a reasonable degree what a user needs to go from data files to data structures, and the conveniences that help that along will vary by MD engine. Ideally, no convenience functions like generate_xvg would be needed, but I think a case can be made for giving them.

And if helper functions that would point to using classes but that's what you want to avoid?

I think for parsing we should definitely stick to functions yielding standard numerical data structures, specifically pandas DataFrames and Series. I don't believe it makes sense to use classes here.

What comes after estimator? There should be some minimal form of post-processing. I would want to see the raw data in all possible forms (some data is MD code specific so parser should have parser specific options, e.g. see my AMBER parser).

After the estimator, one could do all sorts of things, like convergence analysis. My gist shows one form of this, and I believe conveniences should be built for doing this kind of thing, though I don't yet know what they should look like. pymbar.MBAR also exposes a lot of additional functionality for understanding and using the resulting free energies from MBAR, and we could use this as a guide for what we want exposed by alchemlyb generally.

Maybe also basic plotting as it is done now but generally I would advocate against sophisticated plotting functionality. Costs a lot of time and full flexibility can only be achieved with one's own scripting and one's own choice of plotting libraries, etc.

Agreed, though I think plotting should be avoided entirely, and ideally the data structures we're using themselves make plotting straightforward (pandas objects, numpy arrays). Plotting code comes with a lot of personal preference, making it susceptible to bikeshedding, so I think it best if we focus our limited energy and time on data processing itself.

We also need a reference implementation of a user program.

I think the idea was that alchemical-analysis would be refactored to use alchemlyb components, simplifying it internally and making it double as a useful example of how one can use alchemlyb in practice.

Do we have a NAMD parser?

We don't, but we need one! If you have parsing code, clone the project, make an alchemlyb.parsing.namd module in the source tree, and start by dumping the code directly in. Open a pull request with "WIP" (work-in-progress) in the title and we can get started on molding it into the common form we agree upon for parsers here.

mrshirts commented 7 years ago

I have a NAMD user -- JC Gumbart at CalTech. I even have some datafiles from him now asking to adapt alchemical-analysis.

On Mon, Jan 9, 2017 at 11:41 AM, David Dotson notifications@github.com wrote:

Hey @halx https://github.com/halx, great questions! I'll try to answer each of them:

When you say "doing alchemical free energy calculations easier" I suppose you are referring to analysis?

Yes, we're only talking about taking already-existing raw data obtained from an MD engine, processing these data, and getting free energy estimates from them. Setting up or running simulations to produce these data is outside the scope of alchemlyb.

We need to be clear on what the parsers pass back to the main code. We need to be sure what assumption we can and cannot make. The current code is obviously made with Gromacs in mind but other codes possibly work (very) differently. E.g. Gromacs write gradients and BAR/MBAR data on the same time step but AMBER allows doing that independently. Whether that flexibility makes much sense or not is not really relevant when a user has created their output in that way.

By main code, do you mean the estimators? The parsers don't pass anything to the estimators themselves; they only take data files as input (which as you note, will vary widely across MD packages) and yield a specific quantity (e.g. reduced potentials, gradients) in a common data structure. The idea here is that we decouple the source of data (MD-engine-specific data files) from analysis of that data. Things like estimators and preprocessing functions don't need to know anything about what generated the data; MD-engine-specific concerns are confined to parser code.

We can also not assume that MD codes provide both TI and FEP data at the same time. There are codes where this is an either-or decision (NAMD), either variant can be be switched off (I think also in Gromacs?) or may not provide some information at all.

This is fine under the current scheme. alchemlyb is a library, so if you didn't generate e.g. what you need for TI, then you wouldn't use anything related to TI from the library. All parsers should definitely have good exceptions with informative messages telling a user whether or not their data is in the right form to be handled upfront, though.

What is the point of each parser having helper functions? Why not return the data structures directly?

The alchemlyb.parsers.gmx module has a generatexvg function that simply takes an EDR file and writes out an XVG file, which the extract* functions need. This is purely a convenience function, and we could remove it. The point, I think, is that these parsing modules should provide to a reasonable degree what a user needs to go from data files to data structures, and the conveniences that help that along will vary by MD engine. Ideally, no convenience functions like generate_xvg would be needed, but I think a case can be made for giving them.

And if helper functions that would point to using classes but that's what you want to avoid?

I think for parsing we should definitely stick to functions yielding standard numerical data structures, specifically pandas DataFrames and Series. I don't believe it makes sense to use classes here.

What comes after estimator? There should be some minimal form of post-processing. I would want to see the raw data in all possible forms (some data is MD code specific so parser should have parser specific options, e.g. see my AMBER parser).

After the estimator, one could do all sorts of things, like convergence analysis. My gist https://gist.github.com/dotsdl/a41e5756a58e1775e3e3a915f07bfd37 shows one form of this, and I believe conveniences should be built for doing this kind of thing, though I don't yet know what they should look like. pymbar.MBAR http://pymbar.readthedocs.io/en/master/mbar.html also exposes a lot of additional functionality for understanding and using the resulting free energies from MBAR, and we could use this as a guide for what we want exposed by alchemlyb generally.

Maybe also basic plotting as it is done now but generally I would advocate against sophisticated plotting functionality. Costs a lot of time and full flexibility can only be achieved with one's own scripting and one's own choice of plotting libraries, etc.

Agreed, though I think plotting should be avoided entirely, and ideally the data structures we're using themselves make plotting straightforward ( pandas objects, numpy arrays). Plotting code comes with a lot of personal preference, making it susceptible to bikeshedding, so I think it best if we focus our limited energy and time on data processing itself.

We also need a reference implementation of a user program.

I think the idea was that alchemical-analysis https://github.com/MobleyLab/alchemical-analysis would be refactored to use alchemlyb components, simplifying it internally and making it double as a useful example of how one can use alchemlyb in practice.

Do we have a NAMD parser?

We don't, but we need one! If you have parsing code, clone the project, make an alchemlyb.parsing.namd module in the source tree, and start by dumping the code directly in. Open a pull request with "WIP" (work-in-progress) in the title and we can get started on molding it into the common form we agree upon for parsers here.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alchemistry/alchemlyb/issues/6#issuecomment-271367597, or mute the thread https://github.com/notifications/unsubscribe-auth/AEE31B80z1vllD2QG7PyFJgXBrKOwXpNks5rQn9PgaJpZM4LdVir .

nathanmlim commented 7 years ago

@vtlim and myself have written up a NAMD parser. But we setup our FEP calculations a bit differently than normal, ~~so I will have to restructure a bit of the code before posting up on here.~~

Edit: I'm just going to link our repo for it here. Maybe you can take a look at it to see what bits you can make use of?

davidlmobley commented 7 years ago

@dotsdl - comments on your initial proposal follow.

A couple thoughts/questions I had:

How are you expecting to handle units? I realize that units are not attached to numbers at input, but it strikes me that we should be attaching them for what we return, via a library (pint? or unit in simtk)?
The spec notes TI and MBAR; will we perhaps also need BAR for some applications? What about Zwanzig (exponential averaging)? (For example, I might calculate hydration free energies in implicit solvent by using only a single gas phase simulation and re-evaluating energies in explicit solvent, so I'd want Zwanzig for that.)

Relating to convergence:

However, the gist shows an example for how this can be done already in practice.

I still require some education as to some things github/open source/etc. What gist are you referring to and where is it?

Relating to gitter:

with a Gitter channel for discussion among developers for fast turnaround on ideas Is this live now? I know I gave some permissions. I'm assuming I would get some sort of a notice? (Sorry, never used gitter before.)

Will we be trying to provide an extensive set of examples (which ideally would be tested via travis) for the different codes along with this, I would hope?

Should we draft a [WIP] pull request to bring these ideas into a main README.md for the library or something similar?

Also, my student @limn1 posted a link to some code he has going which could form the basis of a NAMD parser or help us or you write one once the API is more clear. Should we break this out into a separate issue?

Aside from this somewhat random collection of questions/issues/comments, I'm quite happy with the overall proposal. I'll respond to some of the other comments separately.

davidlmobley commented 7 years ago

I'll also try to reply to the comment from @halx and the response by @dotsdl :

We need to be clear on what the parsers pass back to the main code. We need to be sure what assumption we can and cannot make. The current code is obviously made with Gromacs in mind but other codes possibly work (very) differently. E.g. Gromacs write gradients and BAR/MBAR data on the same time step but AMBER allows doing that independently. Whether that flexibility makes much sense or not is not really relevant when a user has created their output in that way. By main code, do you mean the estimators? The parsers don't pass anything to the estimators themselves; they only take data files as input (which as you note, will vary widely across MD packages) and yield a specific quantity (e.g. reduced potentials, gradients) in a common data structure. The idea here is that we decouple the source of data (MD-engine-specific data files) from analysis of that data. Things like estimators and preprocessing functions don't need to know anything about what generated the data; MD-engine-specific concerns are confined to parser code.

I think @halx is just pointing out that we're going to have to think very carefully through what the parsers are going to return. GROMACS is, from my perspective, actually sort of a bad starting point in this regard because it handles things in the way that makes sense to me -- which means that it comes as more of a shock when you encounter how some of the other codes do things. For example, as Hannes mentions, in AMBER, it is possible to have BAR/MBAR data for one set of timesteps but TI data for different timesteps. A key question would be how this would be handled/communicated to the estimators without them having to know something funky is going on. I suppose it may mean that the data provided back has to have the time attached to it still?

I agree with the other points of discussion, and that plotting should not be within the scope of the library. What about equilibration detection, such as John Chodera's recent reverse equilibration stuff?

Also, another question which didn't make it into my earlier thoughts: What about handling of expanded ensemble simulations such as those @mrshirts has worked on in GROMACS, where one might run a single simulation spanning multiple lambda values? Can those fit into this framework, @mrshirts ?

dotsdl commented 7 years ago

Re: NAMD parser

Thanks @limn1, @vtlim, @mrshirts! I'll open an issue calling for a NAMD parser soon, once we know from this discussion what we want out of them. I don't want to open too many threads with requests before we've got some consensus on direction.

dotsdl commented 7 years ago

@davidlmobley I'll try and answer each of your questions. Thanks for taking the time!

How are you expecting to handle units? I realize that units are not attached to numbers at input, but it strikes me that we should be attaching them for what we return, via a library (pint? or unit in simtk)?

For simplicity, I think we should choose a common set of units in which alchemlyb will yield all quantities. For energies, these should be multiples of kT, and all parsers will yield results in this form from all MD packages. Is there any other unit we need to define as part of a standard set for a library with this limited scope?

The spec notes TI and MBAR; will we perhaps also need BAR for some applications? What about Zwanzig (exponential averaging)? (For example, I might calculate hydration free energies in implicit solvent by using only a single gas phase simulation and re-evaluating energies in explicit solvent, so I'd want Zwanzig for that.)

We could definitely provide a BAR estimator. It could take a two-column reduced-potential DataFrame as input to BAR.fit() similar to MBAR, and that should be all we need. As for Zwanzig, we could also make an estimator for this, since it would also take reduced potentials similar to BAR (right?). Any additional estimators we want will also have a place in the alchemlyb.estimators namespace.

I still require some education as to some things github/open source/etc. What gist are you referring to and where is it?

A gist is a quick way to share code snippets or demos via github without making another repository for it. I made a notebook showing how alchemlyb works under the current proposal, and I shared it as a gist: https://gist.github.com/dotsdl/a41e5756a58e1775e3e3a915f07bfd37

Re: gitter

The gitter channel is live: https://gitter.im/alchemistry/alchemlyb Gitter is similar to slack, but public and linked to activity in the repository. Anyone with a GitHub account can pop in and join the discussion. The point of it is to make low-latency conversation between developers transparent and easy to join.

Will we be trying to provide an extensive set of examples (which ideally would be tested via travis) for the different codes along with this, I would hope?

Documentation will be a big part of this project, probably served through readthedocs. Since this is a library intended to help users get going with alchemical calculations, example-based documentation will definitely be needed. It will probably be worthwhile compiling together a testfiles repo (@mrshirts was interested in pulling together some, IIRC), which alchemlyb could then use for testing and examples (MDAnalysis does something similar). We'd want to keep this relatively small though (with compression) so it's not a big download and can be version controlled and hosted on GitHub.

Should we draft a [WIP] pull request to bring these ideas into a main README.md for the library or something similar?

That sounds like a good idea to me. I can open one up. Merging the PR will amount then to accepting the proposal we've converged on (heh).

Also, my student @limn1 posted a link to some code he has going which could form the basis of a NAMD parser or help us or you write one once the API is more clear. Should we break this out into a separate issue?

Excellent. Yeah, once we've agreed on the direction I'll begin by making a batch of issues for individual things we need like parsers, estimators, and the like. Each of those will focus discussion on what's needed, and we can start to get PRs addressing them.

Aside from this somewhat random collection of questions/issues/comments, I'm quite happy with the overall proposal. I'll respond to some of the other comments separately.

Awesome! I'm excited to begin filling out the library. Happy this is happening. :D

dotsdl commented 7 years ago

For example, as Hannes mentions, in AMBER, it is possible to have BAR/MBAR data for one set of timesteps but TI data for different timesteps.

This won't be a problem under the current scheme, as one would use a different parser for extracting BAR/MBAR data (alchemlyb.parsing.amber.extract_u_nk) and TI data (alchemlyb.parsing.amber.extract_DHdl). The estimators you would then feed each of these results to would be different, and they generally won't care about time information (though that will be retained, since these are DataFrames and Series, respectively). Having parsing well-separated from analysis in this way avoids the problem of estimators having to worry about how different MD engines do things.

What about equilibration detection, such as John Chodera's recent reverse equilibration stuff?

We've got that already implemented as a preprocessing function.

What about handling of expanded ensemble simulations such as those @mrshirts has worked on in GROMACS, where one might run a single simulation spanning multiple lambda values?

I've never worked with data like this, but if it's both possible and meaningful to extract the data from these into a form that alchemlyb is proposed to work with, then we'd simply need a parser for them. We could add it to the alchemlyb.parsing.gmx module, assuming the use case isn't already covered by the existing parsers.

mrshirts commented 7 years ago

What about handling of expanded ensemble simulations such as those @mrshirts has worked on in GROMACS, where one might run a single simulation spanning multiple lambda values? I've never worked with data like this, but if it's both possible and meaningful to extract the data from these into a form that alchemlyb is proposed to work with, then we'd simply need a parser for them. We could add it to the alchemlyb.parsing.gmx module, assuming the use case isn't already covered by the existing parsers.

Basically, you just need to associate with each datapoint of U or dU/dL data the thermodynamic state that it came from. So you really only need the u_kn matrix to be able to calculate the data.

Another thing to watch out for is equilibration of the dynamics: In that case, one needs to pay attention to whether the weights are still moving or not. That is all at the parser level.

The other thing is determining correlation times for either expanded ensemble or replica exchange data. In that case, it becomes significantly more complicated, because you have a series of time series which have correlations between them.

mrshirts commented 7 years ago

David (Dotson), should we sit down at some point and walk through what the alchemical-analysis code is doing? There's a lot of choices there that could help inform the design of anything new, and walking through the code would probably be the best way to go over those issues.

baoilleach commented 7 years ago

Quick comment from experience with cclib. It's useful to have a convenience method that autodetects the file format and parses it. I don't know if this is possible for the MD files in question but it makes usage easier at the Python command-prompt and indeed when integrated into, for example, a File Open dialog box.

BTW, while I love the functionality of scikit-learn, I really don't like its API. It's really un-Pythonic and horrible. I just don't get it.

halx commented 7 years ago

Here are some answers in random order:

Noel, the automatic detection can be done but will require some heuristics and extra hoops to deal with the formats. They are all so different that I do not see a common ground for a universal parser. BTW, what is your interest in this?

Parsers: so the helper functions are parser specific and we just return data types. I do not know pandas but that reads to as one more dependency to the code. Why would I want that format? What the parsers needs to be provided with is list (generator) of file handles/methods. I would want to also be able to handle compressed data streams (I have done this in a simple fashion in the AMBER parser). There may be some extra information e.g. add the moment the temperature, time step information, chosen methods. Almost all of this needs to go and be handled at the higher level. We can distinguish between FEP-like and TI data but what is the best way to detect e.g. that the data is only good for BAR (the Desmond output seems to do only that and NAMD I think). As said, the parsers and all other modules should only really know what they need to know. To make all parsers provide data in a pre-defined unit is a good idea but kT would mean that all parsers need to convert and I would think that a least-effort approach may be better.

Estimators: we should provide whatever is interesting, at the minimum what is available now in alchemistry analysis (we really need a shorter name). That does not mean the alchemlyb library needs to implement those right out of the box. We just need a convenient mechanism for extension (which should be trivial).

NAMD: you guys are aware that with NAMD 2.12 the output file format has changed, at least for TI? Also TI and FEP are mutually exclusive and I think the formats are quite different (only have TI examples at the moment). We will need to be able to handle all of those.

baoilleach commented 7 years ago

...no real interest - just a drive-by comment :-)

halx commented 7 years ago

@baoilleach Don't worry, I'll return the favour one day... :-)

davidlmobley commented 7 years ago

@dotsdl :

For simplicity, I think we should choose a common set of units in which alchemlyb will yield all quantities. For energies, these should be multiples of kT, and all parsers will yield results in this form from all MD packages. Is there any other unit we need to define as part of a standard set for a library with this limited scope?

OK, I think this is all right, as long as we're willing to insist on just kT units. Units support becomes important if people start wanting different energy units as options.

We could definitely provide a BAR estimator. It could take a two-column reduced-potential DataFrame as input to BAR.fit() similar to MBAR, and that should be all we need. As for Zwanzig, we could also make an estimator for this, since it would also take reduced potentials similar to BAR (right?). Any additional estimators we want will also have a place in the alchemlyb.estimators namespace.

Yes, the input for Zwanzig is the same as for, say, BAR, except that you might have work values only in one direction.

I made a notebook showing how alchemlyb works under the current proposal, and I shared it as a gist: https://gist.github.com/dotsdl/a41e5756a58e1775e3e3a915f07bfd37

I hadn't spotted this when I commented. I saw it later. ( @limn1 - you might want to take a look at that link and comment if you have thoughts, as it has some overlap with some of the things you were pointing out not that long ago.)

Since this is a library intended to help users get going with alchemical calculations, example-based documentation will definitely be needed. It will probably be worthwhile compiling together a testfiles repo (@mrshirts was interested in pulling together some, IIRC), which alchemlyb could then use for testing and examples (MDAnalysis does something similar). We'd want to keep this relatively small though (with compression) so it's not a big download and can be version controlled and hosted on GitHub.

This sounds great.

@halx made a couple other comments I should reply to:

We can distinguish between FEP-like and TI data but what is the best way to detect e.g. that the data is only good for BAR (the Desmond output seems to do only that and NAMD I think)

I think the point he's making here is that if you're not that experienced, you may not even know whether the files you have provide CONTAIN data for a particular analysis method -- e.g. if you've run DESMOND, you can only do BAR (not MBAR, TI, etc.), so there is no point in trying to apply parsers for TI, etc. to DESMOND output. Still, I think this can be handled by just (a) providing good documentation, and (b) providing informative error messages, so if you try to extract data for TI from Desmond output it will tell you that you can't do it (and why)

To make all parsers provide data in a pre-defined unit is a good idea but kT would mean that all parsers need to convert and I would think that a least-effort approach may be better.

Well, conversion into standard units has to be done somewhere. It seems like either the parsers have to do it, or they have to return data with units attached. It seems intellectually simplest to me to just have them use standard units.

NAMD: you guys are aware that with NAMD 2.12 the output file format has changed, at least for TI? Also TI and FEP are mutually exclusive and I think the formats are quite different (only have TI examples at the moment). We will need to be able to handle all of those.

We only have "FEP" (BAR/MBAR) examples from NAMD, not TI examples, so if you can provide TI then we should probably have both covered.

dotsdl commented 7 years ago

@mrshirts, I'd definitely appreciate a sit-down to look at the current state of the alchemical-analysis script. Ideally we'd be able to replace much of the meat of that script with alchemlyb components, so looking at the problem from that direction would be beneficial. Ping me via email or the alchemistry slack and we'll set up a chat.

dotsdl commented 7 years ago

@halx, see below:

Parsers: so the helper functions are parser specific and we just return data types. I do not know pandas but that reads to as one more dependency to the code. Why would I want that format?

pandas has become a standard library in the scientific Python stack these days, and its data structures are a better fit for data than simple numpy arrays (which are great for values on which computation must be done, and on which pandas objects are built). They allow for greater power and flexibility when working with tabular data or timeseries than simply using numpy, and so using them as a primitive for data in alchemlyb is a good fit.

There may be some extra information e.g. add the moment the temperature, time step information, chosen methods. Almost all of this needs to go and be handled at the higher level.

I'm not sure this is really necessary, as no estimators rely directly on any of this information, to my knowledge. We only need things like reduced potentials or derivatives of reduced potentials with respect to lambdas (which already roll in things like temperature), and so the parsers for these can exist as separate functions. And if a package, such as Desmond, only ever gives data good for doing BAR, then we probably wouldn't provide a function like extract_DHdl in alchemlyb.parsing.desmond. Alternatively, if we did, it would raise an exception explaining the situation.

Estimators: we should provide whatever is interesting, at the minimum what is available now in alchemistry analysis (we really need a shorter name). That does not mean the alchemlyb library needs to implement those right out of the box. We just need a convenient mechanism for extension (which should be trivial).

Adding a new estimator just means adding a new module. We'll generally try to keep the API consistent between estimators (as scikit-learn does for its >100 estimators) as we go, which I think is probably doable given that existing ones work with either potentials or derivatives of potentials.

dotsdl commented 7 years ago

@davidlmobley it looks like we're in agreement on a lot of things. I don't see anything in your post to comment further on that hasn't already been said. :)

halx commented 7 years ago

@dotsdl

pandas has become a standard library in the scientific Python stack these days, and its data structures are a better fit for data than simple numpy arrays (which are great for values on which computation must be done, and on which pandas objects are built). They allow for greater power and flexibility when working with tabular data or timeseries than simply using numpy, and so using them as a primitive for data in alchemlyb is a good fit.

I think you need to step outside the alchemlyb box for a moment. I want to develop the parsers totally independent from the needs of some particular software. I or somebody else may want to use the parser in a different project some time later when a certain data format is not the big rage anymore (or does not wish to us pandas or whatever). The current situation demonstrates this perfectly well. What should be done is to keep the parsers as simple as possible which also implies to have them only return simple builtin data structures. A mid layer can then translate this to whatever the needs are further downstreams.

I'm not sure this is really necessary, as no estimators rely directly on any of this information, to my knowledge. We only need things like reduced potentials or derivatives of reduced potentials with respect to lambdas (which already roll in things like temperature), and so the parsers for these can exist as separate functions. And if a package, such as Desmond, only ever gives data good for doing BAR, then we probably wouldn't provide a function like extract_DHdl in alchemlyb.parsing.desmond. Alternatively, if we did, it would raise an exception explaining the situation.

I suggest to have a look at all those Zwanzig derived formulars. What is happening in aa at the moment is that the user can provide a temperature T on the command line. That T is then used to scale the energies accordingly (the current suggestion is use kT as the unit!). Timestep information is used to to prune initial data. While we may have automatic "equilibration" checking we may want to keep this still there. But the crucial point is that we will want to, at least, pass mack some important run time parameters, for reporting purposes and for consistency checking. We also need to think about early failure detection. With automation this really starts to matter.

Regarding BAR the question really was BAR vs MBAR. Currently this is done through numpy arrays but how can you tell whether the data in the array is good for MBAR. We obviously needs some tagging mechanism. @mrshirts would a 'nan' work for this or do you any other suggestion other than handling this at the client level (I mean the code that calls pymbar).

@davidlmobley

OK, I think this is all right, as long as we're willing to insist on just kT units. Units support becomes important if people start wanting different energy units as options.

And they will. But in practice this will be just a matter of output i.e. convert the units wants from the internal format to the target one.

Well, conversion into standard units has to be done somewhere. It seems like either the parsers have to do it, or they have to return data with units attached. It seems intellectually simplest to me to just have them use standard units.

Ok, simplicity is always a good argument :-)

We only have "FEP" (BAR/MBAR) examples from NAMD, not TI examples, so if you can provide TI then we should probably have both covered.

Off-list or would it be a good idea to "attach" somewhere here?

I am currently in contact with Brian Radak, who works on the alchemical code in NAMD, and he says he is open to file format changes. I suggest to add something like '# NAMD 2.12 TI '. That would help also in auto-dectecting the file format. The time stamp or other timing information would help a great lot to properly stitch together files for the same lambda from different runs (relying on a file name convention or time stamp on the file is not really a good idea). Do you have needs for other meta data?

EDIT: Brian is writing analysis tools himself https://github.com/radakb/pynamd including FEP/TI analysis so may be a good idea to look into that. He also points out that alchfileout for TI reports block averages which break correlation analysis of course. So best would be to parse the TI line in the output file (the one that namd writes to stdout).

orbeckst commented 7 years ago

@halx https://github.com/alchemistry/alchemlyb/issues/6#issuecomment-272107510

I want to develop the parsers totally independent from the needs of some particular software. I or somebody else may want to use the parser in a different project some time later when a certain data format is not the big rage anymore (or does not wish to us pandas or whatever). The current situation demonstrates this perfectly well. What should be done is to keep the parsers as simple as possible which also implies to have them only return simple builtin data structures. A mid layer can then translate this to whatever the needs are further downstreams.

My understanding was that the parsers would be part of a module in alchemlyb.parsers. As such they would simply share the same set of library dependencies as everything else in alchemlyb. Given that pandas is an excellent tool for dealing with timeseries data and is easily installable, I see no problem with having it as an alchemlyb dependency and hence also available for the parsers. I would not artificially limit the library to just using pure python or only numpy – the Python eco system has moved forward over the last couple of years and there are many excellent packages that make life so much easier.

I would go as far and say that the pandas.DataFrame should be a fundamental data structure throughout the library.

We have similar discussions at MDAnalysis and over the last 10 years we have seen an enormous improvement in how packages can be installed – locally and on large super computers – so the "keep everything most basic" argument is not quite as powerful as it once was. It is certainly true that one incurs debt by relying on someone else's API and no-one likes refactoring in order to catch up with API breaks. But I don't expect a library like alchemlyb to be completed and done and then no-one touches it anymore. Rather, if the developer community around it becomes quiescent then the project will die, too. At least in the Python world I see software as something pretty dynamic that is very much entangled with the people who develop and maintain it. So, as long as there are developers, someone will update APIs – that's pretty much what I see.

dotsdl commented 7 years ago

@halx I think we disagree in some key places here, but I'll try and lay out my thinking in response and go from there.

I think you need to step outside the alchemlyb box for a moment. I want to develop the parsers totally independent from the needs of some particular software. I or somebody else may want to use the parser in a different project some time later when a certain data format is not the big rage anymore (or does not wish to us pandas or whatever). The current situation demonstrates this perfectly well. What should be done is to keep the parsers as simple as possible which also implies to have them only return simple builtin data structures. A mid layer can then translate this to whatever the needs are further downstreams.

I don't think putting alchemical data into pure-Python primitives is the answer here, and I think attempting to do so is an effort to "future-proof" at great cost to performance (and this is a numerical library). pandas is now considered part of the core scientific python stack, and so it makes sense to use it here.

I suggest to have a look at all those Zwanzig derived formulars. What is happening in aa at the moment is that the user can provide a temperature T on the command line. That T is then used to scale the energies accordingly (the current suggestion is use kT as the unit!). Timestep information is used to to prune initial data. While we may have automatic "equilibration" checking we may want to keep this still there. But the crucial point is that we will want to, at least, pass mack some important run time parameters, for reporting purposes and for consistency checking. We also need to think about early failure detection. With automation this really starts to matter.

If you haven't already, I recommend reviewing my gist for how all these pieces fit together. I think all of what you describe here is accomplished by the proposed API.

Regarding BAR the question really was BAR vs MBAR. Currently this is done through numpy arrays but how can you tell whether the data in the array is good for MBAR. We obviously needs some tagging mechanism. @mrshirts would a 'nan' work for this or do you any other suggestion other than handling this at the client level (I mean the code that calls pymbar).

Not sure what you mean here, but I think this is a detail we can discuss later. It's a bit too in the weeds I think to focus on here.

I am currently in contact with Brian Radak, who works on the alchemical code in NAMD, and he says he is open to file format changes. I suggest to add something like '# NAMD 2.12 TI '. That would help also in auto-dectecting the file format. The time stamp or other timing information would help a great lot to properly stitch together files for the same lambda from different runs (relying on a file name convention or time stamp on the file is not really a good idea). Do you have needs for other meta data?

EDIT: Brian is writing analysis tools himself https://github.com/radakb/pynamd including FEP/TI analysis so may be a good idea to look into that. He also points out that alchfileout for TI reports block averages which break correlation analysis of course. So best would be to parse the TI line in the output file (the one that namd writes to stdout).

Okay cool, we'll get started on pulling parsers from around the community after we've closed this issue. Thanks @halx for getting this going and pointing this work out!

dotsdl commented 7 years ago

So I'm interested in moving on to implementation starting next week (Jan 23), but I also want to give this discussion time to finish on its own terms. If you have issues/suggestions for this proposal, try and get your word in over the next few days so we can find a satisfying conclusion and keep things moving forward. Remember that this is just a starting point, and things can change later if we find unforseen issues with the plan.

dotsdl commented 7 years ago

Seeing no further discussion, I'm going to start this week raising issues for components that are missing, and we'll get to work. In particular we'll be looking for initial code for parsers in different formats, and we'll work to get these into the form demonstrated here. I'll leave this issue open for another week or so for discussion purposes.

The proposal at the top of the issue will be included in the initial docs, which will also go up this week hopefully. Thanks all!

dotsdl commented 7 years ago

This proposal now has a home in the docs, and will serve as a guidepost for devs on what we're aiming for here: http://alchemlyb.readthedocs.io/en/latest/api_proposal.html