Scope of the project and relationship to openmoltools

andrrizzi commented 8 years ago

This is awesome! Thanks for opening this @gregoryross !

I'd like to open a discussion on what should go here and how this is going to relate to openmoltools. Currently, openmoltools is organized by "dependency".

Do we envision this as a setup library organized by tasks instead?
Should we put all dependency-specific utility functions in openmoltools?

In general, I think this repo will be useful to the group only if it is well organized. We don't want to create an openmoltools2.

gregoryross commented 8 years ago

Good questions!

The aim of this repo should be to summarize best practices for the lab when performing common tasks. The 'best practices' should be encapsulated in functions with simple inputs and outputs.

Yes, this repo should be organized by task and functionality. The emphasis should be on wrapper functions that simplify what should be basic but vital tasks (ligand protonation, docking, protein preparation, etc.), so that the whole group is clear on which methods to use for common, bread-and-butter work. In fact, there's no reason why functions from openmoltools can't be referenced, so long as the inputs and outputs are clear and interpretable. We certainly don't want an openmoltools2!

gregoryross commented 8 years ago

One extra point: there's lots of very useful stuff in openmoltools, but it's hard to know what's there, or where to find it even if you know what you're looking for. This repo can contain easily accessible and interpretable front-ends to many of the tools in openmoltools.
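As a rough illustration of the "easy to find" idea, here is a minimal sketch of a searchable task catalog. Everything in it is an assumption made for illustration: `task`, `find_tasks`, `protonate_ligand` and `dock_ligand` are hypothetical names, not existing openmoltools functions, and the front-end bodies are placeholders.

```python
from typing import Callable, Dict, List

# Hypothetical catalog: plain-language task description -> front-end function.
_TASKS: Dict[str, Callable] = {}


def task(description: str) -> Callable[[Callable], Callable]:
    """Register a front-end function under a plain-language task description."""
    def decorator(func: Callable) -> Callable:
        _TASKS[description] = func
        return func
    return decorator


def find_tasks(keyword: str) -> List[str]:
    """List 'description -> function name' for descriptions containing keyword."""
    keyword = keyword.lower()
    return [
        f"{description} -> {func.__name__}"
        for description, func in sorted(_TASKS.items())
        if keyword in description.lower()
    ]


@task("protonate a small-molecule ligand")
def protonate_ligand(smiles: str, ph: float = 7.4) -> str:
    # Placeholder: a real front-end would call the lab's preferred tool here.
    raise NotImplementedError


@task("dock a ligand into a receptor")
def dock_ligand(ligand_path: str, receptor_path: str) -> str:
    # Placeholder, for illustration only.
    raise NotImplementedError


# Someone who only knows the task name can look up the matching function.
print(find_tasks("dock"))
```

The catalog itself does no chemistry; it only answers "which function do I call for X?", which is the gap described above.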

bas-rustenburg commented 8 years ago

I do think the dependency-based structure of openmoltools (omt) makes sense. I think the point of omt is to group a lot of simple utility functions together, based on whether you have access to them license-wise.

I can imagine something like this becoming a repository where we integrate a lot of those into much more complicated functions.

gregoryross commented 8 years ago

I would certainly like to work up to more complicated functions. For now, I think we should focus on the simple, common things that we have to do when setting up simulations.

Something that omt lacks is any sort of description of what software and method (with dependencies) is preferred in the lab for a given task. This repo can provide that information in a task-focused way.

I think there should also be open-source/license-free alternatives for tasks in case our licenses fall through. And I like things to be open-source :)

jchodera commented 8 years ago

We should definitely rope @LNaden into this discussion as well!

bas-rustenburg commented 8 years ago

Something that omt lacks is any sort of description of what software and method (with dependencies) is preferred in the lab for a given task. This repo can provide that information in a task-focused way.

That sounds good.

I think there should also be open-source/license free alternatives for tasks in case our licenses fall through.

I doubt one of our licenses will ever expire by surprise, and we should be able to code up an alternative pipeline when the need arises.

I think decisions about what tools we use/implement should also depend on whether they're worth maintaining. Let's not be overly ambitious: we're not a software company and we're not trying to make software that is as diverse as possible. Our primary goal should be serving ourselves to meet scientific needs, and we should use the best tools available to us to get the job done.

That said, we will probably already have plenty of scientific reasons to use several software packages, so I think we'll end up with quite a diverse set anyway.

andrrizzi commented 8 years ago

Our primary goal should be serving ourselves to meet scientific needs

I agree with this a lot, especially at this stage.

Personally, the way I would proceed to create the library would be to first determine the big steps that we need to solve during setup (e.g. settling the protonation/tautomeric state of a molecule, parametrization, loop modelling, mutations, etc.) and create simple functions that represent the best practice within the lab for each task (with options to switch to a different route in case the first one is unavailable or not ideal). Something like the following (this is strongly biased towards the YANK setup pipeline; there could be better designs):

my_molecule = Molecule(file_path='my_molecule.mol2')  # we can have smiles or name as source
my_molecule.protonate_epik(select=0, ph=7.4, ...)
my_molecule.charge_quacpac()
my_molecule.parametrize_antechamber(charge_method=None)  # use charges from OpenEye
my_molecule.save('dir/my_molecule.sdf')  # we can use different tools for different formats here since no tool is the best for all

receptor = Molecule(file_path='receptor.pdb')
my_molecule.dock_openeye(receptor, start_position=(1.0, 2.0, 3.0))
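The "switch to a different route if the first one is unavailable" part could be sketched as below. This is only a minimal illustration under stated assumptions: `Route`, `is_available` and `run_first_available` are hypothetical names, the module name `some_licensed_toolkit_placeholder` is a stand-in that is assumed not to be installed, and the lambdas stand in for real tool calls.

```python
import importlib.util
from typing import Callable, NamedTuple, Sequence, Tuple


class Route(NamedTuple):
    """One way of carrying out a task (all names here are hypothetical)."""
    name: str
    required_module: str        # top-level module the route depends on
    run: Callable[[str], str]   # takes an input path, returns an output path


def is_available(module_name: str) -> bool:
    """True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def run_first_available(routes: Sequence[Route], input_path: str) -> Tuple[str, str]:
    """Try routes in order of preference; return (route name, result)."""
    for route in routes:
        if is_available(route.required_module):
            return route.name, route.run(input_path)
    names = ", ".join(route.name for route in routes)
    raise RuntimeError(f"No usable route among: {names}")


# Stand-in routes: the first needs a module assumed to be missing, the second
# needs only the standard library, so it acts as the fallback.
routes = [
    Route("preferred", "some_licensed_toolkit_placeholder", lambda p: p + ".preferred"),
    Route("fallback", "json", lambda p: p + ".fallback"),
]
print(run_first_available(routes, "my_molecule.mol2"))
```

The order of the list encodes the lab's preference, so changing the recommended tool means reordering one list rather than editing every script.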

Places I'm aware of that already have code integrating a lot of different tools to solve some of these problems:

yank/systembuilder.py: not working for now, but inspiringly well structured.
yank/yamlbuilder.py: contains code for most of the lab's current best practices (I think), but (purposely) without an extensible API right now.
ensembler: I'm not very familiar with this codebase, but it contains a lot of automatic protein preparation code that is missing in YANK.

I guess the first reasonable task for this project is to import that code in here and give it a nice interface?

jchodera commented 8 years ago

Important question: Should we really tie ourselves to particular tools in the API, and make this a real "pipeline" that chains them together? Or should we design the API with best practices and flexibility in mind, where the tools that are used under the hood are incidental and subject to change?
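To make the second option concrete, here is a minimal sketch of a tool-agnostic entry point. It is an illustration only: `protonate`, `register_backend`, `tool_a` and `tool_b` are hypothetical names, and the backends return placeholder strings rather than calling any real package.

```python
from typing import Callable, Dict, Optional

# Hypothetical backends for one task. Each takes and returns plain data so
# that callers never handle tool-specific objects.
Backend = Callable[[str, float], str]
_PROTONATION_BACKENDS: Dict[str, Backend] = {}
_DEFAULT_BACKEND = "tool_a"  # the current "best practice"; free to change later


def register_backend(name: str) -> Callable[[Backend], Backend]:
    """Add a backend implementation under a short name."""
    def decorator(func: Backend) -> Backend:
        _PROTONATION_BACKENDS[name] = func
        return func
    return decorator


@register_backend("tool_a")
def _protonate_with_tool_a(smiles: str, ph: float) -> str:
    return f"{smiles} protonated at pH {ph} (tool_a stand-in)"


@register_backend("tool_b")
def _protonate_with_tool_b(smiles: str, ph: float) -> str:
    return f"{smiles} protonated at pH {ph} (tool_b stand-in)"


def protonate(smiles: str, ph: float = 7.4, backend: Optional[str] = None) -> str:
    """Public entry point; which tool runs is an implementation detail."""
    chosen = backend or _DEFAULT_BACKEND
    if chosen not in _PROTONATION_BACKENDS:
        raise ValueError(f"Unknown backend {chosen!r}")
    return _PROTONATION_BACKENDS[chosen](smiles, ph)


print(protonate("CCO"))                    # uses the default backend
print(protonate("CCO", backend="tool_b"))  # explicit override
```

In this design, user scripts call `protonate(...)` only, so switching the default tool is a one-line change inside the library.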

We should also really make our initial "capture" phase for requirements center on use cases rather than tools. What do we need to do to biomolecular systems to go from inputs to outputs, what makes that hard now, and what would an API that makes this easy look like? We should start with the various use cases we have in the lab right now.

Lnaden commented 8 years ago

I think before we can answer how we want the API to look, we should really settle on a scope.

If we want to extend this beyond the group, then I think the more generalized option would be better, but harder to manage (consistent inputs/outputs across replacement tools, highly disciplined code maintenance, frequent workflow reworks as new tools are made). This option would make the openmoltools effort more like that of a full software company and is probably a bit ambitious, since I doubt we could effectively assign a dedicated "dev" to the code.

The specific-tools approach would be easier, but it would be harder to incorporate new tools as old ones become obsolete. We would still want to enforce consistent input/output for any given tool, but then the library can be treated more as a collection of helper functions than as any one pipeline. Examples of what a pipeline would look like can then be written for those outside the group. Other potential problems I see with this option: it is difficult to maintain unifying documentation that explains how everything works together; there will be lots of function deprecation as new/updated tools are added, which may make older code hard to run; and it runs the risk of becoming a dumping ground for "hey, this is a helpful, very specific function, let's put it here!"
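On the deprecation worry, one common Python pattern (sketched here as an assumption, not something that exists in openmoltools) is to keep the old function name as a thin wrapper that warns and forwards to the new one. The names `deprecated`, `assign_charges` and `charge_molecule_old` are hypothetical.

```python
import functools
import warnings
from typing import Callable


def deprecated(replacement: str) -> Callable[[Callable], Callable]:
    """Mark a helper as deprecated while keeping it callable."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def assign_charges(molecule_path: str) -> str:
    """Current helper (hypothetical): path in, path of the charged file out."""
    return molecule_path.replace(".mol2", ".charged.mol2")


@deprecated(replacement="assign_charges")
def charge_molecule_old(molecule_path: str) -> str:
    """Older name kept so that existing scripts still run."""
    return assign_charges(molecule_path)
```

Old scripts keep working and get a pointer to the replacement, provided both functions keep the same input and output types.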

I'm not too familiar with what's in openmoltools right now, but those are my thoughts on the direction of the project.

andrrizzi commented 8 years ago

the tools that are used under the hood are incidental and subject to change

This would be ideal. The difficulty here is that different tools have different options, and have slightly different scopes, so designing a single method for all of them could be tricky.
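One possible way to cope with differing options, shown as a minimal sketch under assumptions: keep the shared options as explicit fields and pass tool-specific ones through a separate, validated dictionary. `ChargeRequest`, `SUPPORTED_OPTIONS`, `validate` and the option names are all hypothetical and do not correspond to any real tool's flags.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass(frozen=True)
class ChargeRequest:
    """Options every backend is expected to understand, plus a pass-through bag."""
    molecule_path: str
    net_charge: int = 0
    backend_options: Dict[str, Any] = field(default_factory=dict)


# Hypothetical per-backend option names; real tools would define their own.
SUPPORTED_OPTIONS: Dict[str, FrozenSet[str]] = {
    "tool_a": frozenset({"max_conformers"}),
    "tool_b": frozenset({"charge_method", "optimize_geometry"}),
}


def validate(request: ChargeRequest, backend: str) -> None:
    """Reject options the chosen backend does not understand instead of ignoring them."""
    unknown = set(request.backend_options) - SUPPORTED_OPTIONS[backend]
    if unknown:
        raise ValueError(f"{backend} does not support: {sorted(unknown)}")


request = ChargeRequest("my_molecule.mol2", backend_options={"max_conformers": 50})
validate(request, "tool_a")  # passes silently: tool_a understands max_conformers
```

Failing loudly on an unsupported option avoids the situation where switching tools silently drops a setting the user relied on.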

center on use cases rather than tools

Absolutely agree. This is what I meant by "first determine the big steps that we need to solve during setup". The code above was just an example to make my general idea clear.

gregoryross commented 8 years ago

I strongly agree with focusing on use cases and problems that are commonly encountered in the lab. A few of these have been mentioned already: for instance, protonating small molecules and generating and picking their tautomers, ligand docking, protonating proteins, etc. Figuring out the precise list of common use cases is a priority. Deciding on the best way(s) to tackle these problems and encapsulating the solution(s) in an easy-to-use set of tools should be the primary goal of this repo.

A very important aspect of this repo should be the documentation. At present, there isn't any place where one can find out 'how should I do X', where X is a task that's typically encountered by the group but is new to the particular user. That is certainly something that should be within the scope of this repo, and is beyond the scope of openmoltools.

I like the look of the API suggested by @andrrizzi, as it looks intuitive and easily extendable with new/better functions. Having something like that would be great as a medium-term goal. Before settling on an API right now though, I'd like the short-term goal of this repo to be establishing documentation and functions that indicate how to address our common setup issues.

jchodera commented 8 years ago

I think there are two distinct sets of objectives here:

Document best practices for using existing tools to accomplish tasks ("How do I do X?") for the lab. Perhaps this could be in a lab-best-practices or drylab-protocols repo.
Construct a pipeline or workflow that makes it easier to carry out best practices (existing, desired, or under investigation) for a wide variety of tasks, especially those involving OpenMM. This is really what we intended mmtools and openmoltools to be, though at the moment they are just "dumping grounds" for useful code. @lnaden will presumably have a major role in engineering this, since we would want to factor out setup pipeline code from YANK into a standalone package over the next year.

gregoryross commented 8 years ago

I think those points summarise things nicely. However, I think it would be really useful to have the code and documentation in the same repo. That way, what we state as best practice won't be divorced from what we actually do when setting up and running simulations.
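One way to keep documented practice and code from drifting apart, offered as a sketch rather than something proposed in this thread, is to make the documentation examples executable with Python's standard `doctest` module. The function `net_charge` and the helper `docs_match_code` are hypothetical names used only for illustration.

```python
import doctest
from typing import Callable, Sequence


def net_charge(partial_charges: Sequence[float]) -> int:
    """Return the total charge of a molecule, rounded to the nearest integer.

    Documented usage, which doubles as a test:

    >>> net_charge([0.5, 0.25, 0.25])
    1
    >>> net_charge([-0.5, -0.5])
    -1
    """
    return round(sum(partial_charges))


def docs_match_code(func: Callable) -> bool:
    """Run the examples in func's docstring; True if all of them pass."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    results = doctest.DocTestRunner(verbose=False).run(test)
    return results.failed == 0 and results.attempted > 0


print(docs_match_code(net_charge))
```

If someone changes the function without updating its documented example, the check fails, so the written "best practice" cannot silently go stale.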

jchodera commented 8 years ago

I think we definitely want to evolve toward a simulation setup pipeline that encompasses all best practices, but the way to do this is definitely not just to pull in all the relevant code from mmtools, openmoltools, simtk.openmm.app, ensembler, etc., and throw some documentation alongside it. That would be as much of a disaster as openmoltools is now.

What we really need is a plan for building this pipeline, involving

use-case requirements capture
architecting an API that can support experimentation with best practices
implementing the API
experimenting with best practices workflows

In parallel, we need some way right now to organize suggestions about the best way to accomplish some tasks, much like the wetlab standard operating procedures (SOPs) we have in https://github.com/choderalab/lab-protocols

andrrizzi commented 8 years ago

However, I think it would be really useful to have the code and documentation in the same repo.

I agree with this. We could start by creating a document here on how we do things right now, showing examples that use openmoltools/mdtraj/etc. functions, and think later about the best way to give them a consistent interface. I think the document would serve well as a first capture of use cases.

choderalab / Drylab-Protocols

Scope of the project and relationship to openmoltools #2