MDAnalysis / solvation-analysis

A comprehensive tool for analyzing liquid solvation structure.
https://solvation-analysis.readthedocs.io/en/latest/
GNU General Public License v3.0
46 stars 13 forks source link

Support for multi atom solutes #75

Closed orionarcher closed 2 years ago

orionarcher commented 2 years ago

Description

This PR was split off from PR #70 to better separate quality-of-life changes from API changes. See that PR for some history and discussion.

This is intended to be a major PR to handle multi-atom solutes. It relates to issues #47, #66, and #58.

I would like to propose the following outline of new functionality. In this description, I will focus on the outward facing API. I'll use the somewhat trivial case of water as an example.

  1. Solution will be renamed to Solute. All references to solute in the current documentation will be renamed to solvated atom or solvated atoms. I think this better captures what the Solute class really is, especially as we expand to multi-atom Solutes.

  2. The default initializer for Solute will take only a single atom per residue. It will not support multiple identical atoms on a residue. This will be handled by the more general case. As a result, instantiating a Solute for a single atom remains the same.

water = u.select_atoms(...)
water_O = u.select_atoms(...)
water_O_solute = Solute(water_O, {"water": water})
  1. Additional initializers will be added to instantiate a Solute, these will support multi-atom solutes.

The first will allow the user to stitch together multiple solutes to create a new solute.

water_H1 = u.select_atoms(...)
water_H2 = u.select_atoms(...)
solute_O = Solute(water_O, {"water": water})
solute_H1 = Solute(water_H1, {"water": water})
solute_H2 = Solute(water_H2, {"water": water})

multi_atom_solute = Solute.from_solutes([solute_O, solute_H1, solute_H2])  # maybe this should be a dict?

The second will allow users to simply instantiate a solute from an entire residue (or part of a residue). There may be technical challenges here so this behavior is not guaranteed.

multi_atom_solute = Solute.from_residue(water, {"water": water})
  1. To support this, the solvation_data dataframe will have two additional columns added, a "residue" column and a "solute_name" column. All analysis classes will be refactored to operate on the "residue" column rather than the "solvated_atom" column. This will make no difference for single-atom solutes but will allow the analysis classes to generalize easily. I'm not completely sure the "solute_name" column is necessary, but it would be convenient to have.

  2. When a multi-atom solute is created all of the solvation_data dataframes from each constituent single-atom solute will be merged together. The "residue" column will group together solvated atoms on the same residue such that the analysis classes can operate on the whole solute. The API for accessing the residence classes will be identical.

multi_atom_solute.coordination_number["water"]   # valid property
  1. We will retain all of the single atom Solutes as a property of the multi-atom Solute. This would amount to a rough doubling of the memory footprint, but it would make follow up analysis easier. I'm a bit torn here and there may be a better way.
>>> print(water.atoms)  # what should this be called?
>>> [solute_O, solute_H1, solute_H2]  # maybe this should be a dict?

For a single atom solute the atoms list would still be present but the data within would be identical to the solvation_data of the solute itself. Single atom solutes are now just a special case of multi-atom solutes.

water_O_solute.atoms[0].solvation_data = water_O_solute.solvation_data

I'm sure there are many things I am not considering that will come up later, but as a start, I think this plan will allow the package to be generalized with maximum code reuse. I'd love feedback or suggestions on any aspect of the outline above.

Todos

Notable points that this PR has either accomplished or will accomplish.

Status

codecov[bot] commented 2 years ago

Codecov Report

Merging #75 (732e67d) into main (4f962b8) will increase coverage by 0.75%. The diff coverage is 98.72%.

Additional details and impacted files
orionarcher commented 2 years ago

Hi @hmacdope and @rkingsbury, I wanted to seek feedback on one specific issue. I am planning to rename several columns in solvation_data and I wanted more eyes to make sure it's sensible.

In particular I am removing references to ix in favor of a simpler nomenclature. This is hopefully a bit more approachable while still distinguishing what everything is. Of course, the details will be in the documentation.

frame == frame NEW COLUMN = "solute" solvated_atom => solute_atom atom_ix => solvent_atom dist => distance NEW COLUMN => solute_name resname => solvent_name res_ix => solvent

rkingsbury commented 2 years ago

Hi @hmacdope and @rkingsbury, I wanted to seek feedback on one specific issue. I am planning to rename several columns in solvation_data and I wanted more eyes to make sure it's sensible.

In particular I am removing references to ix in favor of a simpler nomenclature. This is hopefully a bit more approachable while still distinguishing what everything is. Of course, the details will be in the documentation.

frame == frame NEW COLUMN = "solute" solvated_atom => solute_atom atom_ix => solvent_atom dist => distance NEW COLUMN => solute_name resname => solvent_name res_ix => solvent

@orionarcher I think this looks good. Looking only at the column names, it's not immediately clear that solute, solute_atom, solvent, and solvent_atom are going to be numbers (indices) but I think that will be clear in context.

An alternative that would make this slightly more explicit would be to drop the _name suffix for the str fields but use _idx for all the numerical fields, e.g.

NEW COLUMN = "solute_idx" solvated_atom => solute_atom_idx atom_ix => solvent_atom_idx NEW COLUMN => solute resname => solvent res_ix => solvent_idx

I don't have a strong opinion one way or another though

hmacdope commented 2 years ago
  1. Solution will be renamed to Solute. All references to solute in the current documentation will be renamed to solvated atom or solvated atoms. I think this better captures what the Solute class really is, especially as we expand to multi-atom Solutes.

I think this makes sense but if so what does a multi-solute thing become, a solute collection? If the Solute class encompasses multiple species is the nomenclature then slightly confusing? Up to you at the end of the day.

  1. The default initializer for Solute will take only a single atom per residue. It will not support multiple identical atoms on a residue. This will be handled by the more general case. As a result, instantiating a Solute for a single atom remains the same.

Yeah that make sense to me.

water = u.select_atoms(...)
water_O = u.select_atoms(...)
water_O_solute = Solute(water_O, {"water": water})
  1. Additional initializers will be added to instantiate a Solute, these will support multi-atom solutes.

The first will allow the user to stitch together multiple solutes to create a new solute.

water_H1 = u.select_atoms(...)
water_H2 = u.select_atoms(...)
solute_O = Solute(water_O, {"water": water})
solute_H1 = Solute(water_H1, {"water": water})
solute_H2 = Solute(water_H2, {"water": water})

multi_atom_solute = Solute.from_solutes([solute_O, solute_H1, solute_H2])  # maybe this should be a dict?

I like the new construction pattern but again my first instinct that there is no separation between Solute / collection of solutes I find slightly confusing. Perhaps I am misinterpreting what is meant by solute.

The second will allow users to simply instantiate a solute from an entire residue (or part of a residue). There may be technical challenges here so this behavior is not guaranteed.

multi_atom_solute = Solute.from_residue(water, {"water": water})
  1. To support this, the solvation_data dataframe will have two additional columns added, a "residue" column and a "solute_name" column. All analysis classes will be refactored to operate on the "residue" column rather than the "solvated_atom" column. This will make no difference for single-atom solutes but will allow the analysis classes to generalize easily. I'm not completely sure the "solute_name" column is necessary, but it would be convenient to have. Sure!
  2. When a multi-atom solute is created all of the solvation_data dataframes from each constituent single-atom solute will be merged together. The "residue" column will group together solvated atoms on the same residue such that the analysis classes can operate on the whole solute. The API for accessing the residence classes will be identical.
multi_atom_solute.coordination_number["water"]   # valid property

Yep this solves the multi atom issue nicely.

  1. We will retain all of the single atom Solutes as a property of the multi-atom Solute. This would amount to a rough doubling of the memory footprint, but it would make follow up analysis easier. I'm a bit torn here and there may be a better way.
>>> print(water.atoms)  # what should this be called?
>>> [solute_O, solute_H1, solute_H2]  # maybe this should be a dict?

Is water.atoms meant to be the solutes that underly a solute collection solute? If so i would say call it "components" or somthing more geenral to avoid a collision with the MDA pattern u.atoms -> AtomGroup

For a single atom solute the atoms list would still be present but the data within would be identical to the solvation_data of the solute itself. Single atom solutes are now just a special case of multi-atom solutes.

water_O_solute.atoms[0].solvation_data = water_O_solute.solvation_data
hmacdope commented 2 years ago

Hi @hmacdope and @rkingsbury, I wanted to seek feedback on one specific issue. I am planning to rename several columns in solvation_data and I wanted more eyes to make sure it's sensible. In particular I am removing references to ix in favor of a simpler nomenclature. This is hopefully a bit more approachable while still distinguishing what everything is. Of course, the details will be in the documentation. frame == frame NEW COLUMN = "solute" solvated_atom => solute_atom atom_ix => solvent_atom dist => distance NEW COLUMN => solute_name resname => solvent_name res_ix => solvent

@orionarcher I think this looks good. Looking only at the column names, it's not immediately clear that solute, solute_atom, solvent, and solvent_atom are going to be numbers (indices) but I think that will be clear in context.

An alternative that would make this slightly more explicit would be to drop the _name suffix for the str fields but use _idx for all the numerical fields, e.g.

NEW COLUMN = "solute_idx" solvated_atom => solute_atom_idx atom_ix => solvent_atom_idx NEW COLUMN => solute resname => solvent res_ix => solvent_idx

I don't have a strong opinion one way or another though

Agree with @rkingsbury that it would be nice to have a suffix that indicates that something is likely to be numerical vs str based but otherwise i like the new simpler nomenclature.

orionarcher commented 2 years ago

I think this makes sense but if so what does a multi-solute thing become, a solute collection? If the Solute class encompasses multiple species is the nomenclature then slightly confusing? Up to you at the end of the day.

A multi-atom Solute is still just a Solute. The ONLY api difference between a single-atom Solute and a multi-atom Solute is how many Solutes are in the solute.atoms property (though I should rename). For a single-atom Solute, solute.atoms is a dict with only itself.

solute = Solute.from_atoms(solute, solvents, solute_name="solute")
assert solute.atoms["solute"] is solute  # True

Whereas for a multi-atom solute the solute.atoms attribute contains a single-atom Solute for every atom in the Solute. It's Solutes all the way down.

When looking at solvation of a multi-atom solute it's common to want to look at both atom-level attributes, like coordination around a specific functional group, and residue-level attributes, like formation of coordination networks. I like that this provides an identical API for doing both. It's true that some attributes make more sense to look at on a single-atom level, like coordination and speciation, whereas others make sense to look at on a residue level, like pairing and networking. Nonetheless, I'd prefer to leave the distinction up to the user and not impose a specific way of looking at the Solute. Does that make sense?

If its confusing to you it will almost certainly be confusing to users so I think it should be well-justified.

Is water.atoms meant to be the solutes that underly a solute collection solute? If so i would say call it "components" or somthing more geenral to avoid a collision with the MDA pattern u.atoms -> AtomGroup

Maybe solute.solutes as rkingsbury suggested or solute.atom_solutes?

Agree with @rkingsbury that it would be nice to have a suffix that indicates that something is likely to be numerical vs str based but otherwise i like the new simpler nomenclature.

Yeah I agree, perhaps I'll go with the following. Very explicit but still reasonably short.

frame == frame NEW COLUMN = solute_res_ix solvated_atom => solute_atom_ix atom_ix => solvent_atom_ix dist => distance NEW COLUMN => solute resname => solvent res_ix => solvent_res_ix

orionarcher commented 2 years ago

Update:

I've pretty much finished the multi-atom solute functionality as described above. I've got a decent amount of testing complete as well with CodeCov approaching 100%.

What remains is a lot of work on the documentation. The new API is a bit more complicated and the old documentation wasn't great in the first place. I'll probably do a bare minimum rewrite of the docs and tutorials and save more extensive documentation rewrites for another PR.

A few notes on where things landed:

I implemented multi-atom solutes by creating a private _run_solute_atoms method that gets monkey-patched over run when the solute has more than one atom.

There are now three constructors, all three work for both single-atom and multi-atom solutes, it's just a matter of whether they use the default run or the monkey-patched run.

The from_atoms constructor takes an AtomGroup. This is the least flexible constructor and does the most work behind the scenes.

water = u.select_atoms("resname water")
water_O = water.select_atoms("element O")
water_O_solute = Solute.from_atoms(water_O, {"water": water})

# OR

water_solute = Solute.from_atoms(water, {"water": water})

The from_atoms_dict constructor takes a dict of solute names (strings) and AtomGroups. Each AtomGroup needs to only have one atom per solute, which is enforced by assertions. this provides an intermediate level of flexibility because it allows you to name your atom solutes.

water_solute = Solute.from_atoms_dict({"water_O": water_O, "water_H1": water_H1, "water_H2": water_H2}, {"water": water)

Finally, the from_solutes_list is the most flexible because it allows the user to create all the solutes individually with different settings. Different kwargs can be passed to each Solute when it is created (though I am not doing that here). Consistency of the start, step, stop, and solvents parameters is enforced.

water_H1 = u.select_atoms(...)
water_H2 = u.select_atoms(...)
solute_O = Solute(water_O, {"water": water})
solute_H1 = Solute(water_H1, {"water": water})
solute_H2 = Solute(water_H2, {"water": water})

water_solute = Solute.from_solutes_list([solute_O, solute_H1, solute_H2]) 
orionarcher commented 2 years ago

Just updated the documentation, once tests are passing, going to merge!