Closed orionarcher closed 2 years ago
Hi @hmacdope and @rkingsbury, I wanted to seek feedback on one specific issue. I am planning to rename several columns in solvation_data
and I wanted more eyes to make sure it's sensible.
In particular I am removing references to ix
in favor of a simpler nomenclature. This is hopefully a bit more approachable while still distinguishing what everything is. Of course, the details will be in the documentation.
frame == frame NEW COLUMN = "solute" solvated_atom => solute_atom atom_ix => solvent_atom dist => distance NEW COLUMN => solute_name resname => solvent_name res_ix => solvent
Hi @hmacdope and @rkingsbury, I wanted to seek feedback on one specific issue. I am planning to rename several columns in
solvation_data
and I wanted more eyes to make sure it's sensible.In particular I am removing references to
ix
in favor of a simpler nomenclature. This is hopefully a bit more approachable while still distinguishing what everything is. Of course, the details will be in the documentation.frame == frame NEW COLUMN = "solute" solvated_atom => solute_atom atom_ix => solvent_atom dist => distance NEW COLUMN => solute_name resname => solvent_name res_ix => solvent
@orionarcher I think this looks good. Looking only at the column names, it's not immediately clear that solute
, solute_atom
, solvent
, and solvent_atom
are going to be numbers (indices) but I think that will be clear in context.
An alternative that would make this slightly more explicit would be to drop the _name
suffix for the str fields but use _idx
for all the numerical fields, e.g.
NEW COLUMN = "solute_idx" solvated_atom => solute_atom_idx atom_ix => solvent_atom_idx NEW COLUMN => solute resname => solvent res_ix => solvent_idx
I don't have a strong opinion one way or another though
Solution
will be renamed toSolute
. All references tosolute
in the current documentation will be renamed tosolvated atom
orsolvated atoms
. I think this better captures what theSolute
class really is, especially as we expand to multi-atomSolute
s.
I think this makes sense but if so what does a multi-solute thing become, a solute collection? If the Solute class encompasses multiple species is the nomenclature then slightly confusing? Up to you at the end of the day.
- The default initializer for
Solute
will take only a single atom per residue. It will not support multiple identical atoms on a residue. This will be handled by the more general case. As a result, instantiating a Solute for a single atom remains the same.
Yeah that make sense to me.
water = u.select_atoms(...) water_O = u.select_atoms(...) water_O_solute = Solute(water_O, {"water": water})
- Additional initializers will be added to instantiate a
Solute
, these will support multi-atom solutes.The first will allow the user to stitch together multiple solutes to create a new solute.
water_H1 = u.select_atoms(...) water_H2 = u.select_atoms(...) solute_O = Solute(water_O, {"water": water}) solute_H1 = Solute(water_H1, {"water": water}) solute_H2 = Solute(water_H2, {"water": water}) multi_atom_solute = Solute.from_solutes([solute_O, solute_H1, solute_H2]) # maybe this should be a dict?
I like the new construction pattern but again my first instinct that there is no separation between Solute / collection of solutes I find slightly confusing. Perhaps I am misinterpreting what is meant by solute.
The second will allow users to simply instantiate a solute from an entire residue (or part of a residue). There may be technical challenges here so this behavior is not guaranteed.
multi_atom_solute = Solute.from_residue(water, {"water": water})
- To support this, the
solvation_data
dataframe will have two additional columns added, a "residue" column and a "solute_name" column. All analysis classes will be refactored to operate on the "residue" column rather than the "solvated_atom" column. This will make no difference for single-atom solutes but will allow the analysis classes to generalize easily. I'm not completely sure the "solute_name" column is necessary, but it would be convenient to have. Sure!- When a multi-atom solute is created all of the
solvation_data
dataframes from each constituent single-atom solute will be merged together. The "residue" column will group together solvated atoms on the same residue such that the analysis classes can operate on the whole solute. The API for accessing the residence classes will be identical.multi_atom_solute.coordination_number["water"] # valid property
Yep this solves the multi atom issue nicely.
- We will retain all of the single atom
Solute
s as a property of the multi-atomSolute
. This would amount to a rough doubling of the memory footprint, but it would make follow up analysis easier. I'm a bit torn here and there may be a better way.>>> print(water.atoms) # what should this be called? >>> [solute_O, solute_H1, solute_H2] # maybe this should be a dict?
Is water.atoms meant to be the solutes that underly a solute collection solute? If so i would say call it "components" or somthing more geenral to avoid a collision with the MDA pattern u.atoms -> AtomGroup
For a single atom solute the
atoms
list would still be present but the data within would be identical to thesolvation_data
of the solute itself. Single atom solutes are now just a special case of multi-atom solutes.water_O_solute.atoms[0].solvation_data = water_O_solute.solvation_data
Hi @hmacdope and @rkingsbury, I wanted to seek feedback on one specific issue. I am planning to rename several columns in
solvation_data
and I wanted more eyes to make sure it's sensible. In particular I am removing references toix
in favor of a simpler nomenclature. This is hopefully a bit more approachable while still distinguishing what everything is. Of course, the details will be in the documentation. frame == frame NEW COLUMN = "solute" solvated_atom => solute_atom atom_ix => solvent_atom dist => distance NEW COLUMN => solute_name resname => solvent_name res_ix => solvent@orionarcher I think this looks good. Looking only at the column names, it's not immediately clear that
solute
,solute_atom
,solvent
, andsolvent_atom
are going to be numbers (indices) but I think that will be clear in context.An alternative that would make this slightly more explicit would be to drop the
_name
suffix for the str fields but use_idx
for all the numerical fields, e.g.NEW COLUMN = "solute_idx" solvated_atom => solute_atom_idx atom_ix => solvent_atom_idx NEW COLUMN => solute resname => solvent res_ix => solvent_idx
I don't have a strong opinion one way or another though
Agree with @rkingsbury that it would be nice to have a suffix that indicates that something is likely to be numerical vs str based but otherwise i like the new simpler nomenclature.
I think this makes sense but if so what does a multi-solute thing become, a solute collection? If the Solute class encompasses multiple species is the nomenclature then slightly confusing? Up to you at the end of the day.
A multi-atom Solute is still just a Solute. The ONLY api difference between a single-atom Solute and a multi-atom Solute is how many Solutes are in the solute.atoms
property (though I should rename). For a single-atom Solute, solute.atoms
is a dict with only itself.
solute = Solute.from_atoms(solute, solvents, solute_name="solute")
assert solute.atoms["solute"] is solute # True
Whereas for a multi-atom solute the solute.atoms attribute contains a single-atom Solute for every atom in the Solute. It's Solutes all the way down.
When looking at solvation of a multi-atom solute it's common to want to look at both atom-level attributes, like coordination around a specific functional group, and residue-level attributes, like formation of coordination networks. I like that this provides an identical API for doing both. It's true that some attributes make more sense to look at on a single-atom level, like coordination and speciation, whereas others make sense to look at on a residue level, like pairing and networking. Nonetheless, I'd prefer to leave the distinction up to the user and not impose a specific way of looking at the Solute. Does that make sense?
If its confusing to you it will almost certainly be confusing to users so I think it should be well-justified.
Is water.atoms meant to be the solutes that underly a solute collection solute? If so i would say call it "components" or somthing more geenral to avoid a collision with the MDA pattern u.atoms -> AtomGroup
Maybe solute.solutes
as rkingsbury suggested or solute.atom_solutes
?
Agree with @rkingsbury that it would be nice to have a suffix that indicates that something is likely to be numerical vs str based but otherwise i like the new simpler nomenclature.
Yeah I agree, perhaps I'll go with the following. Very explicit but still reasonably short.
frame == frame NEW COLUMN = solute_res_ix solvated_atom => solute_atom_ix atom_ix => solvent_atom_ix dist => distance NEW COLUMN => solute resname => solvent res_ix => solvent_res_ix
Update:
I've pretty much finished the multi-atom solute functionality as described above. I've got a decent amount of testing complete as well with CodeCov approaching 100%.
What remains is a lot of work on the documentation. The new API is a bit more complicated and the old documentation wasn't great in the first place. I'll probably do a bare minimum rewrite of the docs and tutorials and save more extensive documentation rewrites for another PR.
A few notes on where things landed:
I implemented multi-atom solutes by creating a private _run_solute_atoms
method that gets monkey-patched over run
when the solute has more than one atom.
There are now three constructors, all three work for both single-atom and multi-atom solutes, it's just a matter of whether they use the default run
or the monkey-patched run
.
The from_atoms
constructor takes an AtomGroup
. This is the least flexible constructor and does the most work behind the scenes.
water = u.select_atoms("resname water")
water_O = water.select_atoms("element O")
water_O_solute = Solute.from_atoms(water_O, {"water": water})
# OR
water_solute = Solute.from_atoms(water, {"water": water})
The from_atoms_dict
constructor takes a dict of solute names (strings) and AtomGroups
. Each AtomGroup
needs to only have one atom per solute, which is enforced by assertions. this provides an intermediate level of flexibility because it allows you to name your atom solutes.
water_solute = Solute.from_atoms_dict({"water_O": water_O, "water_H1": water_H1, "water_H2": water_H2}, {"water": water)
Finally, the from_solutes_list
is the most flexible because it allows the user to create all the solutes individually with different settings. Different kwargs can be passed to each Solute when it is created (though I am not doing that here). Consistency of the start, step, stop, and solvents parameters is enforced.
water_H1 = u.select_atoms(...)
water_H2 = u.select_atoms(...)
solute_O = Solute(water_O, {"water": water})
solute_H1 = Solute(water_H1, {"water": water})
solute_H2 = Solute(water_H2, {"water": water})
water_solute = Solute.from_solutes_list([solute_O, solute_H1, solute_H2])
Just updated the documentation, once tests are passing, going to merge!
Description
This PR was split off from PR #70 to better separate quality-of-life changes from API changes. See that PR for some history and discussion.
This is intended to be a major PR to handle multi-atom solutes. It relates to issues #47, #66, and #58.
I would like to propose the following outline of new functionality. In this description, I will focus on the outward facing API. I'll use the somewhat trivial case of water as an example.
Solution
will be renamed toSolute
. All references tosolute
in the current documentation will be renamed tosolvated atom
orsolvated atoms
. I think this better captures what theSolute
class really is, especially as we expand to multi-atomSolute
s.The default initializer for
Solute
will take only a single atom per residue. It will not support multiple identical atoms on a residue. This will be handled by the more general case. As a result, instantiating a Solute for a single atom remains the same.Solute
, these will support multi-atom solutes.The first will allow the user to stitch together multiple solutes to create a new solute.
The second will allow users to simply instantiate a solute from an entire residue (or part of a residue). There may be technical challenges here so this behavior is not guaranteed.
To support this, the
solvation_data
dataframe will have two additional columns added, a "residue" column and a "solute_name" column. All analysis classes will be refactored to operate on the "residue" column rather than the "solvated_atom" column. This will make no difference for single-atom solutes but will allow the analysis classes to generalize easily. I'm not completely sure the "solute_name" column is necessary, but it would be convenient to have.When a multi-atom solute is created all of the
solvation_data
dataframes from each constituent single-atom solute will be merged together. The "residue" column will group together solvated atoms on the same residue such that the analysis classes can operate on the whole solute. The API for accessing the residence classes will be identical.Solute
s as a property of the multi-atomSolute
. This would amount to a rough doubling of the memory footprint, but it would make follow up analysis easier. I'm a bit torn here and there may be a better way.For a single atom solute the
atoms
list would still be present but the data within would be identical to thesolvation_data
of the solute itself. Single atom solutes are now just a special case of multi-atom solutes.I'm sure there are many things I am not considering that will come up later, but as a start, I think this plan will allow the package to be generalized with maximum code reuse. I'd love feedback or suggestions on any aspect of the outline above.
Todos
Notable points that this PR has either accomplished or will accomplish.
Status