orionarcher commented 2 years ago

UPDATE:

This PR now implements a number of quality of life changes and solves issue #31. The proposed multi-atom solute changes will be implemented in another PR. See PR #72 for the new changes!

Todos

Notable points that this PR has either accomplished or will accomplish.

[x] add new testing data with a multi-atom solute
[x] remove foolish internal numbering scheme of solvated_atoms
[x] allow solutes to also act as solvents
[x] replace all column name strings with variables stored in a column_names.py file

Status

[x] Ready to go

Description

This is intended to be a major PR to handle multi-atom solutes. It relates to issues #47, #31, #66, and #58.

I would like to propose the following outline of new functionality. In this description, I will focus on the outward facing API. I'll use the somewhat trivial case of water as an example.

Solution will be renamed to Solute. All references to solute in the current documentation will be renamed to solvated atom or solvated atoms. I think this better captures what the Solute class really is, especially as we expand to multi-atom Solutes.
The default initializer for Solute will take only a single atom per residue. It will not support multiple identical atoms on a residue. This will be handled by the more general case. As a result, instantiating a Solute for a single atom remains the same. (note that I have already fixed the case with self-solvation identified in issue #31)

water = u.select_atoms(...)
water_O = u.select_atoms(...)
water_O_solute = Solute(water_O, {"water": water})

Additional initializers will be added to instantiate a Solute, these will support multi-atom solutes.

The first will allow the user to stitch together multiple solutes to create a new solute.

water_H1 = u.select_atoms(...)
water_H2 = u.select_atoms(...)
solute_O = Solute(water_O, {"water": water})
solute_H1 = Solute(water_H1, {"water": water})
solute_H2 = Solute(water_H2, {"water": water})

multi_atom_solute = Solute.from_solutes([solute_O, solute_H1, solute_H2])  # maybe this should be a dict?

The second will allow users to simply instantiate a solute from an entire residue (or part of a residue). There may be technical challenges here so this behavior is not guaranteed.

multi_atom_solute = Solute.from_residue(water, {"water": water})

To support this, the solvation_data dataframe will have two additional columns added, a "residue" column and a "solute_name" column. All analysis classes will be refactored to operate on the "residue" column rather than the "solvated_atom" column. This will make no difference for single-atom solutes but will allow the analysis classes to generalize easily. I'm not completely sure the "solute_name" column is necessary, but it would be convenient to have.
When a multi-atom solute is created all of the solvation_data dataframes from each constituent single-atom solute will be merged together. The "residue" column will group together solvated atoms on the same residue such that the analysis classes can operate on the whole solute. The API for accessing the residence classes will be identical.

multi_atom_solute.coordination_number["water"]   # valid property

We will retain all of the single atom Solutes as a property of the multi-atom Solute. This would amount to a rough doubling of the memory footprint, but it would make follow up analysis easier. I'm a bit torn here and there may be a better way.

>>> print(water.atoms)  # what should this be called?
>>> [solute_O, solute_H1, solute_H2]  # maybe this should be a dict?

For a single atom solute the atoms list would still be present but the data within would be identical to the solvation_data of the solute itself. Single atom solutes are now just a special case of multi-atom solutes.

water_O_solute.atoms[0].solvation_data = water_O_solute.solvation_data

I'm sure there are many things I am not considering that will come up later, but as a start, I think this plan will allow the package to be generalized with maximum code reuse. I'd love feedback or suggestions on any aspect of the outline above.

Todos

Notable points that this PR has either accomplished or will accomplish.

[x] add new testing data with a multi-atom solute
[x] remove foolish internal numbering scheme of solvated_atoms
[x] allow solutes to also act as solvents
[x] replace all column name strings with variables stored in a column_names.py file
[ ] make solutions composable so that multi-atom solutes can be constructed systematically
[ ] put guardrails in place to prevent misuse

Status

[ ] Ready to go

orionarcher commented 2 years ago

Hey @rkingsbury and @hmacdope, I just outlined my current plan for the solvation_analysis multi-atom rework and I'd love your feedback when you have a chance. Please let me know if anything is unclear or there is anywhere I should expand. This is the last major PR before v0.2 so I'd like to get it right.

Tests are failing right now but due to a relatively silly issue.

hmacdope commented 2 years ago

Looks awesome @orionarcher, I will need to go through in detail, will try to cover this week. Ping me again if I forget :)

rkingsbury commented 2 years ago

Looks good @orionarcher ; thanks for the well-thought out explanation. Here are a few comments on your proposed API.

Solution will be renamed to Solute. All references to solute in the current documentation will be renamed to solvated atom or solvated atoms. I think this better captures what the Solute class really is, especially as we expand to multi-atom Solutes.

The default initializer for Solute will take only a single atom per residue. It will not support multiple identical atoms on a residue. This will be handled by the more general case. As a result, instantiating a Solute for a single atom remains the same. (note that I have already fixed the case with self-solvation identified in issue make solute also be able to be a solvent #31)
water = u.select_atoms(...)
water_O = u.select_atoms(...)
water_O_solute = Solute(water_O, {"water": water})
Additional initializers will be added to instantiate a Solute, these will support multi-atom solutes.

The first will allow the user to stitch together multiple solutes to create a new solute.
water_H1 = u.select_atoms(...)
water_H2 = u.select_atoms(...)
solute_O = Solute(water_O, {"water": water})
solute_H1 = Solute(water_H1, {"water": water})
solute_H2 = Solute(water_H2, {"water": water})

multi_atom_solute = Solute.from_solutes([solute_O, solute_H1, solute_H2])  # maybe this should be a dict?
The second will allow users to simply instantiate a solute from an entire residue (or part of a residue). There may be technical challenges here so this behavior is not guaranteed.
multi_atom_solute = Solute.from_residue(water, {"water": water})

I feel that it might be confusing to have 3 init methods based on the solute type, esp. if 1 is a default Solute(...) and the other two require from_xxx methods. It's workable but not ideal. At a minimum, I would say add a Solute.from_atom that replicates the default __init__ behavior, just so there is consistency (i.e., you can always init a Solute using a from_xxx method). Even better in my opinion would be to modify the default __init__ method with type checking so that you can just use Solute(....) to init the object regardless of the type of residue.

Alternatively, perhaps it would make sense to due a superclass, e.g. MultiAtomSolute(Solute) and have its __init__ method handle the two cases you enumerate?

In general, I'd say whatever minimizes the number of classes/methods that a user has to become familiar with is the best choice for usability.



I'm not completely sure the "solute_name" column is necessary, but it would be convenient to have.

Yes I think this would be very useful.

We will retain all of the single atom Solutes as a property of the multi-atom Solute. This would amount to a rough doubling of the memory footprint, but it would make follow up analysis easier. I'm a bit torn here and there may be a better way.

Agreed; since you have to do the analysis anyway it makes sense to retain the data somehow. Note sure about the best way to achieve that. To address memory concerns, you could provide an optional kwarg (Default True) to retain this data.

>>> print(water.atoms)  # what should this be called?
>>> [solute_O, solute_H1, solute_H2]  # maybe this should be a dict?

What about water.solutes?

As for the return type, perhaps a dict of {solute_name: Solute}?

orionarcher commented 2 years ago

Thanks for the comments!

Even better in my opinion would be to modify the default init method with type checking so that you can just use Solute(....) to init the object regardless of the type of residue.

I quite like this idea. I think this is a much better solution than what I proposed. I'll plan to implement this with three from_x methods, as you suggested.

Alternatively, perhaps it would make sense to due a superclass, e.g. MultiAtomSolute(Solute) and have its init method handle the two cases you enumerate?

This is definitely something I've been thinking about. Probably the way I'd want to do it would be to have a SingleAtomSolute as the special case and then have Solute as the default object that users are working with. SingleAtomSolute could be a strictly internal representation that is stored in the Solute.atoms attribute. On one hand, this would allow for a cleaner separation of the "building block" usage of solute and the "full solute" usage of solute. But on the other hand, it might lead to some code duplication and unneeded complexity. For the from_solutes instantiation, I don't like the idea of users having to first work with a SingleAtomSolute and then switch to this Solute object. That alone might tip me towards a single class implementation, but I'd love to hear thoughts.

What about water.solutes? As for the return type, perhaps a dict of {solute_name: Solute}?

I like these both!

codecov[bot] commented 2 years ago

Codecov Report

Merging #70 (57133e3) into main (4972b64) will increase coverage by 0.12%. The diff coverage is 100.00%.

Additional details and impacted files

orionarcher commented 2 years ago

I just made several major quality of life changes that will have little impact on the substance of this PR. They will however, make the "Files changed" extremely chaotic. I am going to merge this PR and start a new one focused on the multi-atom solutes issues. I'll link back to this PR to maintain context and copy over the initial post. See PR #72 for the new changes.

hmacdope commented 2 years ago

Okie dokie, sorry I was a bit slow.

MDAnalysis / solvation-analysis

Enhancements to support composable Solutions and self-solvating solutes #70

UPDATE:

Todos

Status

Description

Todos

Status

Codecov Report