MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.32k stars 652 forks source link

How to use information in topology files #942

Closed kain88-de closed 7 years ago

kain88-de commented 8 years ago

It has come up in #941 and #815 that we do have a problem with abstracting the away the differences between topology formats because they save different information. There are in two types of problems

1. Set of common properties In #815 I noticed that I right now write code under the assumption that a field like masses is always available even if they are not specified in the topology (eg PDB). While the new topology system only provides data actually present in the topology.

2. properties specific to a format In #941 @jbarnoud noted that he would like to make all information in a topology available to the user. This means going away from being completely format independent.

To me this suggests that we need to make a list of properties that are always available independent of the format and a special property under which we can add format specific properties like so: ag.special.format_properties.

What do others think for this? Should we have properties like masses, charges always available with a guess function for some format?

jbarnoud commented 8 years ago

Since you mention guesses, I link #507 for the record.

orbeckst commented 8 years ago

In general I like being able to work with reasonable automatic guesses if

  1. code informs me (or fails properly) when this cannot be done confidently (we as the developers have to discuss what we consider "confidently")
  2. I have the opportunity to override or provide information to make sure I get exactly what I want, e.g. if I care about calculating dipole moments I must make sure that I get the charges that I consider correct
  3. ideally the code informs me about how to do 2 if 1 cannot be safely done
richardjgowers commented 8 years ago

I think I'm in very much favour of 2. If someone provides a file full of information we should make it all available. Some stuff might not work, but as long as this is properly documented (ie what you need to provide for each calculation) it should be fine. If you only provide an XYZ as your topology, you can't go crying when Resname based stuff doesn't work!

WRT masses, I think we've removed that currently in 363 just because we've not implemented guessers yet. I'd happily backtrack on this because they can be guessed with a fair degree of confidence

kain88-de commented 8 years ago

If you only provide an XYZ as your topology, you can't go crying when Resname based stuff doesn't work!

But we should fail with an error message saying that this isn't available information in the chosen topology instead of just throwing an attribute error.

Another thing is this makes code fragile that depends on MDAnalysis. We should have a defined set of values that have to exist independent of the topology and a set of values that are guessed when non are found. This will help a lot for people who write their own libraries on top of MDAnalysis.

richardjgowers commented 8 years ago

So we could hack into the __getattr__ a list of known possible failures, so if resname fell through to here we could provide a better error message. I don't like the current practice of setting the default segid to "SYSTEM", I'd rather have try/excepts when segids are to be used.

kain88-de commented 8 years ago

I think we should start a wiki page where we document names and what default behavior we expect from them. A wiki page might be easier to handle as an overview than a long issue thread. The discussion can still happen here.

mnmelo commented 8 years ago

In #598 I'm working on implementing the guessers for #363. Some decisions must be made at this point, even after choosing option 2. above. I need feedback for the following:

richardjgowers commented 8 years ago

I guess a good starting point is to try and follow the previous api. So all attributes that had to exist before are handled gracefully when they do not exist. This can probably get done through hacking into __getattr__ and adding a list of special error messages.

Eg if I try and access .resnames from my xyz file, I should get a special message regarding their absence. I prefer this over a placeholder entry, as it doesn't fill up the namespace unnecessarily.

WRT masses, if these are guessed by default, they should still raise some sort of warning on Universe creation (Warning: Guessing masses according to 'Gromos' basis)

Bonds/Angles/Dihedrals are kind of a special case as they don't belong to the Atoms in the same way as masses, so they deserve their own first class position like residues and segments have. If this is then an empty container, that then makes idioms like if u.bonds: work intuitively.

mnmelo commented 8 years ago

I don't agree with the if u.bonds construct by default. It seems pythonic but doesn't tell you whether the container is empty because it wasn't guessed or empty because you have a Lennard-Jones fluid system that really has no bonds.

I'd propose to let the user decide whether to guess bonds or not at Universe creation time (or later). For the case of topologies that do provide bond information, we might include a kwarg that allows the user to specify which bonds to keep: the topology's, the guessed ones, or one complementing the other. After this, if u.bonds will work just the same, but much more meaningfully.

mnmelo commented 8 years ago

Anyway, summarizing this thread: attributes that are absent from the topology at Universe creation time can fall in one of four categories:

  1. Left out of the Topology, with an AttributeError on accession (like the specialized attributes from specific formats @jbarnoud proposes to keep);
  2. Left out of the Topology, but with a nice error on accession, perhaps suggesting how to set/guess values (like resnames);
  3. Added to the topology, set to a dumb default (like currently bonds for .gro or resids for .xyz);
  4. Added to the topology, guessed, possibly with a warning.

Let's then decide which attributes go into each category? What's the best way to gather a consensus vote for this, a shared Google Docs spreadsheet or @kain88-de's suggestion for a Wiki page?

richardjgowers commented 8 years ago

I guess make a wiki page so it's all on the repo still

mnmelo commented 8 years ago

There, I made a wiki page at https://github.com/MDAnalysis/mdanalysis/wiki/Topology-attribute-handling Let's see how it goes.

mnmelo commented 8 years ago

My rationale for the classification I already added was to have a few attributes with VIP status, like mass, charge, and element, which we can expect in a format-independent manner. These are either guessed or left missing if guessing fails. Bond guessing also follows this.

Most other attributes can either be safely set to defaults, or, for the more format-specific ones, left missing.

orbeckst commented 8 years ago

Thanks for the page @mnmelo. I added my categories and and also added a Comments heading below. Maybe paste your comment regarding your rationale into the comments section, too?

Btw, tempfactors and bfactors are the same thing. I'd keep bfactors because this is what we used in the past.

On 5 Sep, 2016, at 11:02, mnmelo notifications@github.com wrote:

My rationale for the classification I already added was to have a few attributes with VIP status, like mass, charge, and element, which we can expect in a format-independent manner. These are either guessed or left missing if guessing fails. Bond guessing also follows this.

Most other attributes can either be safely set to defaults, or, for the more format-specific ones, left missing.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Oliver Beckstein * orbeckst@gmx.net skype: orbeckst * orbeckst@gmail.com

orbeckst commented 8 years ago

@mnmelo can you shepherd the discussion towards a conclusion?

mnmelo commented 8 years ago

Sure! I was waiting for @jbarnoud's input, but he just told me that he's somewhere in between my and @richardjgowers' opinion, so let's leave him as an advisor in case we can't reach a consensus in some cases.

I appreciate your opinions, and patience reading this relatively long post.

As a summary: there is mostly agreement on how to proceed with missing attributes, but there's also some divergent points. As a first approach I assigned the categories democratically, noting the minority reports when appropriate (see below). A number of points might warrant some attribute reassignment:

  1. @richardjgowers raised the point that bond guessing might be too expensive for some systems to be done by default. This of course propagates to other bonded attribute guessing (angles and dihedrals). As per his suggestion we could limit the number of atoms where guessing is done by default, which will put bonds/angles/dihedrals definitely in cat. 5. Speeding up bond guessing will also be a good use of the gridded neighbor search feature, when we get that working in MDAnalysis (see #974).
  2. Another issue that comes up is guesser dependency. Guessing angles, for instance, requires knowing bonds. In turn, bonds guessing requires knowing VdW radii. Which requires knowing types or elements; and so forth. This means that all attributes in the dependency line should be guessable. Should we put all possible dependencies into cat. 5? Or just let the higher order guesser handle it itself? (In this example this would mean that the bond guesser would have the task of solving dependencies).
  3. Some formats already do some guessing on their own. This premature guessing should be replaced by a centralized approach, for a consistent experience. As an example, the HOOMD XML format guesses names and elements from the types attribute.
  4. As a mix of the above points, there might be attributes we do want to guess conditionally. It seems reasonable to set names from types by default, if the former is absent. The reverse is also a good idea (setting types from names, when absent), and elements from any of these. Should then names, types, and elements all be cat. 5?
  5. Finally, some features not explicitly decided on but that I vote for:
    • cat. 2 attributes should always exist (meaning that dir(atomgroup) shows them, and also IPython's tab completion);
    • cat. 5 (guessed) attributes should also always exist, but guessing be made lazy: bonds would only be guessed if accessed, etc. This would mitigate the performance problem raised by @richardjgowers.

Category assignment

Here's the 3-vote democratic category assignment on how to handle missing attributes (see the wiki page for category reference; not all categories were used). Interestingly, @orbeckst always sided with the majority:

1. Left out of the Topology, with a general AttributeError on accession

2. Left out of the Topology, but with a specific MDAnalysis.MissingTopologyAttributeError on accession, perhaps suggesting how to set/guess values

3. Added to the topology, set to a default

5. Added to the topology, guessed — possibly with a warning that it is guessed. If guess fails, warn and leave unassigned (like cat. 2 above)

7. Internal: attribute generated internally that does not depend on topology availability or guessing

richardjgowers commented 8 years ago

Ok I'll try and follow this as a squash more tests.

dotsdl commented 8 years ago

@mnmelo I'm in agreement with this summary. Thanks for taking the time to compile this in a straightforward way. I think the business of figuring out how missing attributes should be handled is really quite fuzzy, but this is very clear despite that.

jbarnoud commented 8 years ago

Great! I may fall with the minority reports for 'types' and 'radii', but they are the two attributes I care the less about so you can safely ignore me.

I have two questions, though:

mnmelo commented 8 years ago

So, this is what I meant in point 2 above, and I need your feedback to decide: either we make element cat. 5, meaning we'll try to automatically guess it when needed, too, or we let the radii guesser do that job (see below for a comment on type).

I have also given some thought to your other question. One can always use the pythonic way of try:/except AttributeError:. But in practice this might not always be desirable. Having something like atomgroup.bonds.exists won't work because .bonds is hidden behind a @property decorator and we'd be querying the return value for exists instead. The cleanest I could come up with is to query the topology directly, as in universe.topology.bonds.exists or even give the TopologyAttr boolean nature: if universe.topology.bonds: .... Again, feedback wanted.

About types and elements. I'm not sure types depend univocally on the elements: a carbon atom in GROMOS can be many different types. Also, I have the feeling that the information content of type is not consistent across formats or even forcefields (think Martini).

richardjgowers commented 8 years ago

Any guessed attributes will have to issue a warning on Universe creation. I guess for debug purposes I could also add the fact that the attributes were guessed into the ToplogyAttr object itself too.

I'm not sure on failed guesses, you could set them to a default (ie charges = 0.0?) but I prefer getting an actual NoDataError rather than thinking you have data when really it's just junk. So inside a parser you'd have something like

warnings.warn("Trying to guess masses...")
try:
    masses = guess_masses(atoms)
except NoDataError:
    warnings.warn("Couldn't guess masses, sorry")
else:
    attrs.append(masses)  # masses only enter Topology if guessed

Ultimately there'll have to be a bit of defensive programming at the start of analysis to check that all the required attributes exist.. eg

def my_amazing_mass_thing(atoms):
    try:
        masses = atoms.masses
    except NoDataError:
        raise ValueError("Sorry, I need masses")

We could even make decorators that do this @requires('masses', 'charges') to reduce boilerplate.

jbarnoud commented 8 years ago

If a guess fails, I expect to get a NoDataError, default dummy values are utterly misleading. If an attribute on category 2 fails to be guessed, I would like the error to reflect it. My concern is the poor user that get the MissingTopologyAttributeError, run the guesser as suggested, only to get the MissingTopologyAttributeError again.

either we make element cat. 5, meaning we'll try to automatically guess it when needed, too, or we let the radii guesser do that job (see below for a comment on type).

types is very format and force field specific. In most cases, I would not trust it anyway. Hence my lack of opinion about it. The same goes for radii to a lesser extend.

If the masses remain in category 5, then elements should be there too as masses will most likely pull element with it. It should actually pull types, but types is meaningless without context.

Something like atomgroup.bonds.exists would not work for category 1. I would go for a universe.topology.gessed_attributes list so I can do something like 'bonds' in universe.topology.guessed_attributes if I want to know if bonds are guessed. A universe.topology.read_attributes may be useful too.

I like @richardjgowers decorator.

mnmelo commented 8 years ago

And what's your take on lazy guessing? That way we only guess when accessed (it avoids spurious warnings on Universe loading about attributes that have nothing to do with the analysis at hand).

jbarnoud commented 8 years ago

I like lazy guessing! It makes it even more interesting to know before hand if an attribute exists, though. If I know bond guessing can be slow and I can do without it if needed, then I want to check bonds are there before I trigger the guesser unexpectedly.

mnmelo commented 8 years ago

@jbarnoud: The idea is that failing guessers raise warnings. This way the poor user will ave some more feedback.

Regarding cat. 1 vs. cat 2. behavior: cat. 2 are VIP attributes that we decided are always present. Cat. 1 attributes (icodes, chocolate_propensity and such) have to be handled in a try:/except: way. The difference comes with the classification we introduced, otherwise everyone is a VIP and we go back to square 1.

We can think of an attribute state as well, which would solve your knowing-if-there-are-bonds-beforehand problem:

u = mda.Universe(top, traj)
# This just tells you whether bonds exist and won't blow up with a
#  MissingTopologyAttributeError in your face if they don't.
if u.topology.bonds: 
    ...
# If there are no bonds I'm not sure whether this should return False
#  or raise an AttributeError.
if u.topology.bonds.guessed:
    ...
# Non VIP attributes need a different approach:
try:
    attr = u.topology.chocolate_propensity
except AttributeError:
    ...
#or
if hasattr(u.topology, 'chocolate_propensity'):
    ...

What do you think?

mnmelo commented 8 years ago

Also, to help the end user in the non-VIP cases, we can have a check in AtomGroup.__getattr__ to see if a failed access points to a known attribute name, but absent from the current topology. Something like:

AttributeError: Access to attribute 'icode' was attempted, but the currently used
topology format (GRO, from 'filename.gro') doesn't provide it. This attribute is
typically provided by the PDB and so_and_so formats.

and

AttributeError: Access to attribute 'icode' was attempted, but the currently used
topology file doesn't provide it. The used topology format (PDB) can provide the
'icode' attribute, but that field is empty in 'filename.pdb'.

Looks fancy but all that's really needed is a table linking formats and non-VIP attribute names.

jbarnoud commented 8 years ago

@jbarnoud: The idea is that failing guessers raise warnings. This way the poor user will ave some more feedback.

Hopefully it will be enough to prevent a poor user entering an infinite loop. Your fancy exception are more inline with what I had in mind.

Regarding cat. 1 vs. cat 2. behavior: cat. 2 are VIP attributes that we decided are always present. Cat. 1 attributes (icodes, chocolate_propensity and such) have to be handled in a try:/except: way. The difference comes with the classification we introduced, otherwise everyone is a VIP and we go back to square 1.

I guess you are answering to:

Something like atomgroup.bonds.exists would not work for category 1.

My point is that keeping a list of the attribute that are read from the topology and of those that are guessed would be an easy way to see what happens with the topology, independently from the categories. Dealing with exceptions will of course have to deal with the categories.

mnmelo commented 8 years ago

Ok, I guess that's possible: a list for what's from the topology file and a list for what's guessed. Technically, we can have attributes from the topology that we later override by guessing/setting by hand. Should this be in both lists?

Or should the lists be slightly different? As in: a list for every available attribute (irrespective of how it was added to the topology) and a list for which of those were guessed.

jbarnoud commented 8 years ago

Or should the lists be slightly different? As in: a list for every available attribute (irrespective of how it was added to the topology) and a list for which of those were guessed.

topology.available_attributes would be topology.read_attributes + topology.guessed_attributes. It does not cost much to have it.