Add contexts to guess attributes better (especially elements and masses)

lilyminium commented 4 years ago

I started writing this as a comment in #2553 but it got, wow, really long. Let me know if I should move it there to carry on the discussion, or to #598.

Is your feature request related to a problem?

I think any effort to work with element-dependent stuff like finding hydrogens from H-bond donors (#2521) will be hampered by the element/type guesser, which is not good (...noting that I wrote the current version). Workarounds like looking at mass only work if the mass is not derived from an incorrectly guessed element. Few file formats directly provide element information, so it needs to be guessed from other information. In practice elements are only guessed from atom names, leading to all sorts of fun results.

Here I round up a bit of what has been discussed in the past and (re-)propose how it could be improved. This is more of an initial idea to gather suggestions than a real solution. Please let me know what you think!

Summary

The problem of guessing elements (and topology attributes in general) has come up many times (#598, #942, #1808, #2331, #2348, #2364, #2553). Some non-comprehensive history of discussion and current state of affairs:

masses and elements should be added to all topologies, and guessed if they are not in the file (currently: masses are guessed or read from every topology format. Elements are only in TOPParser)
- masses are too important for analysis methods to leave out
Currently we always guess mass if it's not found.
- If we use guess_masses to do it, we simply check if the input (an atom type or name) is an element. If not, the mass defaults to 0.0 and a warning is raised.
- If we use [guess_atom_mass(a.type) for a in ag], it passes the atom type to guess_atom_element() which does some string fudging to see if subsets of the string might be an element. If an element is not found, it just returns the original input, but with symbols stripped. No warning is raised by guess_atom_element, but validate_atom_types does.
In some cases where the file format might contain element information, we ignore it and still guess elements -> types -> masses from the name (e.g. PDBParser -- being fixed now) leading to predictable grief for, e.g. a calcium atom called CA
The element guesser does not work very well. e.g. it turns virtual sites into phosphorus It should be able to look at multiple sources of information at once.
The mandatory atom types attribute is populated by guessed elements in a significant number of parsers
Open question: What to do with CG/HMR/otherwise non-straightforward atomistic topologies?
- One option: have force-field aware guessers

In #2553 there has been interesting discussion of how to guess elements appropriately, and what we can infer from them (e.g. element <-> mass is no longer so straightforward, given HMR systems).

Describe the solution you'd like

We need a better element guesser (which then results in better guessing of related properties). With that solution, (imo) we should consider these:

Users should be able to actively choose whether to guess which properties
Users should know when attributes are guessed (via warnings)
File information should always be prioritised over guessing.
- What if the file information is non-physical, (e.g. atomic number of -1 or element of 'X' or negative mass?) For elements, @xiki-tempula suggests sticking that into the atom type if the atom type is not already known, and ignoring it if the atom type is defined.
we should be able to choose what to guess from -- names, masses, etc.

And these would be nice to have:

be easily callable after Universe/topology creation from files
be easily modifiable to return results for atomistic vs coarse-grained vs HMR vs any other funky new system
modifications are easily saved to file / repeatable
doesn't give me a million warnings

Proposal

This is just @jbarnoud and @mnmelo's proposal but with more spitballing --

Add a Context class:

Basically a database for straightforward element-type-mass-name-radii-etc relationships. Something like a Pandas dataframe rather than dicts, to support looking up any attribute by any other attribute, e.g.	Name	Element	Mass	Type
CA	C	12	C
CA	Ca	40.087	Ca

Should be easily subclassed or otherwise modified by users. Users should be able to add new attributes (columns) to the table, and attribute combinations (rows), with minimal fuss
Should be read/writable to file
- another thing for interoperability -- would be nice if we could read force fields from the MD software formats
Registered with the same metaclass trick as Parsers and Readers

Add more flexible guesser methods
- one guesser method per attribute, but accepts arbitrary attribute information (i.e. not just names)
- guessing of certain attributes (e.g. elements) is still done within individual parsers but users can pass in the context
- Warnings are implemented inside the guesser methods so we don't have to remember to put them in each parser.
- guesser methods guess a list of values (i.e. guess_elements vs. guess_element), so we only get the warning once, and it's probably faster

These could either be part of the Context class, be their own class, or (my favourite) be class methods of TopologyAttr. I don't think bundling them with Context is so helpful because most of the CG/HMR issues can be solved by just changing the values in the table. This also means we don't have to validate categorical values like elements ourselves, we just look them up in the table.

API

An API that matches add_TopologyAttr would be convenient for the user:

    >>> u = mda.Universe(PSF, DCD)  # <-- just a protein
    >>> u.guess_TopologyAttr('elements', from_attr=['names'], context='no-alpha-carbons')
    >>> u.atoms[u.atoms.names=='CA'].elements
    array(['Ca', 'Ca', ..., 'Ca'])
    >>> u.guess_TopologyAttr('elements', from_attr=['masses'], context='no-alpha-carbons')
    >>> u.atoms[u.atoms.names=='CA'].elements
    array(['C', 'C', ..., 'C'])

and developers when working with Parsers, which all contribute different information:

    el = Elements.guess_from(names=[...], resnames=[...])  # <-- no need to write guessed=True

Additional context

Below is an example implementation that could work well with that desired API. It is far from the real thing, just to show what I'm thinking.

Context class that contains the data and looks up close matches and stuff

able to look up any attribute from any combination of other attributes
can return close matches, again for any attribute, from any other attribute. e.g. get the element from a close match to the mass
would be nice: matching ranges of values (e.g. HE atom is H if mass is < 4)
may need different matching methods based on the type of the value passed in

class Context(metaclass=ContextMeta):

    df = pd.DataFrame(my_data)
    name = 'base'

    def lookup(self, attrname, strict=False, **from_attrs):
        """Return exact matches for variable number of attributes"""

        # get overlapping info
        info_attrs = [a for a in from_attrs if a in self.df.columns]
        if not info_attrs:
            raise ValueError('No valid guessing information given')

        # set up results
        values = np.array([from_attrs[a] for a in info_attrs]).T
        found = np.array([None]*values.shape[0], dtype=object)

        # group dataframe by info, get the first match for each combination
        columns = info_attrs+[attrname]
        first_df = self.df.groupby(*info_attrs, sort=False).first()
        first_arr = first_df[columns].values

        if strict:
            valid = np.ones_like(values, dtype=bool)  # match all exactly
        else:
            valid = (values!=None) & (first_arr!=None)  # or something like this
                                                        # that caters for charge==0 etc
        masked = values[valid]
        for i, row in enumerate(first_arr[valid]):
            matches = np.all(masked == row[:-1], axis=1)
            found[matches] = row[-1]
        return found

    def find_closest_match(self, attrname, value, return_attr=None, threshold=None):
        """Return close matches or None"""

        if return_attr is None:
            return_attr = attrname

        options = self.df[attrname].values
        if np.issubdtype(options.dtype, np.number):
            i = np.argmin(np.abs(options-value))
            # probably die if difference is greater than a % threshold

        elif np.issubdtype(options.dtype, np.character):
            found_idx = [[]]
            # try different capitalisation
            try_values = [value, value.capitalize(), value.upper(), value.lower()]

            # look for element names in the given value
            while len(found_idx[0]) == 0 and try_values:
                v = try_values.pop(0)
                start = np.char.find(options, value)
                found_idx = np.nonzero(start!=-1)
            if len(found_idx[0]) == 0:
                # not found, return None
                return

            # pick closest match (lowest number of other letters), then earliest
            remaining = np.str_len(options[found_idx]) - len(value)
            r, i_r, n_r = np.unique(remaining, return_counts=True, return_inverse=True)
            if n_r[0] > 1:
                lowest_indices = i_r[0]
                start_i = np.argmin(start[found_idx[lowest_indices]])
                i = found_idx[lowest_indices[start_i]]
            else:
                i = found_idx[i_r[0]]
        else:
            warnings.warn('What other dtypes does MDAnalysis have?')
            return

        return self.df[return_attr].values[i]

Example guessing method for Elements

This is actually pretty general and I guess there could be some base method where you pass in match_exact and match_similar. It takes both instantiated TopologyAttr objects and keyword-named arrays, and returns an Element instance.

@classmethod
def guess_from(cls, *args, context='base', **kwargs):
    info = dict((a.attrname, a.values) for a in args)
    for attrname, values in kwargs.items():
        info[attrname] = np.asarray(values)

    if not info:
        raise ValueError('Nothing to guess from')

    my_context = _CONTEXTS[context]

    n_atoms = len(next(iter(info.values())))
    values = np.array([None]*n_atoms, dtype=object)
    missing = np.ones_like(values, dtype=bool)
    info_used = set()

    # exact types > identifying masses/radii > names
    match_exact = [('types',),
                   ('masses', 'radii'),
                   ('names', 'resnames')]

    while np.any(missing) and match_exact:
        names = match_exact.pop(0)
        attr_info = dict((k, info[k][missing]) for k in names if k in info)
        if attr_info:
            found = my_context.lookup('elements', **attr_info)
            valid = found!=None
            if np.any(valid):
                values[missing][valid] = found[valid]
                info_used |= set(names)
                missing = values == None

    match_similar = ['names', 'types']

    while np.any(missing) and match_similar:
        name = match_similar.pop(0)
        if name in info:
            to_find, idx = np.unique(info[name][missing], return_inverse=True)
            matches = [my_context.find_closest_match('elements', n) for n in to_find]
            matches = np.array(matches)[idx]
            valid = matches!=None
            if np.any(valid):
                values[missing][valid] = matches[valid]
                info_used.add(name)
                missing = values == None

    warnings.warn(f'Guessed elements from {info_used}')  # one warning only
    if np.any(missing):
        warnings.warn(f'Some elements could not be found and are set to ""')
    return cls(values, guessed=True)

jbarnoud commented 4 years ago

I did not look at the code, but I find the proposal sound. Guessers have always been what frustrates me the most in mdanalysis.

One thing to be aware of is that some contexts will require maintenance and careful versioning. I think especially about a context for the martini coarse grained force field : to guess masses from a gro file, you need your context to know the mapping for all the residues in your universe. Yet, those get updated from time to time. Similar issues will come with united atom force fields.

The context should be as easy as possible to create, extend, and modify without editing the code of mdanalysis.

lilyminium commented 4 years ago

@jbarnoud Thanks for having a look!

One thing to be aware of is that some contexts will require maintenance and careful versioning.

Yes, this is why I think being able to read from a file (preferably straight from a force field file) would be important. In that scenario the user would ideally be to be able to override whatever masses are generated from the parser with:

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro')
>>> u.guess_TopologyAttr('masses', from_attrs=['names', 'resnames'], 
                         context='my_martini')

perhaps prompted by a helpful warning message.

I don't think it would be practical to try to keep up with force field updates.

richardjgowers commented 4 years ago

I like this idea. To play devil's advocate, I'm not keen on adding more lines to load a Universe - part of what the package does is try and hide complexity, ie masses are magically guessed often. But maybe it's cleaner and more correct to not guess anything by default to force users to understand what is from their file and what is derived information.

lilyminium commented 4 years ago

@richardjgowers as masses is a mandatory attribute, parsers would guess it themselves. guess_TopologyAttr just overwrites whatever values previously existed. For magical guessing, if we write the mass guesser to also consider residue names when present:

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro', context='my_martini')

Under the hood, GROParser dumps all the available topology attributes into Masses.guess_from, and Masses.guess_from then chooses which ones to use and how. Probably in order:

-> match elements if valid, else
---> match types if valid, else
-----> match atom names [and resnames if present], else
-------> 0.0

If the user chooses not to load a new context, the result would wind up being the status quo with masses of 0.0 (assuming MARTINI particles aren't in base.)

>>> u = mda.Universe('martini.gro', context='base')

Edit: I agree that masses are important and should be guessed, and this may also apply to elements -- but I think a way to easily guess using different information to the tables in guessers, and to easily overwrite the attribute with a guess that the user manually makes, would be really convenient.

richardjgowers commented 4 years ago

So would it make sense for the context to do the guessing?

ie

context = Context.from_files(...)
masses = context.guess_masses(topology)  # modifies topology in place?

Where the default .guess_masses is just our current function

xiki-tempula commented 4 years ago

MARTINI all have either 72 (which is the case for non-polarizable version) or 36 and 72 for (polarizable version). The particle which has a mass of 36 can be easily picked up by the presence of the name.

for residue in protein.select_atoms('name D').residues:
    if residue.resname == 'LYS':
       residue.atoms.select_atoms('name Qd D').masses = 36
    elif residue.resname == 'THR':
       residue.atoms.select_atoms('name N0 D').masses = 36

jbarnoud commented 4 years ago

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro')
>>> u.guess_TopologyAttr('masses', from_attrs=['names', 'resnames'], 
                         context='my_martini')

I really like that. I would just advocate to ditch the name here, and to refer directly to the object. Just for the sake of reducing complexity.

lilyminium commented 4 years ago

So would it make sense for the context to do the guessing?

It could! I left it as a classmethod because I like x.from(y), but it’s probably easier to subclass and modify as a Context method. However my method allows users to match values that are not in the topology by passing it in themselves, which I thought might matter in some way.

I would just advocate to ditch the name here, and to refer directly to the object. Just for the sake of reducing complexity.

Sure. I would leave the name as an option just because it follows the topology format and because MDAnalysis might like to offer some default contexts.

jbarnoud commented 4 years ago

MARTINI all have either 72 (which is the case for non-polarizable version) or 36 and 72 for (polarizable version). The particle which has a mass of 36 can be easily picked up by the presence of the name.
for residue in protein.select_atoms('name D').residues:
    if residue.resname == 'LYS':
       residue.atoms.select_atoms('name Qd D').masses = 36
    elif residue.resname == 'THR':
       residue.atoms.select_atoms('name N0 D').masses = 36

This is not completely accurate. The "small" beads used (mostly) in rings have a mass of 45 amu. You need to know the topology of the residue to know which beads are small. Added to that, the upcoming version of the force field has 3 bead sizes with different masses and many more molecules. But it is just an example, other cg force fields do have different masses for each particles.

zemanj commented 4 years ago

tl;dr: We should not guess anything by default.

I'd like to take up what @richardjgowers said:

But maybe it's cleaner and more correct to not guess anything by default to force users to understand what is from their file and what is derived information.

In my opinion, this is not a "maybe". It may appear like a nice convenience feature if properties can be guessed automatically. However, there is no scientifically sound way to get around actually providing the correct numbers to obtain reliable data. Results obtained with the help of MDAnalysis are regularly published in scientific papers. Any such result must not rely on educated guesses but on facts. And guessing masses is not "deriving". It is guessing. After all, we cannot know for sure which masses/charges/whatever were used in the simulation unless the user provides this information. Assuring that people do their research properly is certainly not our responsibility as MDAnalysis developers. However, it should be our goal to enable people to do things correctly. In that respect, guessers are a neat feature, but from a scientific standpoint they're rather dangerous. Don't get me wrong, I think we should definitely have smart guessers. I just think they really should be disabled by default, and we should take great care letting the user know that guessed properties must be carefully checked.

MDAnalysis / mdanalysis