MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.3k stars 648 forks source link

Implement tagged smarts matching #3348

Open lilyminium opened 3 years ago

lilyminium commented 3 years ago

Is your feature request related to a problem?

OpenForceField has a really nice function called chemical_environment_matches that returns only the tagged atoms in a SMARTS substructure search. e.g. If you passed '[#6]1:[#6]:[#6:4]:[#6:3]:[#7:2]:[#6:1]:1', it would search the entire substructure but only return the tagged atoms, and ordered by the tag number.

Describe the solution you'd like

Instead of going openff.Molecule.from_rdkit(atoms.convert_to("RDKIT")).chemical_environment_matches("my smarts query"), it would be super handy to implement it directly in MDAnalysis. The way OpenFF does it is just by looking at the GetAtomMapNum()s of the smarts molecule; it looks fairly straightforward. Their code is below (MIT license):

# Set up query.
        qmol = Chem.MolFromSmarts(smirks)  # cannot catch the error
        if qmol is None:
            raise ValueError(
                'RDKit could not parse the SMIRKS string "{}"'.format(smirks)
            )

        # Create atom mapping for query molecule
        idx_map = dict()
        for atom in qmol.GetAtoms():
            smirks_index = atom.GetAtomMapNum()
            if smirks_index != 0:
                idx_map[smirks_index - 1] = atom.GetIdx()
        map_list = [idx_map[x] for x in sorted(idx_map)]

        # Perform matching
        matches = list()

        # choose the largest unsigned int without overflow
        # since the C++ signature is a uint
        max_matches = np.iinfo(np.uintc).max
        for match in rdmol.GetSubstructMatches(
            qmol, uniquify=False, maxMatches=max_matches, useChirality=True
        ):
            mas = [match[x] for x in map_list]
            matches.append(tuple(mas))

        return matches

Describe alternatives you've considered

Don't do that.

Additional context

IAlibay commented 3 years ago

I might be misunderstanding, but can't we already do something similar through select_atoms?

lilyminium commented 3 years ago

select_atoms doesn't return ordered atoms, and I think that the current SMARTS matching will return all 6 atoms in the '[#6]1:[#6]:[#6:4]:[#6:3]:[#7:2]:[#6:1]:1' ordered by index, rather than only the 4 tagged atoms in the order of tagging.

IAlibay commented 3 years ago

I see, do we have an ordering convention for select_atoms? In some ways I'd prefer we first make sure we fulfill one way of selection matching than start branching to other ones.

I'm not sure I'd have to use this feature enough to have a strong opinion for it. If the potential user base is large I'm happy for it to be implemented as long as it doesn't change expected defaults.

richardjgowers commented 3 years ago

I'm not sure we ever give a promise on the ordering from select_atoms, but our use of np.unique sorts the indices as a side-effect.