fixed list of amino acid three-letter codes may lead to unexpected results

edikedik commented 2 years ago

Hi,

The function filter_amino_acids utilizes a hardcoded list of residues corresponding to the canonical alphabet.

_ext_aa_list = ["ALA","ARG","ASN","ASP","CYS","GLN","GLU","GLY","HIS","ILE",
                "LEU","LYS","MET","PHE","PRO","SER","THR","TRP","TYR","VAL",
                "MSE", "ASX", "GLX", "SEC", "UNK"]

Although it seems sane, it may lead to unexpected results when non-canonical residues reside within the protein sequence. Indeed, in this case, function wrongly makes the protein chain discontinuous likely causing downstream errors.

I see three possible solutions.

Rename the function! For instance, filter_canonical_amino_acids clearly states the functionality.
Just extend the list and allow the user to input a custom one. For instance, residues such as "TPO", "PTR", and "SEP" are phosphorylated threonine, tyrosine, and serine. These are quite common in PDB. There are many other possible covalent modifications, either crafted by authors to improve the molecule's stability or added naturally as part of post-translational modifications. Of course, this makes it notoriously hard to account for all of them, yet it doesn't mean nothing should be done.
Extend/improve the function or create a new one. Although the current documentation is correct, the function may be wrongly perceived as the filter of amino acid residues that belong to a protein chain. Indeed, I suspect you planned it as such, as you rely on it in the filter_backbone. However, there are cases when amino acids are used as ligands. Thus, an extension/improvement, I believe, should rely on both protein continuity and maybe utilize the extended list I suggested above. For instance, an empirical workflow is to break the chain into continuous pieces and treat any residue bounded by residues from the hardcoded list (such as _ext_aa_list) as an amino acid.

Let me know what you think. If you agree that, at least, there is a problem, and we come to some consensus, I could implement it via PR. Thanks.

t0mdavid-m commented 2 years ago

In general I agree with @edikedik.

I would propose to use the _chem_comp.type property from the PDB Chemical Component Dictionary as it is already done in filter_nucleotides(). I would be happy to implement that as I already wrote the relevant scripts for filter_nucleotides().

Furthermore, as @edikedik suggested we could add a filter_canonical_amino_acids() and for completeness filter_canonical_nucleotides().

padix-key commented 2 years ago

This sounds like a good solution to me. Using the _chem_comp.type should give a comprehensive list of amino acids. I am also in favor of the filter_canonical_xxx() versions. @edikedik Would it be OK for you, if @tomtomhdx implements this functionality since, as he wrote, he already has experience in parsing the components.cif?

padix-key commented 2 years ago

3. Indeed, I suspect you planned it as such, as you rely on it in the filter_backbone

@edikedik You are correct. This could be problematic, if e.g. dihedral angles are calculated for a structure containing an amino acid ligand. Do you know whether amino acid ligands use ATOM or HETATM records in PDB entries? Could you give an example PDB ID?

edikedik commented 2 years ago

Would it be OK for you, if @tomtomhdx implements this functionality

Sure thing. To complete the list of supported standard (macromolecular) polymers, @tomtomhdx could also add filter_carbohydrates implemented similarly.

As for the filter_backbone, I believe the valid solution would be to leave it as it is but provide the function discriminating between polymer vs. non-polymer entities — something like filter_polymer_xxx where xxx is the polymer type ("peptide", for instance). Or one could allow providing the polymer type as the argument. This would allow the filtering of backbone atoms that actually belong to the protein chain and solve the issue of amino acid ligands.

@padix-key As for the latter, there are plenty. In general, typing any 3-letter code (ALA, for instance) results, among other things, in a list of entries where the molecule is used as a standalone ligand. Indeed, the ones I viewed used HETATM records.

edikedik commented 2 years ago

I'll throw in an example for the filter_polymer function I suggested above.

import biotite.structure as bst
from functools import partial

def filter_polymer(array, min_size=2, pol_type='peptide'):
    """

    Parameters
    ----------
    array : AtomArray or AtomArrayStack
        The array to filter.
    min_size : int
        The minimum number of monomers.
    pol_type : str
        The polymer type, either "peptide", "nucleotide", or "carbohydrate".
        Abbreviations are supported: "p", "pep", "n", etc.

    Returns
    -------
    filter : ndarray, dtype=bool
        This array is `True` for all indices in `array`, where atoms belong to
        consecutive polymer entity having at least `min_size` monomers.

    """
    split_idx = bst.check_res_id_continuity(array)
    check_pol = partial(_is_polymer, min_size=min_size, pol_type=pol_type)
    bool_idx = map(
        lambda a: np.full(len(a), check_pol(bst.array(a)), dtype=bool),
        np.split(array, split_idx)
    )
    return np.concatenate(list(bool_idx))

def _is_polymer(array, min_size, pol_type):

    if pol_type.startswith('p'):
        filt_fn = bst.filter_amino_acids
    elif pol_type.startswith('n'):
        filt_fn = bst.filter_nucleotides
    elif pol_type.startswith('c'):
        # filt_fn = bst.filter_carbohydrates
        pass
    else:
        raise ValueError(f'Unsupported polymer type {pol_type}')

    mask = filt_fn(array)
    return bst.get_residue_count(array[mask]) >= min_size

padix-key commented 2 years ago

I think filter_polymer() has a good use case, especially if a certain minimum length is required. Could you open a PR? Did you already test this function, especially whether np.split() works for non-ndarray objects?

edikedik commented 2 years ago

I think filter_polymer() has a good use case, especially if a certain minimum length is required. Could you open a PR? Did you already test this function, especially whether np.split() works for non-ndarray objects?

Sure. I tested it on protein molecules, it works as expected, although it would benefit from more rigorous testing. As it depends on other filtering functions discussed above, should I maybe wait until @tomtomhdx implements them? If you aren't planning to add filter_carbohydrates, the waiting is unnecessary.

padix-key commented 2 years ago

I think there is no need to hurry. @tomtomhdx Would you also like to implement filter_carbohydrates()?

Ivan Reveguk @.***> schrieb am Sa., 17. Sep. 2022, 14:40:

I think filter_polymer() has a good use case, especially if a certain minimum length is required. Could you open a PR? Did you already test this function, especially whether np.split() works for non-ndarray objects?

Sure. I tested it on protein molecules, it works as expected, although it would benefit from more rigorous testing. As it depends on other filtering functions discussed above, should I maybe wait until @tomtomhdx https://github.com/tomtomhdx implements them? If you aren't planning to add filter_carbohydrates, the waiting is unnecessary.

— Reply to this email directly, view it on GitHub https://github.com/biotite-dev/biotite/issues/424#issuecomment-1250064432, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGWAS6MBWY5KLT4K3DJNQKLV6W35JANCNFSM6AAAAAAQOKGMPM . You are receiving this because you were mentioned.Message ID: @.***>

t0mdavid-m commented 2 years ago

@tomtomhdx Would you also like to implement filter_carbohydrates()?

Yes will do!

edikedik commented 1 year ago

I've tested new functions, they work as expected. Thank you, @tomtomhdx !

biotite-dev / biotite

fixed list of amino acid three-letter codes may lead to unexpected results #424