a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License
1.02k stars 131 forks

RNA graph construction and KNN representation #109

Closed rg314 closed 2 years ago

rg314 commented 2 years ago

I’ve just started looking at RNA graph construction. Ideally, I’d like to generate a KNN representation of the RNA. This function is currently implemented for proteins by using the graphein.protein.edges.distance.add_k_nn_edges function. In short, the edges for the KNN method are added by:

  1. Compute the distance matrix.
     a. To compute the distance matrix we need to know the x, y, z position of each base (nucleotide) of the RNA.
  2. Compute the k nearest neighbours (using sklearn.neighbors.kneighbors_graph).
  3. Join the interacting nodes calculated in step 2.
  4. Return the graph.
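The four steps above can be sketched in a few lines. This is only an illustration using the Python standard library, not the graphein implementation (which relies on sklearn.neighbors.kneighbors_graph); the function name `knn_edges`, the node labels and the coordinates are all made up for the example.

```python
import math
from typing import Dict, Set, Tuple

Coord = Tuple[float, float, float]


def knn_edges(coords: Dict[str, Coord], k: int) -> Set[Tuple[str, str]]:
    """Return undirected edges linking each node to its k nearest neighbours."""
    nodes = list(coords)
    edges: Set[Tuple[str, str]] = set()
    for u in nodes:
        # Step 1: distances from u to every other node
        # (one row of the distance matrix).
        dists = sorted(
            (math.dist(coords[u], coords[v]), v) for v in nodes if v != u
        )
        # Step 2: keep the k closest nodes.
        for _, v in dists[:k]:
            # Step 3: store each pair once, in sorted order,
            # so the resulting graph is undirected.
            edges.add(tuple(sorted((u, v))))
    # Step 4: return the edge set (a real graph object would be returned here).
    return edges


# Toy coordinates (hypothetical), one point per nucleotide.
coords = {
    "A:G:1": (0.0, 0.0, 0.0),
    "A:C:2": (1.0, 0.0, 0.0),
    "A:A:3": (2.5, 0.0, 0.0),
    "A:U:4": (10.0, 0.0, 0.0),
}
edges = knn_edges(coords, k=1)
```

Note that a node can end up with more than k edges, because being someone else's nearest neighbour also creates an edge.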

At the moment the x, y, z coordinates for protein structures are obtained from a PDB file. This is currently not built for RNA structures. For an RNA sequence we must use the sequence and/or dot-bracket notation to get the 3D structural information.

If the dot-bracket notation is not provided, it can be calculated using the Nussinov algorithm (a DP approach; see https://github.com/cgoliver/Nussinov/blob/master/nussinov.py for a Python implementation). See my implementation at https://github.com/rg314/graphein/blob/rna-model/graphein/rna/nussinov.py

Note that the Nussinov algorithm does not guarantee that the predicted dot-bracket notation is correct. There are several other ways of computing it.
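For readers unfamiliar with the approach, here is a minimal sketch of a textbook Nussinov-style DP that maximises the number of base pairs and traces back a dot-bracket string. It is not the code from either linked repository; allowing G-U wobble pairs and requiring at least 3 unpaired bases in a hairpin loop (`min_loop`) are assumptions made for the example.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (an assumption for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3) -> str:
    """Predict a dot-bracket string by maximising the number of base pairs."""
    n = len(seq)
    # dp[i][j] = maximum number of pairs in seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Trace back one optimal solution.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if left + 1 + dp[k + 1][j - 1] == dp[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


print(nussinov("GGGAAAUCC"))  # (((...)))
```

Because the objective only counts pairs and ignores stacking energies, the result is one of possibly many maximal pairings, which is why the prediction is not guaranteed to match the real structure.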

The PDB database contains some RNA structures (~5233). PandasPdb can be used to read in the PDB file directly. I suggest that the current protein config is adapted to read in the RNA structure from a PDB file. @a-r-j what do you think? I have started to implement this; please see https://github.com/rg314/graphein/blob/35bd2297d28bf09bcf0fb98c10c3866d4be6cb83/graphein/rna/graphs.py#L209 (note that reading in the df is currently failing).

Then we can look at alternative sources for reading in the structure.

For example, it appears that the Xiao lab http://biophy.hust.edu.cn/new/ has a RESTful API to return RNA structure. However, I have not investigated this in detail, nor whether it returns the correct 3D data. This could somewhat mimic the behaviour of graphein.protein.utils.download_alphafold_structure.

Does anyone have an idea of other databases that could be used?

I’m also open to creating a server that can be contacted with a RESTful API to predict RNA structure. However, we would need to figure out the best implementation for structure prediction (and make sure it doesn’t take too long 😉).

a-r-j commented 2 years ago

Hey, thanks for this Ryan! Looks exciting!

So, I think we should keep RNA secondary structure & 3D structure separate for now. The secondary structure is functional as a standalone piece of functionality (though it would be really nice to hook it up to Nussinov or bpRNA - the largest database I know of).
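As a reminder of what the standalone secondary-structure graph encodes, here is a small sketch that turns a dot-bracket string into backbone and base-pair edges. It is an illustration only, not graphein's code; the function name `dotbracket_to_edges` is hypothetical and pseudoknot brackets such as `[` and `]` are deliberately not handled.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def dotbracket_to_edges(dotbracket: str) -> Tuple[List[Edge], List[Edge]]:
    """Return (backbone_edges, base_pair_edges) for a dot-bracket string."""
    # Consecutive nucleotides are joined along the backbone.
    backbone = [(i, i + 1) for i in range(len(dotbracket) - 1)]

    base_pairs: List[Edge] = []
    stack: List[int] = []  # positions of currently unmatched "("
    for i, symbol in enumerate(dotbracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"Unmatched ')' at position {i}")
            base_pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"Unsupported symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"Unmatched '(' at position {stack[-1]}")
    return backbone, base_pairs


backbone, base_pairs = dotbracket_to_edges("((..))")
print(base_pairs)  # [(1, 4), (0, 5)]
```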

With respect to 3D graphs - I had a quick look at this. I think it's actually quite straightforward as most of the components are implemented for protein structure graphs. Essentially, we can use the low-level API in graphein as building blocks and make a function more or less identical to the construct_graphs we use for proteins. The main things I saw so far that need changing:

We need some granularity options for RNA graphs

Then, we simply add a new function convert_structure_to_rna in this block, e.g.:


RNA_ATOMS = [
    "C1'",
    "C2",
    "C2'",
    "C3'",
    "C4",
    "C4'",
    "C5",
    "C5'",
    "C6",
    "C8",
    "N1",
    "N2",
    "N3",
    "N4",
    "N6",
    "N7",
    "N9",
    "O2",
    "O2'",
    "O3'",
    "O4",
    "O4'",
    "O5'",
    "O6",
    "OP1",
    "OP2",
    "P",
]

def subset_structure_to_rna(
    df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Return a subset of the atomic dataframe that contains only the atom names relevant for RNA structures.

    :param df: Structure dataframe to subset
    :type df: pd.DataFrame
    :returns: Subsetted structure dataframe
    :rtype: pd.DataFrame
    """
    return filter_dataframe(
        df, by_column="atom_name", list_of_values=RNA_ATOMS, boolean=True
    )

but more flexible (not keeping RNA_ATOMS fixed, so users can subset as they wish).

The only other line that breaks is this one, and we can easily fix it by removing the three_to_1 call if we're constructing an RNA graph. Then we're essentially good to go. The graph has been populated with the nodes, and we can write whatever edge functions we like to go on top, as per the protein API.

What I'm unfamiliar with is how we coarsen the RNA graphs. E.g. the all-atom graph is what I've described above. For proteins it's obviously very normal to consider the alpha carbon trace as representative of a residue-level graph. I'm not sure what the standard for RNA is. In any case, we can leave this open to users with the granularity param. What do you think?
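To make the granularity idea concrete, here is a rough sketch in plain Python. The name `select_nodes`, the record layout and the toy atoms are hypothetical, and using "P" or "C4'" as the per-nucleotide representative is only an assumption for illustration, since the standard for RNA is exactly what is left open here.

```python
from typing import Dict, List

Atom = Dict[str, object]


def select_nodes(atoms: List[Atom], granularity: str = "atom") -> List[Atom]:
    """Pick the atoms that become graph nodes.

    granularity="atom" keeps every atom (all-atom graph); any other value is
    treated as an atom name and keeps only that atom for each nucleotide.
    """
    if granularity == "atom":
        return list(atoms)
    return [a for a in atoms if a["atom_name"] == granularity]


# Toy records (hypothetical): two nucleotides, three atoms each.
atoms = [
    {"residue_number": 1, "residue_name": "G", "atom_name": "P"},
    {"residue_number": 1, "residue_name": "G", "atom_name": "C4'"},
    {"residue_number": 1, "residue_name": "G", "atom_name": "N9"},
    {"residue_number": 2, "residue_name": "C", "atom_name": "P"},
    {"residue_number": 2, "residue_name": "C", "atom_name": "C4'"},
    {"residue_number": 2, "residue_name": "C", "atom_name": "N1"},
]

all_atom_nodes = select_nodes(atoms)            # 6 nodes
backbone_nodes = select_nodes(atoms, "P")       # 2 nodes, one per nucleotide
```

Leaving the representative atom as a free parameter means users can pick whichever coarsening convention suits their task.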

a-r-j commented 2 years ago

Came across this today: https://www.biorxiv.org/content/10.1101/2022.03.14.484334v1

Might be of interest to you @rg314

rg314 commented 2 years ago

Just to follow up on this... we found that the nussinov.py algo isn't great at predicting the dot-bracket notation. I suggest that we create a container running https://github.com/rg314/centroid-rna-package and ping it to get the centroid secondary structure. What do you think @a-r-j ?

a-r-j commented 2 years ago

Implemented in 1.5.0