LPDI-EPFL / masif

MaSIF - Molecular surface interaction fingerprints. Geometric deep learning to decipher patterns in molecular surfaces.
Apache License 2.0

Extract the residue numbers of a patch #70

Open TobiasPol opened 1 year ago

TobiasPol commented 1 year ago

I am currently working with MaSIF-search and am looking for a way to extract the residue numbers of a patch to validate the results. Does anyone know if this is possible?

celalp commented 11 months ago

Hi @TobiasPol

Any luck? I would be interested in this as well.

TobiasPol commented 11 months ago

Hey @celalp, not yet unfortunately. Do you have any idea for the solution or do you know if it is even possible?

celalp commented 11 months ago

I'm not sure we can with certainty. I have a tiny pdb that I've been using to understand the code in greater depth, and it seems like the reduction in the number of vertices from the original msms output happens during the fixmesh function. There is a whole bunch of joining and collapsing happening during that. This removes the vertex information and creates new vertices and faces, and there is no tracking of what's going on, not just in the code but also from the point of view of pymesh (pymesh doesn't care).
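
For context, fixmesh is built on PyMesh's standard mesh-repair recipe; here is a condensed sketch of the kind of operations involved (the parameters are illustrative, not masif's actual settings):

import pymesh

def simplify(mesh, target_len):
    # drop degenerate faces, then re-tessellate to a target edge length
    mesh, _ = pymesh.remove_degenerated_triangles(mesh, 100)
    mesh, _ = pymesh.split_long_edges(mesh, target_len)
    # collapse_short_edges merges vertices: the output mesh has new vertex
    # indices and coordinates, and PyMesh keeps no map back to the input,
    # which is exactly where the per-vertex residue identities get lost
    mesh, _ = pymesh.collapse_short_edges(mesh, target_len, preserve_feature=True)
    return mesh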

There are a couple of things I'm looking into now:

1) In compute_charges we use the 4 closest charges from the original vertices, found with a KD-tree. This seems a bit backwards in the sense that we had this information, lost it, and are now approximating it because we actually need it. Can we assume that the closest 4 vertices are the actual original vertices from before fixmesh? I don't think so, not all the time (see below).

2) Doing something similar to compute_charges and getting the coords myself. This is, however, done in Euclidean space, not geodesic, if I'm not mistaken, so the same issues as above apply. There is also no guarantee that we will get the correct number; for charges it may not matter because it's a simplification.

3) I looked into whether I can map some of the vertices from the original mesh, and the answer is kinda no, again for the fixmesh reasons above.

4) There are some functions to get patches that are not used in the calculations but are either for pymol or for debugging/sanity checks. If we can get the coords of the patch, we can get the closest atoms and therefore the aas, but this is basically the same as 2 and is not exact. Also, the number of vertices and faces in the old and new meshes are not the same; for visualization this does not matter, but for figuring out the aa it might, I don't really know. It probably depends on how steep the curvature of the surface is. As long as geodesic distance is close to Euclidean, it should be pretty much the same.

I think I'm going to go with 2 and just say hey, these are the closest ones. We are dealing with atom coordinates, and the Cas of aas are pretty far from one another, so I'm hoping it will be right or very close to right.

To clarify, this needs to be done on the vertex coordinates before you convert them to polar. I'm thinking about writing a couple of functions that do the above calculations and save them in a class of sorts; for I/O, I think an hdf5 file per protein/chain, while a bit expensive, might be the simplest to implement.
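
A minimal sketch of that persistence idea, assuming closest_aas is the per-vertex nearest-residue array; the file layout and names here are hypothetical:

import h5py

# Hypothetical HDF5 layout: one file per protein/chain, storing the
# pre-polar vertex coordinates and their nearest residue numbers.
def save_mapping(path, verts, closest_aas):
    with h5py.File(path, "w") as f:
        f.create_dataset("vertices", data=verts)
        f.create_dataset("closest_aas", data=closest_aas)

def load_mapping(path):
    with h5py.File(path, "r") as f:
        return f["vertices"][:], f["closest_aas"][:]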

This is my interpretation of the code, and considering the sparse commenting, don't take this as "legal advice"; if you have any ideas I would love to hear them.

Honestly, all I want to do at the moment is compare protein surfaces. I just want to see if patch x from protein a is similar to patch y from protein b and then get the aas involved. When I first started I was so sure that I would find something simple enough, but I guess the coordinate and rotation invariance requirement makes it challenging.

I'm hesitant to use learned features because I will not be working with human/mouse/even animal proteins for this specific project.

Sorry for the wall of text and rant.

TobiasPol commented 11 months ago

Thank you for your detailed answer! That helps me a lot. I would prefer to have an output containing the two compared patches with their coordinates and the respective residue numbers. Something like this:

| query | match | match_cen | match_patch | query_cen | query_patch |
| -- | -- | -- | -- | -- | -- |
| 7mjr_A | 7ml9_A_152 | [-7.37 76.34 82.97] | {'34', '48', '47', '35', '49', '37', '36', '51'} | [9.6537145 95.1163015 74.047997] | {'504', '512', '518', '519', '520', '503', '516', '517', '506', '505'} |

Unfortunately, I only have this table and not the code behind it. And as you say, the code is very poorly documented.

celalp commented 11 months ago

May I ask where you got this table?

I was thinking more along the lines of getting the closest atom to each vertex on the processed mesh; then it's just a matter of a lookup for each vertex in each patch. I would love to have an actual truth set, even for a handful of cases, to test out the code.
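
Roughly, that lookup step could look like this sketch, assuming we already have a per-vertex residue assignment and a patch is a set of vertex indices into the processed mesh:

import numpy as np

# vertex_to_aa: (n_vertices, k) array of nearest residue numbers per vertex
# patch_vertex_idx: indices of the vertices that make up one patch
def patch_residues(patch_vertex_idx, vertex_to_aa):
    return np.unique(vertex_to_aa[patch_vertex_idx]).astype(int).tolist()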

celalp commented 11 months ago

@TobiasPol Ok, following the authors' lead, here is a small snippet that finds the n closest amino acids for each vertex. If you can verify this using your data I would be grateful.

import numpy as np
from sklearn.neighbors import KDTree

def map_aas(old_verts, new_verts, names, num_inter):
    """
    old_verts: vertices of the initial mesh from computeMSMS
    new_verts: vertices of the output of fixmesh
    names: names1 from the same computeMSMS call
    num_inter: how many closest aas to return

    Returns an ndarray with one row per new_verts vertex and num_inter
    columns; the values are the aa numbers from the computeMSMS output.
    """
    # index the original vertices, then find the closest ones to each new vertex
    kdt = KDTree(old_verts)
    dists, result = kdt.query(new_verts, k=num_inter)
    # computeMSMS names look like "chain_resnum_...", so field 1 is the residue number
    aas = np.array([int(name.split("_")[1]) for name in names])
    closest_aas = np.zeros_like(result)  # same shape as the neighbor indices, int dtype
    for i in range(num_inter):
        closest_aas[:, i] = aas[result[:, i]]

    return closest_aas
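
For reference, a hypothetical call site following masif's data_preparation pipeline; computeMSMS, pymesh.form_mesh and fix_mesh come from the masif/PyMesh sources, and the resolution argument here is an assumption:

# hypothetical usage inside masif's data preparation step
vertices1, faces1, normals1, names1, areas1 = computeMSMS(pdb_filename, protonate=True)
mesh = pymesh.form_mesh(vertices1, faces1)
regular_mesh = fix_mesh(mesh, 1.0)  # assumed resolution in Angstrom
closest_aas = map_aas(vertices1, regular_mesh.vertices, names1, num_inter=4)
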
biochunan commented 11 months ago

I had a similar question when using MaSIF-site: I was trying to map the predicted sites to residues in the input structure. I wrote a script to map vertices with a high probability of being a binding site to nearby residues in the input structure. Not sure if this is helpful with MaSIF-search.

The script prints the mapped residues along with a PyMOL CLI command to select them.

'''
Requires biopandas to parse pdb file
pip install biopandas 

Usage:

python map-highprob-vertices-to-residues.py \
    --job_name 1cz8_V \
    --data_preparation /path/to/masif_output/data_preparation \
    --output /path/to/masif_output/pred_output \
    --prob_thr 0.7 --radius 1.2 

'''
import argparse
import numpy as np
from typing import List
from pathlib import Path
from pprint import pprint
from biopandas.pdb import PandasPdb

def match_atoms(target_coords: np.ndarray, query_coords: np.ndarray, radius: float) -> List[int]:
    """ 
    Find atoms that are within a radius from the high-prob vertices 

    Args:
        target_coords(np.ndarray): shape (N, 3), coordinates of the vertices
        query_coords (np.ndarray): shape (N, 3), coordinates of the atoms 
        radius (float)           : radius in Å cutoff to retrieve atoms close to vertices

    Returns:
        idx (List[int]): indices of the atoms IN THE QUERY_COORDS 
            that are within a radius from the vertices
    """
    from scipy.spatial import cKDTree

    tree = cKDTree(query_coords)  # indexing the atom coordinates 
    # get atoms that are within a radius from the vertices
    idx = tree.query_ball_point(target_coords, r=radius)
    # flatten the list of lists
    idx = [item for sublist in idx for item in sublist]
    # remove duplicates
    idx = list(set(idx))

    return idx 

# input   
def cli():
    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument("--job_name", type=str, help="job name assigned to the inference run, e.g. 1cz8_V")
    parser.add_argument("--data_preparation", type=str, help="path to data_preparation folder")
    parser.add_argument("--output", type=str, help="path to the inference output folder")
    parser.add_argument("--prob_thr", type=float, default=0.7,
                        help="probability threshold to filter vertices")
    parser.add_argument("--radius", type=float, default=1.2,
                        help="radius in Å to retrieve atoms close to vertices")

    args = parser.parse_args()

    return args

if __name__ == "__main__":
    args = cli()
    job_name = args.job_name
    inp_path = Path(args.data_preparation)
    out_path = Path(args.output)

    # --------------------
    # file paths 
    # --------------------
    # pred probabilities of each vertex being part of a binding site
    pred_fp = out_path.joinpath('all_feat_3l', 'pred_data', f'pred_{job_name}.npy')
    # input pdb file to masif-site 
    in_pdb_fp = inp_path.joinpath('01-benchmark_pdbs', f'{job_name}.pdb')

    # --------------------
    # read pred files
    # --------------------
    # Load the predicted surface numpy file. 
    # If running inside a container, this is located by default under `/masif/data/masif_site/output/all_feat_3l/pred_surfaces`
    # e.g. `/masif/data/masif_site/output/all_feat_3l/pred_surfaces/1cz8_V.npy`
    probs = np.load(pred_fp)

    # Load vertex coordinates from precomputed files 
    # If running in the container, this is located under `/masif/data/masif_site/data_preparation/04a-precomputation_9A/precomputation/`
    # e.g. `/masif/data/masif_site/data_preparation/04a-precomputation_9A/precomputation/1cz8_V`
    # - `1cz8_V` is the job name I used for this example.
    precompute_folder = inp_path.joinpath('04a-precomputation_9A', 'precomputation', job_name)

    # We only need the coordinates, other features are not needed in this notebook
    p1_X = np.load(precompute_folder/'p1_X.npy')
    p1_Y = np.load(precompute_folder/'p1_Y.npy')
    p1_Z = np.load(precompute_folder/'p1_Z.npy')
    print(f'p1_X.shape:                {p1_X.shape}')
    print(f'p1_Y.shape:                {p1_Y.shape}')
    print(f'p1_Z.shape:                {p1_Z.shape}')

    # --------------------
    # map high prob vertices to 
    # residues in the structure 
    # --------------------
    # use biopandas to parse the pdb file 
    pdb = PandasPdb().read_pdb(str(in_pdb_fp))  # read_pdb expects a string path

    # convert to dataframe
    atom_df = pdb.df['ATOM']
    # add node_id in the format of [chain_id]:[residue_name]:[residue_number]:[insertion]
    atom_df['node_id'] = atom_df['chain_id'] + ':' + atom_df['residue_name'] + ':' + atom_df['residue_number'].astype(str) + ':' + atom_df['insertion']
    # remove the tailing space and colon in the node_id if insertion is empty
    atom_df['node_id'] = atom_df['node_id'].str.replace(r':\s*$', '', regex=True)

    # --------------------
    # Find atoms close to the 
    # predicted surface vertices
    # --------------------
    # params 
    prob_thr = args.prob_thr  # prob thr to filter vertices 
    radius = args.radius      # radius in Å to retrieve atoms close to vertices 

    # get the coordinates of the vertices (shape N_vertices, 3)
    vertices_coords = np.concatenate([p1_X.reshape(-1, 1), 
                                    p1_Y.reshape(-1, 1), 
                                    p1_Z.reshape(-1, 1)], axis=1)

    # get vertices with probability greater than the threshold, e.g. 0.9;
    # probs is 2D here (shape (1, n_vertices)), so np.where returns
    # (row_idx, col_idx) and we keep the vertex (column) indices
    _, idx = np.where(probs > prob_thr)

    # get the coordinates of the vertices with the probability greater than a threshold
    hp_coords = vertices_coords[idx]

    # atom coordinates from the pdb file 
    atom_coords = atom_df.loc[:, ['x_coord', 'y_coord', 'z_coord']].to_numpy()

    # atom df row indices
    idx = match_atoms(hp_coords, atom_coords, radius)

    # pymol selection string
    resis = atom_df.iloc[idx].drop_duplicates(subset=['node_id']).node_id.map(lambda x: x.split(':')[-1]).unique()
    chain_id = atom_df.chain_id.values[0]
    sel_str = f'select pred, ///{chain_id}/{"+".join(resis)}'
    print('PyMOL selection string (copy paste it in PyMOL CLI):')
    print(sel_str)

    # print residues 
    print('Residue IDs:')
    for i in atom_df.iloc[idx].sort_values('residue_number')['node_id'].to_list():
        print(i)


Output

PyMOL selection string (copy paste it in PyMOL CLI):
select pred, ///V/86+21+29+53+15+81+16+31+24+32+17+83+48+91+18+49+58+59+78+20+27+85+60+79

Residue IDs:
V:VAL:15
V:LYS:16
V:PHE:17
V:MET:18
V:VAL:20
V:TYR:21
V:SER:24
V:HIS:27
V:ILE:29
V:THR:31
V:LEU:32
V:LYS:48
V:PRO:49
V:PRO:53
V:GLY:58
V:GLY:59
V:CYS:60
V:MET:78
V:GLN:79
V:MET:81
V:ILE:83
V:PRO:85
V:HIS:86
V:ILE:91
TobiasPol commented 11 months ago

@celalp and @biochunan thanks for your replies! I will take a look at it and verify the results.

TobiasPol commented 11 months ago

Quick Update: There is a problem with the Python version. I see that Pymesh comes with Python 3.6.11, but I need at least 3.7. Does anyone happen to have a working Dockerfile?

biochunan commented 11 months ago

Mine was directly pulled from Docker Hub according to the tutorial (https://github.com/LPDI-EPFL/masif/blob/master/docker_tutorial.md). I had problems setting up a local environment for the dependencies, so I switched to working from inside the container instead, and it works well for me.

docker pull pablogainza/masif:latest
TobiasPol commented 11 months ago

@biochunan It doesn't work with the Docker container either. I still get this error:

Traceback (most recent call last):
  File "/home/ec2-user/masif_seed/masif/source//masif_site/map_highprob_vertices_to_residues.py", line 19, in <module>
    from biopandas.pdb import PandasPdb
  File "/usr/local/lib/python3.6/site-packages/biopandas/pdb/__init__.py", line 12, in <module>
    from .pandas_pdb import PandasPdb
  File "/usr/local/lib/python3.6/site-packages/biopandas/pdb/pandas_pdb.py", line 8
    from __future__ import annotations
    ^
SyntaxError: future feature annotations is not defined

biochunan commented 11 months ago

Hi @TobiasPol, apologies, I misunderstood your last question. This is indeed a Python version compatibility issue. I was running my script outside the container. What I did was run masif-site using the container and save its outputs to a host directory by mounting it with the '-v' flag, e.g.

WD=$(pwd)         # working directory
pdbid="1a2y"      # pdb code, e.g. 1a2y for specifying the pdb file path
jobname="1a2y_C"  # pdb code with chain ID(s), e.g. 1a2y_C is the antigen chain

docker run --rm \
  -v ${WD}/:/dataset/ \
  -v ${WD}/masif_output/data_preparation:/masif/data/masif_site/data_preparation  \
  -v ${WD}/masif_output/pred_output:/masif/data/masif_site/output  \
  -w /masif/data/masif_site \
  pablogainza/masif \
  /masif/data/masif_site/data_prepare_one.sh  \
  --file /dataset/${pdbid}.pdb ${jobname}

I have attached an example for your reference. You just need to cd into the folder and run run.sh with a Python environment compatible with biopandas etc. (I'm using Python 3.11.4 and biopandas==0.5.1.dev0): example.tar.gz