LPDI-EPFL / masif

MaSIF- Molecular surface interaction fingerprints. Geometric deep learning to decipher patterns in molecular surfaces.
Apache License 2.0
571 stars 151 forks source link

How can i docking two protein monomer by masif-search? #35

Open wenyuhaokikika opened 3 years ago

wenyuhaokikika commented 3 years ago

I want to docking two proteins. I just have two momomer,not Homopolymer or Heteromer.

my questions of masif-search pipeline

pipeline

run masif-site

Firstly,i run masif-site ./data_prepare_one.sh 2MWS_B,and then I run ./predict_site.sh 2MWS_B to predict sites,and i got four folds

00-raw_pdbs  01-benchmark_pdbs  01-benchmark_surfaces  04a-precomputation_9A

it run very well.

prepare data for ppi search

And Next run ppi searchcd ../masif_ppi_search,it need format PDBID_CHAIN1_CHAIN2,but i just have two momomer,not Homopolymer or Heteromer.the command /masif/data/masif_ppi_search/data_prepare_one.sh 3AXY_Bdoesn't work. image

and i down and extract pdb manually,move they into data_preparation/00-raw_pdbs/

root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/00-raw_pdbs/
1A2K.pdb  5JYL.pdb
root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/01-benchmark_pdbs/
1A2K_A.pdb  1A2K_C.pdb  5JYL_A.pdb  5JYL_B.pdb

next i run:

masif_root=$(git rev-parse --show-toplevel)
masif_source=$masif_root/source/
export PYTHONPATH=$PYTHONPATH:$masif_source
PDB_ID='1A2K'
CHAIN1='C'
CHAIN2='A'
# Load your environment here.
python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py $PDB_ID\_$CHAIN1
python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py $PDB_ID\_$CHAIN2
python $masif_source/data_preparation/04-masif_precompute.py masif_site $PDB_ID\_$CHAIN1
python $masif_source/data_preparation/04-masif_precompute.py masif_site $PDB_ID\_$CHAIN2
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search $PDB_ID\_$CHAIN1
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search $PDB_ID\_$CHAIN2

get the output

root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/04a-precomputation_9A/precomputation/1A2K_A/
p1_X.npy  p1_Z.npy             p1_input_feat.npy    p1_mask.npy            p1_theta_wrt_center.npy
p1_Y.npy  p1_iface_labels.npy  p1_list_indices.npy  p1_rho_wrt_center.npy
root@eb92233498e0:/masif/data/masif_ppi_search# ls data_preparation/04a-precomputation_9A/precomputation/1A2K_C
p1_X.npy  p1_Z.npy             p1_input_feat.npy    p1_mask.npy            p1_theta_wrt_center.npy
p1_Y.npy  p1_iface_labels.npy  p1_list_indices.npy  p1_rho_wrt_center.npy

and then calculate the description

./compute_descriptors.sh $PDB_ID\_$CHAIN1
./compute_descriptors.sh $PDB_ID\_$CHAIN2

and get the output

root@eb92233498e0:/masif/data/masif_ppi_search# ls descriptors/sc05/all_feat/1A2K_C
p1_desc_flipped.npy  p1_desc_straight.npy
root@eb92233498e0:/masif/data/masif_ppi_search# ls descriptors/sc05/all_feat/1A2K_A
p1_desc_flipped.npy  p1_desc_straight.npy

it doesn't raise any Exception just some warning. for another protein,wo also do so.

python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py 5JYL_A
python $masif_source/data_preparation/04-masif_precompute.py masif_site 5JYL_A
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search 5JYL_A
./compute_descriptors.sh 5JYL_A

python $masif_source/data_preparation/01-pdb_extract_and_triangulate.py 5JYL_B
python $masif_source/data_preparation/04-masif_precompute.py masif_site 5JYL_B
python $masif_source/data_preparation/04-masif_precompute.py masif_ppi_search 5JYL_B
./compute_descriptors.sh 5JYL_B

nohup sh ./data_prepare_one.sh  5JYL_A_B &
nohup sh ./data_prepare_one.sh 1A2K_C_A &
nohup sh ./compute_descriptors.sh 5JYL_A_B &
nohup sh ./compute_descriptors.sh 1A2K_C_A &

i see the py file/masif/source/masif_ppi_search/second_stage_alignment_nn.py it need some necessary setting variables.

masif_opts = {}
masif_opts["pdb_chain_dir"] = "data_preparation/01-benchmark_pdbs/"#* 每个蛋白对应的链
masif_opts["ply_chain_dir"] = "data_preparation/01-benchmark_surfaces/"#* 蛋白对应的表面 用site计算出来的
masif_opts["ppi_search"]={}
masif_opts["ppi_search"][
    "masif_precomputation_dir"
] = "data_preparation/04b-precomputation_12A/precomputation/"
masif_opts["ppi_search"]["desc_dir"] = "descriptors/sc05/all_feat/"#*  这里通过compute_description.sh计算得出
masif_opts["ppi_search"]["gif_descriptors_out"] = "gif_descriptors/"#方法gif才需要,masif不需要 空的目录文件夹
masif_opts["site"]={}
masif_opts["site"][
    "masif_precomputation_dir"
] = "data_preparation/04a-precomputation_9A/precomputation/"

it is global variable. define my protein list:

root@eb92233498e0:/masif/data/masif_ppi_search# for i in $(ls data_preparation/01-benchmark_pdbs);do echo ${i/.pdb/}; done > lists/mylist.txt
root@eb92233498e0:/masif/data/masif_ppi_search# cat lists/mylist.txt
1A2K_A
1A2K_C
5JYL_A
5JYL_B
import os
import numpy as np
# Location of surface (ply) files. 
data_dir='/masif/data/masif_ppi_search'
surf_dir = os.path.join(data_dir, masif_opts["ply_chain_dir"])
desc_dir = os.path.join(data_dir, masif_opts["ppi_search"]["desc_dir"])
pdb_dir = os.path.join(data_dir, masif_opts["pdb_chain_dir"])
precomp_dir = os.path.join(
    data_dir, masif_opts["ppi_search"]["masif_precomputation_dir"]
)
precomp_dir_9A = os.path.join(
    data_dir, masif_opts["site"]["masif_precomputation_dir"]
)
benchmark_list=os.path.join(data_dir, 'lists','mylist.txt')
pdb_list = open(benchmark_list).readlines()[0:100]
pdb_list = [x.rstrip() for x in pdb_list]
# Read all surfaces.
all_pc = []
all_desc = []

rand_list = np.copy(pdb_list)
#np.random.seed(0)
np.random.shuffle(rand_list)
rand_list = rand_list[0:100]

p2_descriptors_straight = []
p2_point_clouds = []
p2_patch_coords = []
p2_names = []

lack file p1_sc_labels.npy,

root@eb92233498e0:/masif/source/masif_ppi_search# ls /masif/data/masif_ppi_search/data_preparation/04b-precomputation_12A/precomputation/5JYL_B/
p1_X.npy  p1_Z.npy             p1_input_feat.npy    p1_mask.npy            p1_theta_wrt_center.npy
p1_Y.npy  p1_iface_labels.npy  p1_list_indices.npy  p1_rho_wrt_center.npy
root@eb92233498e0:/masif/source/masif_ppi_search# ls /masif/data/masif_ppi_search/data_preparation/04b-precomputation_12A/precomputation/5JYL_A_B/
p1_X.npy             p1_input_feat.npy      p1_sc_labels.npy         p2_Z.npy             p2_mask.npy
p1_Y.npy             p1_list_indices.npy    p1_theta_wrt_center.npy  p2_iface_labels.npy  p2_rho_wrt_center.npy
p1_Z.npy             p1_mask.npy            p2_X.npy                 p2_input_feat.npy    p2_sc_labels.npy
p1_iface_labels.npy  p1_rho_wrt_center.npy  p2_Y.npy                 p2_list_indices.npy  p2_theta_wrt_center.npy

i move file but i don't know whether it is right

cd /masif/data/masif_ppi_search/data_preparation/04b-precomputation_12A/precomputation
cp 1A2K_C_A/p1_sc_labels.npy  1A2K_C/p1_sc_labels.npy
cp 1A2K_C_A/p2_sc_labels.npy  1A2K_A/p1_sc_labels.npy
cp 5JYL_A_B/p1_sc_labels.npy 5JYL_A/p1_sc_labels.npy
cp 5JYL_A_B/p2_sc_labels.npy 5JYL_B/p1_sc_labels.npy

move model to workplace

cd /masif/source/masif_ppi_search/
cp -r /masif/comparison/masif_ppi_search/masif_descriptors_nn/models .

run docking by masif-search

cd /masif/source/masif_ppi_search; touch a new python scripttouch docking.py

import scipy.sparse as spio
import copy
from Bio.PDB import *
from scipy.spatial import cKDTree
from transformation_training_data.score_nn import ScoreNN
from alignment_utils_masif_search import compute_nn_score, rand_rotation_matrix, \
        get_center_and_random_rotate, get_patch_geo, multidock, test_alignments, \
       subsample_patch_coords
import time
import sklearn.metrics
masif_opts = {}
masif_opts["pdb_chain_dir"] = "data_preparation/01-benchmark_pdbs/"#* 每个蛋白对应的链
masif_opts["ply_chain_dir"] = "data_preparation/01-benchmark_surfaces/"#* 蛋白对应的表面 用site计算出来的
masif_opts["ppi_search"]={}
masif_opts["ppi_search"][
    "masif_precomputation_dir"
] = "data_preparation/04b-precomputation_12A/precomputation/"
masif_opts["ppi_search"]["desc_dir"] = "descriptors/sc05/all_feat/"#*  这里通过compute_description.sh计算得出
masif_opts["ppi_search"]["gif_descriptors_out"] = "gif_descriptors/"#方法gif才需要,masif不需要 空的目录文件夹
masif_opts["site"]={}
masif_opts["site"][
    "masif_precomputation_dir"
] = "data_preparation/04a-precomputation_9A/precomputation/"
nn_model = ScoreNN()
import os
import numpy as np
data_dir='/masif/data/masif_ppi_search'
surf_dir = os.path.join(data_dir, masif_opts["ply_chain_dir"])
desc_dir = os.path.join(data_dir, masif_opts["ppi_search"]["desc_dir"])
pdb_dir = os.path.join(data_dir, masif_opts["pdb_chain_dir"])
precomp_dir = os.path.join(
    data_dir, masif_opts["ppi_search"]["masif_precomputation_dir"]
)
precomp_dir_9A = os.path.join(
    data_dir, masif_opts["site"]["masif_precomputation_dir"]
)
benchmark_list=os.path.join(data_dir, 'lists','mylist.txt')
pdb_list = open(benchmark_list).readlines()[0:100]
pdb_list = [x.rstrip() for x in pdb_list]
# Read all surfaces.
all_pc = []
all_desc = []

rand_list = np.copy(pdb_list)
#np.random.seed(0)
np.random.shuffle(rand_list)
rand_list = rand_list[0:100]

p2_descriptors_straight = []
p2_point_clouds = []
p2_patch_coords = []
p2_names = []

from geometry.open3d_import import *
for i, pdb in enumerate(rand_list):
    print("Loading patch coordinates for {}".format(pdb))
    pdb_id = pdb.split("_")[0]
    chains = pdb.split("_")[1]
    # Descriptors for global matching.
    p2_descriptors_straight.append(
        np.load(os.path.join(desc_dir, pdb, "p1_desc_straight.npy"))
    )
    p2_point_clouds.append(
        read_point_cloud(
            os.path.join(surf_dir, "{}.ply".format(pdb_id + "_" + chains))
        )
    )
    pc = subsample_patch_coords(pdb, "p1", precomp_dir_9A)
    p2_patch_coords.append(pc)
    p2_names.append(pdb)

all_positive_scores = []
all_positive_rmsd = []
all_negative_scores = []
# Match all descriptors.
count_found = 0
all_rankings_desc = []

# Now go through each target (p1 in every case) and dock each 'decoy' binder to it. 
# The target will have flipped (inverted) descriptors.
K=30
ransac_iter=100
ttf=[]
for target_ix, target_pdb in enumerate(rand_list):
    target_pdb_id = target_pdb.split("_")[0]
    chains = target_pdb.split("_")[1]
    # Load target descriptors for global matching.
    target_desc = np.load(os.path.join(desc_dir, target_pdb, "p1_desc_flipped.npy"))
    # Load target point cloud
    target_pc = os.path.join(surf_dir, "{}.ply".format(target_pdb_id + "_" + chains))
    target_pcd = read_point_cloud(target_pc)
    # Read the point with the highest shape compl.
    sc_labels = np.load(os.path.join(precomp_dir, target_pdb, "p1_sc_labels.npy"))
    center_point = np.argmax(np.median(np.nan_to_num(sc_labels[0]), axis=1))
    # Go through each source descriptor, find the top descriptors, store id+pdb
    num_negs = 0
    all_desc_dists = []
    all_pdb_id = []
    all_vix = []
    gt_dists = []
    # This is where the desriptors are actually compared (stage 1 of the MaSIF-search protocol)
    for source_ix, source_pdb in enumerate(rand_list):
        source_desc = p2_descriptors_straight[source_ix]
        desc_dists = np.linalg.norm(source_desc - target_desc[center_point], axis=1)
        all_desc_dists.append(desc_dists)
        all_pdb_id.append([source_pdb] * len(desc_dists))
        all_vix.append(np.arange(len(desc_dists)))
        if source_pdb == target_pdb:
            source_pcd = p2_point_clouds[source_ix]
            eucl_dists = np.linalg.norm(
                np.asarray(source_pcd.points)
                - np.asarray(target_pcd.points)[center_point, :],
                axis=1,
            )
            eucl_closest = np.argsort(eucl_dists)
            gt_dists = desc_dists[eucl_closest[0:50]]
            gt_count = len(source_desc)
    all_desc_dists = np.concatenate(all_desc_dists, axis=0)
    all_pdb_id = np.concatenate(all_pdb_id, axis=0)
    all_vix = np.concatenate(all_vix, axis=0)
    ranking = np.argsort(all_desc_dists)
    # Load target geodesic distances.
    target_coord = subsample_patch_coords(target_pdb, "p1", precomp_dir_9A, [center_point])
    # Get the geodesic patch and descriptor patch for the target.
    target_patch, target_patch_descs = get_patch_geo(
        target_pcd, target_coord, center_point, target_desc, flip=True
    )
    # Make a ckdtree with the target.
    target_ckdtree = cKDTree(target_patch.points)
    ## Load the structures of the target and the source (to get the ground truth).
    parser = PDBParser()
    target_struct = parser.get_structure(
        "{}_{}".format(target_pdb_id, chains[0]),
        os.path.join(pdb_dir, "{}_{}.pdb".format(target_pdb_id, chains)),
    )
    #gt_source_struct = parser.get_structure(
    #    "{}_{}".format(target_pdb_id, chains[1]),
    #    os.path.join(pdb_dir, "{}_{}.pdb".format(target_pdb_id, chains[1])),
    #)
    # Get coordinates of atoms for the ground truth and target.
    target_atom_coords = [atom.get_coord() for atom in target_struct.get_atoms()]
    target_ca_coords = [
        atom.get_coord() for atom in target_struct.get_atoms() if atom.get_id() == "CA"
    ]
    target_atom_coord_pcd = PointCloud()
    target_ca_coord_pcd = PointCloud()
    target_atom_coord_pcd.points = Vector3dVector(np.array(target_atom_coords))
    target_ca_coord_pcd.points = Vector3dVector(np.array(target_ca_coords))
    target_atom_pcd_tree = KDTreeFlann(target_atom_coord_pcd)
    target_ca_pcd_tree = KDTreeFlann(target_ca_coord_pcd)
    found = False
    myrank_desc = float("inf")
    chosen_top = ranking[0:K]
    pos_scores = []
    pos_rmsd = []
    neg_scores = []
    # This is where the matched descriptors are actually aligned.
    for source_ix, source_pdb in enumerate(rand_list):
        viii = chosen_top[np.where(all_pdb_id[chosen_top] == source_pdb)[0]]
        source_vix = all_vix[viii]
        if len(source_vix) == 0:
            continue
        source_desc = p2_descriptors_straight[source_ix]
        source_pcd = copy.deepcopy(p2_point_clouds[source_ix])
        source_coords = p2_patch_coords[source_ix]
        # Randomly rotate and translate.
        random_transformation = get_center_and_random_rotate(source_pcd)
        source_pcd.transform(random_transformation)
        # Dock and score each matched patch. 
        #print({'source_pcd':source_pcd,'source_coords':source_coords,'source_desc':source_desc,'source_vix':source_vix\
        #,'target_patch':target_patch,'target_patch_descs':target_patch_descs,'target_ckdtree':target_ckdtree,'ransac_iter':ransac_iter})
        if source_pdb!=target_pdb:#same structure does not need docking
            all_results, all_source_patch, all_source_scores = multidock(
                source_pcd,
                source_coords,
                source_desc,
                source_vix,
                target_patch,
                target_patch_descs,
                target_ckdtree,
                nn_model, 
                ransac_iter=ransac_iter
            )
            res={'target_pdb':target_pdb,'source_pdb':source_pdb,'all_results':all_results,\
            'all_source_patch':all_source_patch,'all_source_scores':all_source_scores}
            ttf.append(res)
            #ttf 返回的值代表好几种对接方式,通过指纹搜索来确定种类

see ttf[0]

>>> ttf[0]
{'target_pdb': '1A2K_C', 'source_pdb': '5JYL_B', 'all_results': [registration::RegistrationResult with fitness=0.000000e+00, inlier_rmse=0.000000e+00, and correspondence_set size of 0
Access transformation to get result., registration::RegistrationResult with fitness=0.000000e+00, inlier_rmse=0.000000e+00, and correspondence_set size of 0
Access transformation to get result., registration::RegistrationResult with fitness=2.600000e-01, inlier_rmse=6.074436e-01, and correspondence_set size of 26
Access transformation to get result., registration::RegistrationResult with fitness=0.000000e+00, inlier_rmse=0.000000e+00, and correspondence_set size of 0
Access transformation to get result.], 'all_source_patch': [geometry::PointCloud with 100 points., geometry::PointCloud with 100 points., geometry::PointCloud with 100 points., geometry::PointCloud with 100 points.], 'all_source_scores': [0.0014200918, 0.001413941, 2.9317687e-05, 0.001454716]}

all_results[0]is <open3d.open3d.registration.RegistrationResult>object,fitnessmeans correspondence point,if it eq 0, the structure is bad. but how can i transform from RegistrationResult and PointCloud to pdb format.I want to get the dockered structure. I am Novice for docking,please help me. Thank you.

jh4yd commented 2 years ago

I don't think masif gives the docked structure. only the potential docking sites.