About the training dataset

peter5842 commented 1 year ago

Hi, @amorehead

While reproducing this work, I ran into some questions about the dataset. I used the ndata["x"] of complex["graph1"] and complex["graph2"] to check the positive labels against the distances, but I got a confusing result: the positive labels derived from the distance map (< 6 Angstroms) are fewer than those in complex["examples"]. So I would like to know: are the ndata["x"] values the bound complex coordinates?

Thanks!

peter5842 commented 1 year ago

Input:

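import torch

# processed_complex: one complex loaded from the processed dataset (loading code not shown)
print(processed_complex["complex"])
g1 = processed_complex["graph1"]
g2 = processed_complex["graph2"]
# Pairwise Euclidean distances between the node coordinates of the two graphs
dist_map = torch.cdist(g1.ndata["x"], g2.ndata["x"], p=2, compute_mode="donot_use_mm_for_euclid_dist")
examples = processed_complex["examples"]
# Node pairs within 6 Angstroms vs. positive labels stored in the dataset
print((dist_map <= 6).sum())
print((examples[:, 2] == 1).sum())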

Output:

1ykp.pdb1
tensor(51)
tensor(381)

peter5842 commented 1 year ago

Oh, sorry for this question! I have another one: is the positive label determined by the atom distance?

amorehead commented 1 year ago

Hi, @peter5842. It depends on which dataset you are referring to. For the DIPS-Plus and CASP-CAPRI datasets, these labels correspond to the labels generated using the bound complex coordinates. For DB5-Plus, these labels correspond to the labels generated using the unbound complex coordinates.

In general, a positive label is determined according to whether two heavy atoms (non-hydrogen atoms belonging to residues in different protein chains) are within 6 Angstroms of each other in the bound version of the protein complex. We use logic from the atom3-py3 library to generate these labels for our datasets (https://github.com/amorehead/atom3/blob/master/atom3/neighbors.py).

I believe one potential reason you are getting a different number of labels from the original is that, by using ndata["x"], you are only computing the distances between pairs of Cα atoms, not between backbone or side chain atoms. I hope this helps.
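
To make the difference concrete, here is a minimal sketch in plain Python. It is not the atom3 implementation: the toy residues, the function names ca_contact and heavy_atom_contact, and the strict "<" comparison are all assumptions made for illustration. It assumes a residue pair is labeled positive when any pair of heavy atoms, one from each residue, lies within the 6 Angstrom cutoff.

```python
import math

CUTOFF = 6.0  # Angstroms; the threshold quoted in this thread

# Hypothetical toy residues from two different chains: atom name -> (x, y, z).
# The coordinates are made up so that the side chains point toward each other.
res_a = {"N": (0.0, 1.4, 0.0), "CA": (0.0, 0.0, 0.0),
         "C": (-1.5, 0.0, 0.0), "CZ": (4.0, 0.0, 0.0)}
res_b = {"N": (10.0, 1.4, 0.0), "CA": (10.0, 0.0, 0.0),
         "C": (11.5, 0.0, 0.0), "CZ": (7.0, 0.0, 0.0)}


def ca_contact(a, b, cutoff=CUTOFF):
    """Contact test using only the C-alpha atoms (what a Ca-only distance map measures)."""
    return math.dist(a["CA"], b["CA"]) < cutoff


def heavy_atom_contact(a, b, cutoff=CUTOFF):
    """Contact test using every heavy-atom pair between the two residues (assumed labeling rule)."""
    return any(math.dist(p, q) < cutoff for p in a.values() for q in b.values())


print(ca_contact(res_a, res_b))          # False: the C-alpha atoms are 10 Angstroms apart
print(heavy_atom_contact(res_a, res_b))  # True: the two CZ atoms are 3 Angstroms apart
```

Under these assumptions, a pair like this counts as positive in the heavy-atom labels but is missed by a Cα-only distance map, which is consistent with seeing fewer Cα-based contacts than stored positive labels.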

peter5842 commented 1 year ago

I'm so grateful for your help with this. I think I'm clear on the dataset now.

BioinfoMachineLearning / DeepInteract

About the training dataset #14