BioinfoMachineLearning / DeepInteract

A geometric deep learning pipeline for predicting protein interface contacts. (ICLR 2022)
https://zenodo.org/record/6671582
GNU General Public License v3.0

[Doc] About pdb files #9

Closed terry-r123 closed 2 years ago

terry-r123 commented 2 years ago

Hi, @amorehead, can you provide the original 32 PDB files for the DIPS-Plus dataset and the 55 PDB files for the DB5 dataset? And how does one process the original PDB files into pdb.dill files for this project?

Thanks!

amorehead commented 2 years ago

Hi, @terry-r123.

Sure thing! In terms of how to process them into .dill graph files, I believe a quick solution for these files could be to use our Docker inference pipeline on each pair of files (e.g., example_l_b.pdb and example_r_b.pdb) to convert them into graph files in your intermediate working directory of choice (e.g., datasets/Input). If you would like to speed things up further (e.g., by using multiple CPU cores in parallel), then I recommend following the traditional setup guide for DeepInteract, as with a traditional installation you can then run our original graph processing script (assuming you are using the same directory structure we list here: https://github.com/BioinfoMachineLearning/DeepInteract#repository-directory-structure).

DIPS-Plus_Chain_Splits.zip DB5_Chain_Splits.zip

terry-r123 commented 2 years ago

Thank you for your answers and suggestions, which are very useful! @amorehead

terry-r123 commented 2 years ago

Hi, @amorehead.

I was also curious whether using each pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb) from a multimer (e.g., example.pdb) to predict their contact map at test time is equivalent to knowing the 3D structure and extrapolating the contact map.

terry-r123 commented 2 years ago

And how does one process the original multimer PDB file (e.g., example.pdb) into a pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb)? @amorehead

amorehead commented 2 years ago

> Hi, @amorehead.
>
> I was also curious whether using each pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb) from a multimer (e.g., example.pdb) to predict their contact map at test time is equivalent to knowing the 3D structure and extrapolating the contact map.

@terry-r123, it depends on whether the input chains being fed into the model are derived directly from a multimer (like you said) or are instead derived from individual tertiary proteins. The former refers to bound interface prediction, which is largely equivalent to contact map extrapolation. However, the latter refers to 'unbound' interface prediction, which can be understood as inferring post-binding contact maps for pairs of proteins that have yet to bind to each other.

amorehead commented 2 years ago

> And how does one process the original multimer PDB file (e.g., example.pdb) into a pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb)? @amorehead

@terry-r123, do you have a specific dataset you are interested in processing into a specific data format? If so, I may be able to clarify details regarding your question further.

terry-r123 commented 2 years ago

> And how does one process the original multimer PDB file (e.g., example.pdb) into a pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb)? @amorehead
>
> @terry-r123, do you have a specific dataset you are interested in processing into a specific data format? If so, I may be able to clarify details regarding your question further.

@amorehead, for example, the DeepHomo test dataset: http://huanglab.phys.hust.edu.cn/deephomo/. Thanks!

terry-r123 commented 2 years ago

> Hi, @amorehead. I was also curious whether using each pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb) from a multimer (e.g., example.pdb) to predict their contact map at test time is equivalent to knowing the 3D structure and extrapolating the contact map.
>
> @terry-r123, it depends on whether the input chains being fed into the model are derived directly from a multimer (like you said) or are instead derived from individual tertiary proteins. The former refers to bound interface prediction, which is largely equivalent to contact map extrapolation. However, the latter refers to 'unbound' interface prediction, which can be understood as inferring post-binding contact maps for pairs of proteins that have yet to bind to each other.

So are the training set and test set in DeepInteract set up for 'unbound' interface prediction? If so, how does one process the original multimer PDB file (e.g., example.pdb) into an unbound pair of PDB files (e.g., example_l_u.pdb and example_r_u.pdb)? @amorehead

amorehead commented 2 years ago

@terry-r123,

For your question about DeepHomo, if I recall correctly, this dataset represents each binary complex as a PDB file containing two chains, A and B. In this case, if you want to split each of these DeepHomo PDB files into separate left-bound and right-bound PDB files for chains A and B, respectively, I would recommend using a PDB manipulation tool such as the excellent pdb-tools. Specifically, I believe their documentation on the pdb_selchain function may be relevant for your situation:

pdb_selchain
Extracts one or more chains from a PDB file.

Usage:
    python pdb_selchain.py -<chain id> <pdb file>

Example:
    python pdb_selchain.py -C 1CTF.pdb  # selects chain C
    python pdb_selchain.py -A,C 1CTF.pdb  # selects chains A and C
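If installing pdb-tools is not an option, a rough equivalent can be sketched in plain Python by filtering coordinate records on the chain-ID column of the fixed-width PDB format. This is only an illustrative sketch, not the project's own tooling; the `_l_b`/`_r_b` suffixes merely mirror the naming convention discussed above, and a dedicated parser is still preferable for real data:

```python
def split_pdb_by_chain(pdb_text, chain_id):
    """Keep only coordinate records belonging to the given chain.

    In the fixed-width PDB format, the chain identifier of
    ATOM/HETATM/TER records lives in column 22 (index 21).
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM", "TER")):
            if len(line) > 21 and line[21] == chain_id:
                kept.append(line)
        elif line.startswith("END"):
            kept.append(line)
    return "\n".join(kept) + "\n"


# Minimal two-chain example (the column positions matter, not the values):
example = (
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00  0.00           N\n"
    "ATOM      2  CA  MET A   1      12.560  13.207   2.100  1.00  0.00           C\n"
    "ATOM      3  N   GLY B   1       1.000   2.000   3.000  1.00  0.00           N\n"
    "END\n"
)
left = split_pdb_by_chain(example, "A")   # would be saved as example_l_b.pdb
right = split_pdb_by_chain(example, "B")  # would be saved as example_r_b.pdb
print(left.count("ATOM"), right.count("ATOM"))  # prints: 2 1
```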

Regarding your question about 'unbound' interface prediction, in our paper, we describe that DeepInteract's primary training dataset is DIPS-Plus, which consists of 'bound' pairs of protein chains for each complex. Nonetheless, we evaluate the models included in our benchmarks for 'unbound' interface prediction using the DB5-Plus dataset to make sure the models are generalizing to the harder case of inferring inter-chain contact points prior to protein-protein docking.

terry-r123 commented 2 years ago

@amorehead ,

Thank you very much for the Q&A and the PDB manipulation tool; I get it now!

terry-r123 commented 2 years ago

> @terry-r123,
>
> For your question about DeepHomo, if I recall correctly, this dataset represents each binary complex as a PDB file containing two chains, A and B. In this case, if you want to split each of these DeepHomo PDB files into separate left-bound and right-bound PDB files for chains A and B, respectively, I would recommend using a PDB manipulation tool such as the excellent pdb-tools. Specifically, I believe their documentation on the pdb_selchain function may be relevant for your situation: […]
>
> Regarding your question about 'unbound' interface prediction, in our paper, we describe that DeepInteract's primary training dataset is DIPS-Plus, which consists of 'bound' pairs of protein chains for each complex. Nonetheless, we evaluate the models included in our benchmarks for 'unbound' interface prediction using the DB5-Plus dataset to make sure the models are generalizing to the harder case of inferring inter-chain contact points prior to protein-protein docking.

When using DB5 for evaluation, how are the ground-truth contact labels obtained? Are they taken directly from the unbound complex?

amorehead commented 2 years ago

@terry-r123, the ground-truth contact labels (following many previous works on interchain contact prediction) are obtained directly from the bound version of each complex. In doing so, our machine learning task in a certain sense becomes one of predicting how two protein chains will alter their structures upon being docked together. In other words, our labels seek to guide the model to learn where two unbound proteins are likely to interact after binding (thus, we derive our ground-truth contact points from the bound version of a complex).
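As an illustration of how such bound-state labels are typically derived, here is a minimal NumPy sketch. The 8 Å distance cutoff between representative residue atoms is a common convention in inter-chain contact prediction and is an assumption here, not necessarily DeepInteract's exact procedure:

```python
import numpy as np


def interchain_contact_map(coords_a, coords_b, cutoff=8.0):
    """Binary contact labels between two chains of a *bound* complex.

    coords_a: (n, 3) representative residue coordinates of chain A
    coords_b: (m, 3) representative residue coordinates of chain B
    Returns an (n, m) 0/1 matrix: 1 where the pairwise distance < cutoff.
    """
    diff = coords_a[:, None, :] - coords_b[None, :, :]  # (n, m, 3)
    dist = np.linalg.norm(diff, axis=-1)                # (n, m)
    return (dist < cutoff).astype(np.int64)


# Toy example: only the first residue pair lies within the assumed 8 A cutoff.
a = np.array([[0.0, 0.0, 0.0], [50.0, 0.0, 0.0]])
b = np.array([[3.0, 0.0, 0.0], [60.0, 0.0, 0.0]])
print(interchain_contact_map(a, b))  # prints: [[1 0]
                                     #          [0 0]]
```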

onlyonewater commented 2 years ago

Oh, @amorehead, can you tell me where the ground truth of the dataset is?

onlyonewater commented 2 years ago

And I found that the CASP-CAPRI-19 processed dataset you provided is empty; can you provide another one?
onlyonewater commented 2 years ago

And when I try to read the CASP-CAPRI-19 raw dataset you provided, this error is shown:

[screenshot: ModuleNotFoundError traceback]
terry-r123 commented 2 years ago

> @terry-r123, the ground-truth contact labels (following many previous works on interchain contact prediction) are obtained directly from the bound version of each complex. In doing so, our machine learning task in a certain sense becomes one of predicting how two protein chains will alter their structures upon being docked together. In other words, our labels seek to guide the model to learn where two unbound proteins are likely to interact after binding (thus, we derive our ground-truth contact points from the bound version of a complex).

What about the DB5 test set? Are its ground-truth contact labels also derived from the bound complex?

terry-r123 commented 2 years ago

@amorehead, and when I process the 'raw' .dill files into 'processed' .dill files, this error is shown:

```
Using backend: pytorch
Global seed set to 42
datasets/CASP_CAPRI/final/raw/pairs-postprocessed-test.txt
False
0
0 cp/6cp8.pdb_0.dill
1 cw/7cwp.pdb_0.dill
2 d2/6d2v.pdb_0.dill
3 d7/6d7y.pdb_0.dill
4 e4/6e4b.pdb_0.dill
5 fx/6fxa.pdb_0.dill
6 hr/6hrh.pdb_0.dill
7 m5/7m5f.pdb_0.dill
8 mx/6mxv.pdb_0.dill
9 n6/6n64.pdb_0.dill
10 n9/6n91.pdb_0.dill
11 nq/6nq1.pdb_0.dill
12 qe/6qek.pdb_0.dill
13 tr/6tri.pdb_0.dill
14 ub/6ubl.pdb_0.dill
15 uk/6uk5.pdb_0.dill
16 w6/5w6l.pdb_0.dill
17 xo/6xod.pdb_0.dill
18 ya/6ya2.pdb_0.dill
project/datasets/CASP_CAPRI/final/processed/cp/6cp8.pdb_0.dill
Traceback (most recent call last):
  File "lit_model_test.py", line 182, in <module>
    main(args)
  File "lit_model_test.py", line 46, in main
    picp_data_module.setup()
  File "/home/user/miniconda/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 428, in wrapped_fn
    fn(*args, **kwargs)
  File "/ryc/DeepInteract/project/datasets/PICP/picp_dgl_data_module.py", line 98, in setup
    self.casp_capri_test = CASPCAPRIDGLDataset(mode='test', raw_dir=self.casp_capri_data_dir, knn=self.knn,
  File "/ryc/DeepInteract/project/datasets/CASP_CAPRI/casp_capri_dgl_dataset.py", line 120, in __init__
    self.process()
  File "/ryc/DeepInteract/project/datasets/CASP_CAPRI/casp_capri_dgl_dataset.py", line 165, in process
    process_complex_into_dict(raw_filepath, processed_filepath, self.knn,
  File "/ryc/DeepInteract/project/utils/deepinteract_utils.py", line 935, in process_complex_into_dict
    graph1 = convert_df_to_dgl_graph(all_atom_df0, raw_filepath, knn, geo_nbrhd_size, self_loops)
  File "/ryc/DeepInteract/project/utils/deepinteract_utils.py", line 469, in convert_df_to_dgl_graph
    edges_transformed = edges[1].reshape(1, len(struct_df), knn)
RuntimeError: shape '[1, 157, 20]' is invalid for input of size 2983
```
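For what it's worth, the final RuntimeError is a plain shape mismatch: the reshape assumes the k-NN graph returned exactly `len(struct_df) * knn` edges (157 × 20 = 3140), but only 2983 came back, which can happen when a structure has too few neighbors for the requested k. A minimal reproduction of the failing check, using NumPy and synthetic data (the variable names mirror the traceback, but none of the project's actual tensors are involved):

```python
import numpy as np

n_nodes, knn = 157, 20      # values taken from the traceback above
edges = np.arange(2983)     # synthetic stand-in for the edge array

# The library call assumes a *complete* k-NN result of n_nodes * knn edges:
assert n_nodes * knn == 3140

try:
    edges.reshape(1, n_nodes, knn)
except ValueError as err:   # torch raises RuntimeError; NumPy raises ValueError
    print(err)

# A defensive pre-check would surface the real problem before the reshape:
if edges.size != n_nodes * knn:
    print(f"incomplete k-NN graph: {edges.size} edges != {n_nodes * knn}")
```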

amorehead commented 2 years ago

> @terry-r123, the ground-truth contact labels (following many previous works on interchain contact prediction) are obtained directly from the bound version of each complex. In doing so, our machine learning task in a certain sense becomes one of predicting how two protein chains will alter their structures upon being docked together. In other words, our labels seek to guide the model to learn where two unbound proteins are likely to interact after binding (thus, we derive our ground-truth contact points from the bound version of a complex).
>
> What about the DB5 test set? Are its ground-truth contact labels also derived from the bound complex?

@terry-r123, yes, the ground-truth contact labels for the DB5 test dataset are derived from bound complexes.

onlyonewater commented 2 years ago

@amorehead, I found the CASP-CAPRI-19 processed dataset is empty; can you upload a new link for downloading it? And I found that when reading the CASP-CAPRI-19 raw dataset, an error is shown (see the screenshot above).

amorehead commented 2 years ago

@onlyonewater,

I will look into the empty processed dataset issue you mentioned. In the meantime, the ModuleNotFoundError you see here most likely comes from not having atom3-py3 installed in your Conda environment. I recommend running `pip3 install atom3-py3==0.1.9.8` to resolve it.

onlyonewater commented 2 years ago

OK, thanks! And does your raw dataset contain the ground truth for the contact map?

onlyonewater commented 2 years ago

OK, now I can load the CASP-CAPRI-19 raw dataset successfully! And how can I get the ground-truth labels for the CASP-CAPRI-19 dataset?

onlyonewater commented 2 years ago

OK, now I know how to get the ground-truth labels from the raw data.

terry-r123 commented 2 years ago

> @terry-r123, the ground-truth contact labels (following many previous works on interchain contact prediction) are obtained directly from the bound version of each complex. In doing so, our machine learning task in a certain sense becomes one of predicting how two protein chains will alter their structures upon being docked together. In other words, our labels seek to guide the model to learn where two unbound proteins are likely to interact after binding (thus, we derive our ground-truth contact points from the bound version of a complex).
>
> What about the DB5 test set? Are its ground-truth contact labels also derived from the bound complex?
>
> @terry-r123, yes, the ground-truth contact labels for the DB5 test dataset are derived from bound complexes.

Thanks!!!

terry-r123 commented 2 years ago

> @amorehead, and when I process the 'raw' .dill files into 'processed' .dill files, this error is shown: […]

I have now successfully processed the raw .dill files into processed .dill files!

amorehead commented 2 years ago

@terry-r123 and @onlyonewater,

Currently, my availability to debug these errors is more limited than I would like. I will do my best to return to your questions as soon as I can.

terry-r123 commented 2 years ago

> @terry-r123 and @onlyonewater,
>
> Currently, my availability to debug these errors is more limited than I would like. I will do my best to return to your questions as soon as I can.

Thanks!!!

amorehead commented 2 years ago

@terry-r123 and @onlyonewater,

I just finished reprocessing the CASP-CAPRI proteins into Python dictionary files which contain the DGL graphs necessary for model inference/testing. Somehow the original copies of these files were silently corrupted on my storage backups. You can find the direct download link for these processed dictionary files on Zenodo. Let me know if you have any questions about these new files.

terry-r123 commented 2 years ago

> @terry-r123 and @onlyonewater,
>
> I just finished reprocessing the CASP-CAPRI proteins into Python dictionary files which contain the DGL graphs necessary for model inference/testing. Somehow the original copies of these files were silently corrupted on my storage backups. You can find the direct download link for these processed dictionary files on Zenodo. Let me know if you have any questions about these new files.

Thanks for your reply and work!