Closed · terry-r123 closed this issue 2 years ago
Hi, @terry-r123.
Sure thing! In terms of how to process them into .dill graph files, a quick solution for these files would be to use our Docker inference pipeline on each pair of files (e.g., example_l_b.pdb and example_r_b.pdb) to convert them into graph files in an intermediate working directory of your choice (e.g., datasets/Input). If you would like to speed things up even further (e.g., by using multiple CPU cores in parallel), then I recommend following the traditional setup guide for DeepInteract, since with a traditional installation you can run our original graph processing script (assuming you use the same directory structure we list here: https://github.com/BioinfoMachineLearning/DeepInteract#repository-directory-structure).
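As a rough sketch, enumerating the left/right file pairs for batch conversion could look like the following (a minimal POSIX-shell sketch; only the `*_l_b.pdb`/`*_r_b.pdb` naming convention from above is assumed, and the actual conversion command should be substituted from the DeepInteract README rather than taken from here):

```shell
# Sketch: list (left, right) bound-chain pairs under a given directory by
# the example_l_b.pdb / example_r_b.pdb naming convention. Each printed pair
# would then be fed to the Docker inference pipeline (command not assumed here).
list_pairs() {
    dir="$1"
    for left in "$dir"/*_l_b.pdb; do
        [ -e "$left" ] || continue          # glob matched nothing
        right="${left%_l_b.pdb}_r_b.pdb"    # partner file by naming convention
        [ -f "$right" ] || continue         # skip left files without a partner
        printf '%s %s\n' "$left" "$right"
    done
}

# Example: list_pairs datasets/Input
```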
Thank you for your answers and suggestions, which are very useful! @amorehead
Hi, @amorehead.
I was also curious whether using each pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb) from a multimer (e.g., example.pdb) to predict their contact map at test time is equivalent to knowing the 3D structure and extrapolating the contact map.
And how does one process the original multimer PDB file (e.g., example.pdb) into a pair of PDB files (e.g., example_l_b.pdb and example_r_b.pdb)? @amorehead
@terry-r123, it depends on whether the input chains being fed into the model are derived directly from a multimer (like you said) or are instead derived from individual tertiary proteins. The former refers to 'bound' interface prediction, which is largely equivalent to contact map extrapolation. However, the latter refers to 'unbound' interface prediction, which can be understood as inferring post-binding contact maps for pairs of proteins that have yet to bind to each other.
@terry-r123, do you have a specific dataset you are interested in processing into a specific data format? If so, I may be able to clarify details regarding your question further.
@amorehead, for example, the DeepHomo test dataset: http://huanglab.phys.hust.edu.cn/deephomo/. Thanks!
So are the training set and test set in DeepInteract set up for 'unbound' interface prediction? If so, how does one process the original multimer PDB file (e.g., example.pdb) into an unbound pair of PDB files (e.g., example_l_u.pdb and example_r_u.pdb)? @amorehead
@terry-r123,
For your question about DeepHomo, if I recall correctly, this dataset represents each binary complex as a PDB file containing two chains, A and B. In this case, if you want to split each of these DeepHomo PDB files into separate left-bound and right-bound PDB files for chains A and B, respectively, I would recommend using a PDB manipulation tool such as the excellent pdb-tools. Specifically, I believe their documentation on the pdb_selchain function may be relevant for your situation:
pdb_selchain
Extracts one or more chains from a PDB file.
Usage:
python pdb_selchain.py -<chain id> <pdb file>
Example:
python pdb_selchain.py -C 1CTF.pdb # selects chain C
python pdb_selchain.py -A,C 1CTF.pdb # selects chains A and C
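If installing pdb-tools is inconvenient, the same chain split can also be sketched with plain-text parsing of ATOM/HETATM records (a minimal sketch; the `split_chains` helper and the output filenames are illustrative, following the example_l_b.pdb/example_r_b.pdb naming convention from earlier in the thread):

```python
def split_chains(pdb_path, chain_to_outfile):
    """Write each requested chain's ATOM/HETATM/TER records to its own PDB file.

    `chain_to_outfile` maps a chain ID (e.g., 'A') to an output file path.
    """
    handles = {c: open(out, "w") for c, out in chain_to_outfile.items()}
    try:
        with open(pdb_path) as f:
            for line in f:
                if line.startswith(("ATOM", "HETATM", "TER")):
                    chain_id = line[21]  # column 22 of a PDB record holds the chain ID
                    if chain_id in handles:
                        handles[chain_id].write(line)
        for h in handles.values():
            h.write("END\n")
    finally:
        for h in handles.values():
            h.close()

# Example: split a two-chain complex into left-bound and right-bound files
# split_chains("example.pdb", {"A": "example_l_b.pdb", "B": "example_r_b.pdb"})
```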
Regarding your question about 'unbound' interface prediction, in our paper, we describe that DeepInteract's primary training dataset is DIPS-Plus, which consists of 'bound' pairs of protein chains for each complex. Nonetheless, we evaluate the models included in our benchmarks for 'unbound' interface prediction using the DB5-Plus dataset to make sure the models are generalizing to the harder case of inferring inter-chain contact points prior to protein-protein docking.
@amorehead ,
Thank you very much for the Q&A and the PDB manipulation tool, I get it!
When using DB5 for evaluation, how are the ground-truth contact labels obtained? Is it directly from the unbound complex?
@terry-r123, the ground-truth contact labels (following many previous works on interchain contact prediction) are obtained directly from the bound version of each complex. In doing so, our machine learning task in a certain sense becomes one of predicting how two protein chains will alter their structures upon being docked together. In other words, our labels seek to guide the model to learn where two unbound proteins are likely to interact after binding (thus, we derive our ground-truth contact points from the bound version of a complex).
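As a concrete illustration, deriving such ground-truth labels from a bound complex typically reduces to thresholding inter-chain residue distances (a minimal sketch; the 8 Å Cα cutoff is a common convention in the inter-chain contact prediction literature, not necessarily DeepInteract's exact definition):

```python
import numpy as np

def interchain_contact_map(ca_coords_l, ca_coords_r, threshold=8.0):
    """Binary inter-chain contact map from per-residue Calpha coordinates.

    A residue pair is labeled 'in contact' if its Calpha-Calpha distance
    in the bound complex is below `threshold` angstroms.
    """
    l = np.asarray(ca_coords_l, dtype=float)  # shape (n_l, 3)
    r = np.asarray(ca_coords_r, dtype=float)  # shape (n_r, 3)
    # Pairwise Euclidean distances via broadcasting: (n_l, n_r)
    dists = np.linalg.norm(l[:, None, :] - r[None, :, :], axis=-1)
    return (dists < threshold).astype(np.int8)
```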
Oh, @amorehead, can you tell me where the ground truth of the dataset is?
Also, I found that the CASP-CAPRI-19 processed dataset you provided is empty; can you provide another one?
And when I try to read the CASP-CAPRI-19 raw dataset you provided, this error is shown:
What about the db5 test set, are its ground-truth contact labels also from the bound complex?
@amorehead, and when I process the 'raw' .dill files into 'processed' .dill files, this error is shown:
Using backend: pytorch
Global seed set to 42
datasets/CASP_CAPRI/final/raw/pairs-postprocessed-test.txt
False
0
0 cp/6cp8.pdb_0.dill
1 cw/7cwp.pdb_0.dill
2 d2/6d2v.pdb_0.dill
3 d7/6d7y.pdb_0.dill
4 e4/6e4b.pdb_0.dill
5 fx/6fxa.pdb_0.dill
6 hr/6hrh.pdb_0.dill
7 m5/7m5f.pdb_0.dill
8 mx/6mxv.pdb_0.dill
9 n6/6n64.pdb_0.dill
10 n9/6n91.pdb_0.dill
11 nq/6nq1.pdb_0.dill
12 qe/6qek.pdb_0.dill
13 tr/6tri.pdb_0.dill
14 ub/6ubl.pdb_0.dill
15 uk/6uk5.pdb_0.dill
16 w6/5w6l.pdb_0.dill
17 xo/6xod.pdb_0.dill
18 ya/6ya2.pdb_0.dill
project/datasets/CASP_CAPRI/final/processed/cp/6cp8.pdb_0.dill
Traceback (most recent call last):
  File "lit_model_test.py", line 182, in <module>
    main(args)
  File "lit_model_test.py", line 46, in main
    picp_data_module.setup()
  File "/home/user/miniconda/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 428, in wrapped_fn
    fn(*args, **kwargs)
  File "/ryc/DeepInteract/project/datasets/PICP/picp_dgl_data_module.py", line 98, in setup
    self.casp_capri_test = CASPCAPRIDGLDataset(mode='test', raw_dir=self.casp_capri_data_dir, knn=self.knn,
  File "/ryc/DeepInteract/project/datasets/CASP_CAPRI/casp_capri_dgl_dataset.py", line 120, in __init__
    self.process()
  File "/ryc/DeepInteract/project/datasets/CASP_CAPRI/casp_capri_dgl_dataset.py", line 165, in process
    process_complex_into_dict(raw_filepath, processed_filepath, self.knn,
  File "/ryc/DeepInteract/project/utils/deepinteract_utils.py", line 935, in process_complex_into_dict
    graph1 = convert_df_to_dgl_graph(all_atom_df0, raw_filepath, knn, geo_nbrhd_size, self_loops)
  File "/ryc/DeepInteract/project/utils/deepinteract_utils.py", line 469, in convert_df_to_dgl_graph
    edges_transformed = edges[1].reshape(1, len(struct_df), knn)
RuntimeError: shape '[1, 157, 20]' is invalid for input of size 2983
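For context, this RuntimeError means the KNN edge tensor holds 2983 entries, while the reshape expects exactly len(struct_df) * knn = 157 * 20 = 3140 of them; a plausible cause (an assumption on my part, not confirmed in the thread) is the KNN graph construction returning fewer neighbors than requested, e.g., when there are too few candidate atoms or duplicate coordinates. A minimal NumPy illustration of the shape mismatch (variable names here are illustrative):

```python
import numpy as np

# The reshape assumes exactly n_nodes * knn edge entries (157 * 20 = 3140):
n_nodes, knn = 157, 20
full = np.zeros(n_nodes * knn)
assert full.reshape(1, n_nodes, knn).shape == (1, 157, 20)  # succeeds

# But with only 2983 entries (as in the traceback), the same reshape fails
# (NumPy raises ValueError; PyTorch raises RuntimeError with an analogous message):
short = np.zeros(2983)
try:
    short.reshape(1, n_nodes, knn)
    raised = False
except ValueError:
    raised = True
assert raised
```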
@terry-r123, yes, the ground-truth contact labels for the DB5 test dataset are derived from bound complexes.
@amorehead, I found that the CASP-CAPRI-19 processed dataset is empty; can you upload a new link for downloading it? I also found that when reading the CASP-CAPRI-19 raw dataset, an error is shown:
@onlyonewater,
I will look into the empty processed dataset issue you mentioned. In the meantime, the ModuleNotFoundError you see here most likely comes from not having atom3-py3 installed in your Conda environment. I recommend running pip3 install atom3-py3==0.1.9.8 to resolve it.
Ok, thanks!! And does your raw dataset contain the ground truth of the contact map?
Ok, now I can load the CASP-CAPRI-19 raw dataset successfully! And how can I get the ground-truth labels for the CASP-CAPRI-19 dataset?
Ok, now I know how to get the ground-truth labels from the raw data.
Thanks!!!
I have now successfully processed the raw .dill files into processed .dill files!
@terry-r123 and @onlyonewater,
Currently, my availability to debug these errors is more limited than I would like. I will do my best to return to your questions as soon as I can.
Thanks!!!
@terry-r123 and @onlyonewater,
I just finished reprocessing the CASP-CAPRI proteins into Python dictionary files which contain the DGL graphs necessary for model inference/testing. Somehow the original copies of these files were silently corrupted on my storage backups. You can find the direct download link for these processed dictionary files on Zenodo. Let me know if you have any questions about these new files.
Thanks for your reply and work!
Hi, @amorehead, can you provide the original 32 PDB files for the DIPS-Plus dataset and the 55 PDB files for the DB5 dataset? And how does one process the original PDB files into .pdb.dill files for this project?
Thanks!