Do not have files for running make_types.py when preparing custom data for training a new classifier

mainguyenanhvu commented 1 year ago

I am trying to follow your instructions to prepare data for training a new classifier. I am stuck at the make_types step because I can't find the train.txt and test.txt files.

Moreover, I have 4 questions:

1) If I want to add several pdb files to the existing scPDB dataset, how can I do that?
2) Your instructions for preparing data only work for a single pdb file, don't they? If so, I will need to write a pipeline to wrap them.
3) How do I prepare the train.txt and test.txt files needed to run make_types.py?
4) Could you please show me which files/folders from the previous steps are needed as input to each step?

I tried it on this pdb.

Thank you very much.

RishalAggarwal commented 1 year ago

Hey, thanks for your interest.

1) To add more files to the scPDB dataset, you will have to create new types/molcache files for training.
2) The first 4 steps are for a single pdb, yes.
3) I believe the train.txt and test.txt files just need to contain the protein-ligand complexes you are training/testing on.
4) I'm not sure what you mean here, but you need to ensure (unfortunately in the scripts) that all the file paths are valid.

Let me know if you are facing any errors in the process.
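For point 3, a minimal sketch of how the two list files could be produced. It assumes one complex identifier per line, which this thread does not confirm, so check what make_types.py actually parses before relying on it. The function name `write_split`, the 80/20 ratio, and the example IDs are hypothetical.

```python
import random
from pathlib import Path


def write_split(complex_ids, out_dir=".", test_fraction=0.2, seed=0):
    """Shuffle complex IDs and write train.txt / test.txt, one ID per line.

    ASSUMPTION: one identifier per line is the expected format; verify
    against make_types.py before using the output for training.
    """
    ids = sorted(set(complex_ids))      # de-duplicate, deterministic start order
    random.Random(seed).shuffle(ids)    # reproducible shuffle

    # Hold out at least one complex for testing when there is more than one.
    n_test = max(1, round(len(ids) * test_fraction)) if len(ids) > 1 else 0
    test_ids, train_ids = ids[:n_test], ids[n_test:]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.txt").write_text("".join(f"{i}\n" for i in train_ids))
    (out / "test.txt").write_text("".join(f"{i}\n" for i in test_ids))
    return train_ids, test_ids


if __name__ == "__main__":
    # Hypothetical complex names, for illustration only.
    train, test = write_split(["1abc", "2xyz", "3def", "4ghi", "5jkl"])
    print(len(train), "training complexes,", len(test), "test complexes")
```

A fixed seed keeps the split reproducible, so the same complexes end up in test.txt every time the types files are regenerated.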

mainguyenanhvu commented 1 year ago

Thanks for your reply. I added code to run the data preparation pipeline and opened a pull request. If you have time, please check whether I made any mistakes.

In reply to your answers:

I understand that train.txt and test.txt contain the names of the pdb files I would like to use to generate the database.
After reading your code, I understood the input and output of each step.

Besides, I would like to ask you:

In https://github.com/devalab/DeepPocket/blob/main/make_types.py, you use a _ligand.sdf file. I would like to know what this file contains and how to create it. I tried to create it from the original pdb file by exporting only the ligand. Could you please send me a few pdb/sdf file pairs so I can understand it more easily?
If bary_centers.txt is empty after running get_centers.py, how can I fix it?
I would like to have your prepared datasets (scPDB, ...). Could you please send them to me?

When I run types_gninatyper.py, it prints these warnings. How can I fix them?

==============================
*** Open Babel Warning  in parseAtomRecord
WARNING: Problems reading a PDB file
Problems reading a HETATM or ATOM record.
According to the PDB specification,
columns 79-80 should contain charge of the atom
but OpenBabel found ' 0' (atom 5489).
==============================
*** Open Babel Warning  in parseAtomRecord
WARNING: Problems reading a PDB file
Problems reading a HETATM or ATOM record.
According to the PDB specification,
columns 79-80 should contain charge of the atom
but OpenBabel found ' 0' (atom 5490).

Please help me. Thank you very much.

RishalAggarwal commented 1 year ago

Thank you for the pull request, I will check it when I get more time.

Yes, the *_ligand.sdf file contains only the ligand coordinates. You can extract them from the HETATM records of any pdb file (be sure not to include waters).
If bary_centers.txt is empty, it probably means fpocket did not identify a pocket in that protein. Since DeepPocket depends on the pockets found by fpocket, there's no fix for this.
Datasets are available at the link provided in README.md. Is there anything in particular you are looking for?
The warning comes from Open Babel and is safe to ignore.
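The ligand extraction described above can be sketched in plain Python. This is an illustration, not the script used to build the released datasets. It relies only on the PDB column layout (record name in columns 1-6, residue name in columns 18-20). The function name and the set of water residue names are assumptions. The result is still in PDB format, so a separate conversion step with a cheminformatics tool such as Open Babel would be needed to get an .sdf file.

```python
# Residue names treated as water. ASSUMPTION: extend this set if your files
# use other names, or if ions/cofactors should be excluded as well.
WATER_NAMES = {"HOH", "WAT", "H2O", "DOD"}


def extract_ligand_records(pdb_lines, exclude=WATER_NAMES):
    """Return the HETATM lines whose residue name is not in `exclude`."""
    kept = []
    for line in pdb_lines:
        if not line.startswith("HETATM"):
            continue                      # skip ATOM, TER, CONECT, ...
        resname = line[17:20].strip()     # PDB columns 18-20
        if resname in exclude:
            continue                      # drop waters
        kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    import sys

    # Usage: python extract_ligand.py complex.pdb > ligand.pdb
    with open(sys.argv[1]) as handle:
        for record in extract_ligand_records(handle):
            print(record)
```

Every non-water HETATM record passes through, including ions and cofactors, so inspect the output before converting it if the structure contains more than one hetero group.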

Do not have files for running make_types.py when preparing custom data for training a new classifier #26