Data Preparation for HOLO4K

hi7049 commented 2 years ago

Hi, I have some questions about data preparation.

1. In your paper, you mention that "The proteins and ligands were separated from the corresponding structure files using the Biopython library", but I can't find the corresponding code in this repo. Could you share that part of the code?
2. A PDB file in HOLO4K may contain several ligands. Do you keep all of them or remove some? What criteria do you use to choose the ligands in a PDB file?
3. When you use Fpocket to choose pocket candidates, do you run it on the original PDB file, on the PDB file without ligands, or on a single chain of the PDB file?

RishalAggarwal commented 2 years ago

import os
from Bio.PDB import PDBParser, PDBIO, Select, Polypeptide

# Root directory that holds the HOLO4K files listed in holo4k.txt
data_dir = '/scratch/rishal/pbsp/data/'
parser = PDBParser()


class LigSelect(Select):
    """Keep only residues whose name matches the requested ligand code."""

    def __init__(self, ligand):
        self.ligand = ligand

    def accept_residue(self, residue):
        # Match the ligand code stored on this selector and skip standard amino acids
        return (residue.get_resname() == self.ligand
                and not Polypeptide.is_aa(residue, standard=True))


# Each line of holo4k.txt: <relative pdb path> <comma-separated ligand codes>
with open('holo4k.txt', 'r') as f:
    for line in f:
        fields = line.split()
        pdb_path = os.path.join(data_dir, fields[0])
        # Per-structure output directory under holo4k/Dataset
        out_dir = os.path.join(data_dir, fields[0].split('.pdb')[0].replace('holo4k', 'holo4k/Dataset'))
        if not os.path.exists(out_dir):
            print(line)
            continue
        # Parse the structure once, then write one file per ligand code
        structure = parser.get_structure("protein", pdb_path)
        io = PDBIO()
        io.set_structure(structure)
        for ligand_num, ligand in enumerate(fields[1].split(',')):
            io.save(os.path.join(out_dir, 'ligand_new' + str(ligand_num) + '.pdb'), LigSelect(ligand))

You can get "holo4k.txt" from here: https://github.com/rdk/p2rank-datasets/blob/master/holo4k(mlig).ds
We remove all ligands from the file (they appear as hetero atoms, i.e. HETATM records, in the PDB file).
Fpocket is run on the PDB file without ligands.
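
To make the two points above concrete, here is a minimal pure-Python sketch of the same idea without Biopython: read one dataset line in the `<pdb path> <LIG1,LIG2,...>` layout that the script above assumes, then split the PDB text into a ligand-free part and one part per ligand code. The function names `parse_dataset_line` and `split_pdb_text` are made up for illustration, and this is not the code used for the paper. It also assumes that dropping every HETATM record is acceptable, which removes waters and modified residues as well.

```python
def parse_dataset_line(line):
    """Split '<pdb path> <LIG1,LIG2,...>' into (path, [ligand codes])."""
    fields = line.split()
    ligands = fields[1].split(',') if len(fields) > 1 else []
    return fields[0], ligands


def split_pdb_text(pdb_text, ligand_codes):
    """Return (protein_text, {ligand_code: ligand_text}).

    The protein text keeps every record except HETATM. Each ligand text keeps
    only the HETATM records whose residue name matches that ligand code.
    """
    protein_lines = []
    ligand_lines = {code: [] for code in ligand_codes}
    for record in pdb_text.splitlines():
        if record.startswith('HETATM'):
            # Residue name occupies columns 18-20 of a PDB coordinate record
            res_name = record[17:20].strip()
            if res_name in ligand_lines:
                ligand_lines[res_name].append(record)
        else:
            protein_lines.append(record)
    protein_text = '\n'.join(protein_lines)
    ligand_texts = {code: '\n'.join(lines) for code, lines in ligand_lines.items()}
    return protein_text, ligand_texts
```

The ligand-free text returned first is what would be written to disk and passed to Fpocket. Each entry of the returned dictionary corresponds to one `ligand_new<N>.pdb` file in the script above.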

hi7049 commented 2 years ago

Thank you for your reply and the code. I understand it now.

mainguyenanhvu commented 1 year ago

@hi7049 have you re-run the data preparation on a custom dataset? If so, could you please help me?

I am trying to follow the instructions to prepare data for training a new classifier. I am stuck at the make_types step because I can't find the train.txt and test.txt files.

Moreover, I have 4 questions:

1. If I want to add several PDB files to the available scPDB dataset, how can I do that?
2. The data preparation instructions only work for a single PDB file, don't they? If that is the case, I will need to write a pipeline to wrap them.
3. How do I prepare the train.txt and test.txt files needed to run make_types.py?
4. Could you please show me which files/folders from the previous steps are needed as input to each step?

I tried this on this PDB file.

Thank you very much.
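
For the pipeline question above, one possible shape for a wrapper is sketched below. It only assumes that the single-file steps can be wrapped in a Python callable taking one path. The names `run_on_all` and `split_names` are hypothetical, and the thread does not say what make_types.py expects inside train.txt and test.txt, so the sketch stops at producing two lists of names rather than guessing a file layout.

```python
import random
from pathlib import Path


def run_on_all(pdb_dir, step):
    """Apply `step` (a callable taking one Path) to every .pdb file in pdb_dir.

    Returns (succeeded, failed), where failed holds (path, error message) pairs
    so that one bad structure does not stop the whole batch.
    """
    succeeded, failed = [], []
    for pdb_path in sorted(Path(pdb_dir).glob('*.pdb')):
        try:
            step(pdb_path)
            succeeded.append(pdb_path)
        except Exception as exc:
            failed.append((pdb_path, str(exc)))
    return succeeded, failed


def split_names(names, test_fraction=0.2, seed=0):
    """Shuffle names reproducibly and split them into (train, test) lists."""
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

A typical use would be to call `run_on_all` with a function that performs the per-file preparation, then pass the stems of the files that succeeded to `split_names`.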

devalab / DeepPocket

Data Preparation for HOLO4K #10