Closed: rubenalv closed this issue 1 year ago
I wrote an example script to extract one chain from a .dill file and save it as a PDB, in case anyone would like to use it as a starting point. I was not familiar with the format of the pandas dataframes or where they came from, so even if this is obvious to those in the field, I hope it helps. Many thanks for building a curated database like this.
```python
from Bio.PDB.PDBIO import PDBIO  # i/o for structures
from atom3.structure import residue_to_pdbresidue, df_to_pdb  # PDBIO wrapper
from dill import load as dload  # to read the .dill files

with open("1h9r.pdb1_0.dill", 'rb') as p:
    pdb = dload(p)

pdbst = pdb.df0  # get the first chain (the .dill contains the pair of chains)
pdbst = df_to_pdb(pdbst)  # convert to Structure class
io = PDBIO()  # create the i/o object
io.set_structure(pdbst)  # assign the structure
io.save("bio-pdb-pdbio-out.pdb")  # save as pdb
```
Hi, I would also like to learn how to use that method. Have you solved it yet? How can I convert 2dj5.pdb1_0 to a data format that is recognized by MaSIF? Thanks very much.
@orange2350, once you have converted the .dill to .pdb, remember to protonate the pdb structures so that the charge features are computed correctly (the protonation code is in the MaSIF GitHub).
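In case it is useful, here is a minimal sketch of such a protonation step using the `reduce` program (which, as far as I can tell, is what the MaSIF protonation helper wraps); the executable is assumed to be on your PATH and the file names are placeholders:

```python
import subprocess

def protonate(in_pdb, out_pdb, tmp_pdb="tmp_trimmed.pdb"):
    """Sketch: strip existing hydrogens with `reduce -Trim`, then re-add them.
    Assumes the `reduce` executable is installed and on PATH."""
    trimmed = subprocess.run(["reduce", "-Trim", in_pdb],
                             capture_output=True, text=True).stdout
    with open(tmp_pdb, "w") as f:
        f.write(trimmed)
    protonated = subprocess.run(["reduce", "-HIS", tmp_pdb],
                                capture_output=True, text=True).stdout
    with open(out_pdb, "w") as f:
        f.write(protonated)

protonate("1H9R_A.pdb", "1H9R_A_protonated.pdb")  # placeholder file names
```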
I used this to convert formats:
```python
from Bio.PDB.PDBIO import PDBIO
from Bio.PDB.Structure import Structure as BioStructure
from Bio.PDB.Model import Model as BioModel
from Bio.PDB.Chain import Chain as BioChain
from atom3.structure import residue_to_pdbresidue
from dill import load as dload
from glob import glob
from re import sub


def df_to_pdb(df_in):
    """Convert an atom dataframe to a Bio.PDB Structure."""
    df = df_in.copy()
    new_structure = BioStructure('')
    for (model, m_atoms) in df.groupby('model', sort=False):
        new_model = BioModel(model)
        for (chain, c_atoms) in m_atoms.groupby('chain', sort=False):
            new_chain = BioChain(chain)
            for (residue, r_atoms) in c_atoms.groupby('residue', sort=False):
                new_residue = residue_to_pdbresidue(residue, r_atoms)
                new_chain.add(new_residue)
            new_model.add(new_chain)
        new_structure.add(new_model)
    return new_structure


def pdbname(pdb, df):
    """Return [pdb_name, pdb_name_chain] for one dataframe of the pair."""
    out_pdb = ''.join(set(pdb[df]['pdb_name']))
    out_pdb = sub(r"\..*", "", out_pdb).upper()
    out_chain = out_pdb + '_' + ''.join(set(pdb[df]['chain']))
    return [out_pdb, out_chain]


def savepdbs(pdb):
    # extract the pdb name and chain
    pdb1 = pdbname(pdb, 1)
    pdb2 = pdbname(pdb, 2)
    if pdb1[0] != pdb2[0]:
        raise Exception("The pdb names do not match between the chains in " + pdb1[1] + "; " + pdb2[1])
    if pdb1[1] == pdb2[1]:
        raise Exception("The pdb and chain names are identical in " + pdb1[1] + "; " + pdb2[1])
    # make the paired chain name
    pdb_pair = pdb1[1] + sub(".*_", "_", pdb2[1])
    # get and save the first chain (the .dill file contains two chains)
    pdbst = df_to_pdb(pdb.df0)
    io = PDBIO()
    io.set_structure(pdbst)
    io.save(pdb1[1] + ".pdb")
    # get and save the second chain
    pdbst = df_to_pdb(pdb.df1)
    io = PDBIO()
    io.set_structure(pdbst)
    io.save(pdb2[1] + ".pdb")
    return pdb_pair


# get all .dill files
dillfiles = glob("../complexes/pairs-pruned/**/*.dill", recursive=True)  ## hardcoded

# convert each dill into dmasif format
dmasif_pairs = []
for f in dillfiles:
    with open(f, 'rb') as p:
        pdb = dload(p)
    dmasif_pairs.append(savepdbs(pdb))

with open("pdb_pairs_list.txt", "a") as f:
    [f.write(p + '\n') for p in dmasif_pairs]

## a few pdbs contain the ATOMs "XD1" and "XD2" in ASX (asparagine or aspartate) residues,
## which cause an error when converted to numpy arrays (function load_pdb), so delete them:
# rm 1KP0_A.pdb 1KP0_B.pdb 2ATC_A.pdb
#
#   warnings.warn(msg, PDBConstructionWarning)
# [...]Bio/PDB/Atom.py:218: PDBConstructionWarning: Could not assign element 'X' for Atom (name=XD2) with given element ''
#   warnings.warn(msg, PDBConstructionWarning)
# [...]Bio/PDB/PDBParser.py:395: PDBConstructionWarning: Ignoring unrecognized record 'END' at line 4818
#   warnings.warn(
# 27422it [16:22, 27.90it/s]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 4, in convert_pdbs
#   File "<stdin>", line 12, in load_structure_np
# KeyError: ''
```
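As an alternative to deleting those files, the offending ASX pseudo-atoms could be dropped from the dataframe before conversion. A sketch (the `resname` and `atom_name` column names are assumptions about the atom3 dataframe layout; verify them with `df.columns` on your own .dill files):

```python
def drop_asx_pseudoatoms(df):
    """Sketch: remove the XD1/XD2 pseudo-atoms of ASX residues before df_to_pdb().
    'resname' and 'atom_name' are assumed column names; check df.columns first."""
    bad = df['resname'].eq('ASX') & df['atom_name'].isin(['XD1', 'XD2'])
    return df[~bad].copy()

# e.g. pdbst = df_to_pdb(drop_asx_pseudoatoms(pdb.df0))
```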
And I used this to convert the train/test sets into the MaSIF format. I found duplicated pairs, but just removed them. Perhaps you would not mind taking a look and opening an issue here if the duplicates are indeed in DIPS-plus?
```python
########## script2 ##########
## get the training/validation sets from the .dill pairs listed in
## pairs-postprocessed-val.txt and pairs-postprocessed-train.txt
from dill import load as dload
from re import sub


def pdbname(pdb, df):
    """Return [pdb_name, pdb_name_chain] for one dataframe of the pair."""
    out_pdb = ''.join(set(pdb[df]['pdb_name']))
    out_pdb = sub(r"\..*", "", out_pdb).upper()
    out_chain = out_pdb + '_' + ''.join(set(pdb[df]['chain']))
    return [out_pdb, out_chain]


with open("pairs-postprocessed-val.txt", "r") as f:
    val = f.read().splitlines()
with open("pairs-postprocessed-train.txt", "r") as f:
    train = f.read().splitlines()

# build the pair names from the listed .dill files
dmasif_train = []
for f in train:
    with open(f, 'rb') as p:
        pdb = dload(p)
    pdb1 = pdbname(pdb, 1)
    pdb2 = pdbname(pdb, 2)
    pdb_pair = pdb1[1] + sub(".*_", "_", pdb2[1])
    dmasif_train.append(pdb_pair)

dmasif_val = []
for f in val:
    with open(f, 'rb') as p:
        pdb = dload(p)
    pdb1 = pdbname(pdb, 1)
    pdb2 = pdbname(pdb, 2)
    pdb_pair = pdb1[1] + sub(".*_", "_", pdb2[1])
    dmasif_val.append(pdb_pair)

##### need to update code to remove duplicate pairs!!
# with open("pairs_postprocessed_train_dmasif.txt", "w") as f:
#     [f.write(i + "\n") for i in dmasif_train]
# with open("pairs_postprocessed_val_dmasif.txt", "w") as f:
#     [f.write(i + "\n") for i in dmasif_val]
```
@rubenalv Thank you very much, I made it!!!:)
[main question] Hello, could you give a hint as to how to prepare the data for training MaSIF/dMaSIF? The input for training is single-chain pdbs (e.g. pdbID1_chainA.pdb and pdbID1_chainB.pdb) plus a text list of pairs (e.g. "pdbID1_chainA_chainB2"). DIPS-plus keeps the features and processed pairs inside the .dill files. I have thought of taking the pair names from the .dill files and then splitting the raw pdbs accordingly, but I was wondering whether there is an easier way that makes better use of your native pipeline.
[not related to DIPS-plus, so I cannot ask you to answer] Also, are the training pairs supposed to be interacting chains only (i.e. if a complex has 4 chains and they interact as 1-2, 2-3, 3-4, is a pair 1-4 valid)? I realise these are independent questions, but I am new to machine learning and my Python is not yet at a high level, and just getting dMaSIF to do training and inference on my own data has been a feat... so I thought retraining it with your marvellous DIPS-plus was a good idea.
[Edit] Now that I have looked through the issues, I'll take the route suggested there. Please feel free not to answer this one unless you see something that really needs answering!