BioinfoMachineLearning / DIPS-Plus

The Enhanced Database of Interacting Protein Structures for Interface Prediction
https://zenodo.org/record/5134732
GNU General Public License v3.0

preparing data for training MaSIF #22

Closed: rubenalv closed this issue 1 year ago

rubenalv commented 1 year ago

[main question] Hello, could you give a hint as to how to prepare the data for training MaSIF/dMaSIF? The input for training is a set of single-chain PDBs (e.g. pdbID1_chainA.pdb and pdbID1_chainB.pdb) plus a text list of pairs (e.g. "pdbID1_chainA_chainB"). DIPS-Plus stores the processed pairs, with their features, inside the .dill files. I have thought of taking the pair names from the .dill files and then splitting the raw PDBs accordingly, but I was wondering if there is an easier way that makes better use of your native pipeline.
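
To make the target layout concrete, what I am aiming for is something like this (names as in the example above; the list file name is just illustrative):

pdbID1_chainA.pdb   # first chain of the complex, as its own PDB file
pdbID1_chainB.pdb   # second chain of the complex
pairs.txt           # text list with one pair per line, e.g. "pdbID1_chainA_chainB"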

[not related to DIPS-Plus, so I can't really ask you to answer] Also, are the training pairs supposed to be interacting chains only (i.e. if a complex has 4 chains and they interact as 1-2, 2-3, 3-4, is a pair 1-4 valid)? I realise these are independent questions, but I am new to machine learning and my Python is not yet very advanced, and just getting dMaSIF to do training and inference on my own data has been a feat... so I thought retraining it with your marvellous DIPS-Plus was a good idea.

[Edit] Now that I have looked through the existing issues, I'll take the route suggested there. Please feel free not to answer this issue unless you see something that really needs answering!

rubenalv commented 1 year ago

I wrote an example script that extracts one chain from a .dill file and saves it as a PDB, in case anyone would like to use it as a starting point. I was not familiar with the format of the pandas DataFrames or where they came from, so even if it is obvious to those in the field, I hope this helps. Many thanks for building a curated database like this.

from Bio.PDB.PDBIO import PDBIO                                # i/o for structures
from atom3.structure import residue_to_pdbresidue, df_to_pdb   # DataFrame -> Bio.PDB Structure converters
from dill import load as dload                                 # to read the .dill files

with open("1h9r.pdb1_0.dill", 'rb') as p:
    pdb = dload(p)

pdbst = pdb.df0                       # get the first chain (the .dill contains the pair of chains)
pdbst = df_to_pdb(pdbst)              # convert the DataFrame to a Bio.PDB Structure
io = PDBIO()                          # create the i/o object
io.set_structure(pdbst)               # assign the structure
io.save("bio-pdb-pdbio-out.pdb")      # save as pdb

orange2350 commented 1 year ago

Hi, I would also like to learn how to use that method; have you solved it yet? How can I convert 2dj5.pdb1_0 to a data format that is recognized by MaSIF? Thanks very much.

rubenalv commented 1 year ago

@orange2350, once you have converted the .dill to .pdb, remember to protonate the PDB structures to ensure the charges are embedded correctly (the protonation code is in the MaSIF GitHub repository).
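
For reference, a minimal sketch of that protonation step, assuming the reduce executable is installed and on the PATH (this mirrors, as far as I can tell, what MaSIF's protonate helper does: strip any existing hydrogens, then add them back):

from subprocess import run

def protonate(in_pdb, out_pdb):
    # 1) remove any hydrogens already present (reduce writes the result to stdout)
    trimmed = run(["reduce", "-Trim", in_pdb], capture_output=True, text=True).stdout
    tmp = out_pdb + ".trim"
    with open(tmp, "w") as f:
        f.write(trimmed)
    # 2) add hydrogens back and save the protonated structure
    protonated = run(["reduce", "-HIS", tmp], capture_output=True, text=True).stdout
    with open(out_pdb, "w") as f:
        f.write(protonated)

protonate("1H9R_A.pdb", "1H9R_A.protonated.pdb")   # example file names only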

I used this to convert formats:

from Bio.PDB.PDBIO import PDBIO
from Bio.PDB.Structure import Structure as BioStructure
from Bio.PDB.Model import Model as BioModel
from Bio.PDB.Chain import Chain as BioChain
from atom3.structure import residue_to_pdbresidue
from dill import load as dload
from glob import glob
from re import sub

def df_to_pdb(df_in):
    """Convert an atom3 atom DataFrame into a Bio.PDB Structure."""
    df = df_in.copy()
    new_structure = BioStructure('')
    for (model, m_atoms) in df.groupby('model', sort=False):
        new_model = BioModel(model)
        for (chain, c_atoms) in m_atoms.groupby('chain', sort=False):
            new_chain = BioChain(chain)
            for (residue, r_atoms) in c_atoms.groupby('residue', sort=False):
                new_residue = residue_to_pdbresidue(residue, r_atoms)
                new_chain.add(new_residue)
            new_model.add(new_chain)
        new_structure.add(new_model)
    return new_structure

def pdbname(pdb, df):
    """Return [PDB code, PDB code + '_' + chain id] for pair index df (1 = df0, 2 = df1)."""
    out_pdb   = ''.join(set(pdb[df]['pdb_name']))
    out_pdb   = sub(r"\..*", "", out_pdb).upper()   # drop everything after the first '.' and upper-case the code
    out_chain = out_pdb + '_' + ''.join(set(pdb[df]['chain']))
    return [out_pdb, out_chain]

def savepdbs(pdb):
    # extract the pdb name and chain for each side of the pair
    pdb1 = pdbname(pdb, 1)
    pdb2 = pdbname(pdb, 2)
    if pdb1[0] != pdb2[0]:
        raise Exception("The pdb names do not match between the chains in " + pdb1[1] + "; " + pdb2[1])
    if pdb1[1] == pdb2[1]:
        raise Exception("The pdb and chain names are identical in " + pdb1[1] + "; " + pdb2[1])

    # make the paired chain name, e.g. '1H9R_A' + '_B' -> '1H9R_A_B'
    pdb_pair = pdb1[1] + sub(".*_", "_", pdb2[1])

    # get and save the first chain (the .dill file contains two chains)
    pdbst = df_to_pdb(pdb.df0)
    io = PDBIO()
    io.set_structure(pdbst)
    io.save(pdb1[1] + ".pdb")

    # get and save the second chain
    pdbst = df_to_pdb(pdb.df1)
    io = PDBIO()
    io.set_structure(pdbst)
    io.save(pdb2[1] + ".pdb")

    return pdb_pair

# get all .dill files
dillfiles = glob("../complexes/pairs-pruned/**/*.dill", recursive=True)  ## hardcoded path

# convert each dill into dmasif format (two single-chain pdbs plus the pair name)
dmasif_pairs = []
for f in dillfiles:
    with open(f, 'rb') as p:
        pdb = dload(p)
    dmasif_pairs.append(savepdbs(pdb))

with open("pdb_pairs_list.txt", "a") as f:
    for p in dmasif_pairs:
        f.write(p + '\n')

## a few pdbs contain the atoms "XD1" and "XD2" in ASX residues (asparagine or aspartate); these throw an error when converted to numpy arrays (function load_pdb), so delete them:
# rm 1KP0_A.pdb 1KP0_B.pdb 2ATC_A.pdb

#  warnings.warn(msg, PDBConstructionWarning)
# [...]Bio/PDB/Atom.py:218: PDBConstructionWarning: Could not assign element 'X' for Atom (name=XD2) with given element ''
#  warnings.warn(msg, PDBConstructionWarning)
# [...]Bio/PDB/PDBParser.py:395: PDBConstructionWarning: Ignoring unrecognized record 'END' at line 4818
#  warnings.warn(
#27422it [16:22, 27.90it/s]
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "<stdin>", line 4, in convert_pdbs
#  File "<stdin>", line 12, in load_structure_np
# KeyError: ''
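
Rather than finding the offending structures one by one from the traceback, a small scan of the written chain files can list them up front (a sketch that only looks at the atom-name field of the ATOM records):

from glob import glob

# report every single-chain pdb whose ATOM records contain the XD1/XD2 atom names
bad = []
for path in glob("*.pdb"):
    with open(path) as fh:
        if any(line.startswith("ATOM") and line[12:16].strip() in ("XD1", "XD2")
               for line in fh):
            bad.append(path)
print(bad)   # delete these before running the conversion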

And I used the script below to convert the train/validation sets into the MaSIF format. I found duplicated pairs but just removed them; perhaps you would not mind taking a look and opening an issue here, if the duplicates are indeed in DIPS-Plus?

########## script2 ########
## get the training/validation sets from the .dill pairs in pairs-postprocessed-val.txt and pairs-postprocessed-train.txt
from dill import load as dload
from re import sub

def pdbname(pdb, df):
    """Return [PDB code, PDB code + '_' + chain id] for pair index df (1 = df0, 2 = df1)."""
    out_pdb   = ''.join(set(pdb[df]['pdb_name']))
    out_pdb   = sub(r"\..*", "", out_pdb).upper()
    out_chain = out_pdb + '_' + ''.join(set(pdb[df]['chain']))
    return [out_pdb, out_chain]

with open("pairs-postprocessed-val.txt", "r") as f:
    val=[l for l in f.read().splitlines()]

with open("pairs-postprocessed-train.txt", "r") as f:
    train=[l for l in f.read().splitlines()]

# convert each training .dill pair into a MaSIF-style pair name
dmasif_train = []
for f in train:
    with open(f, 'rb') as p:
        pdb = dload(p)
    pdb1 = pdbname(pdb, 1)
    pdb2 = pdbname(pdb, 2)
    pdb_pair = pdb1[1] + sub(".*_", "_", pdb2[1])
    dmasif_train.append(pdb_pair)

# same for the validation pairs
dmasif_val = []
for f in val:
    with open(f, 'rb') as p:
        pdb = dload(p)
    pdb1 = pdbname(pdb, 1)
    pdb2 = pdbname(pdb, 2)
    pdb_pair = pdb1[1] + sub(".*_", "_", pdb2[1])
    dmasif_val.append(pdb_pair)

##### duplicate pairs still need to be removed; one minimal way (keeping the original order) is shown below:
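dmasif_train = list(dict.fromkeys(dmasif_train))   # dict.fromkeys drops repeats while preserving first occurrences
dmasif_val   = list(dict.fromkeys(dmasif_val))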

#with open("pairs_postprocessed_train_dmasif.txt", "w") as f:
#    [f.write(i+"\n") for i in dmasif_train]

#with open("pairs_postprocessed_val_dmasif.txt", "w") as f:
#    [f.write(i+"\n") for i in dmasif_val]

orange2350 commented 1 year ago

@rubenalv Thank you very much, I made it!!! :)