chemosim-lab / ProLIF

Interaction Fingerprints for protein-ligand complexes and more
https://prolif.readthedocs.io
Apache License 2.0
337 stars 66 forks source link

Get dataframe with specific protein interactors (atoms instead of residues) as column headers #180

Closed jparcon closed 4 months ago

jparcon commented 6 months ago

Is there a direct way of getting the dataframe from the interaction fingerprint with the protein atomic interactors as headers?

When trying to define "important" interactions (let's say those repeated for many ligands), a column corresponding to a specific Thr for which a ligand shows "HB donor" could mean very different things if the interactor is the carbonyl from the backbone or the hydroxyl from the side chain.

cbouy commented 6 months ago

Hi,

Not a direct way but this script should give what you seem to be looking for:

import pandas as pd

protein_mol = plf.Molecule.from_mda(protein_selection)

data = []
index = []
for i, frame_ifp in fp.ifp.items():
    index.append(i)
    frame_data = {}
    for (ligres, protres), residue_ifp in frame_ifp.items():
        for int_name, metadata_tuple in residue_ifp.items():
            for metadata in metadata_tuple:
                # extract atom name using atom index in metadata and protein mol
                atoms = ",".join(
                    [
                        protein_mol.GetAtomWithIdx(x).GetMonomerInfo().GetName()
                        for x in metadata["parent_indices"]["protein"]
                    ]
                )
                frame_data[(str(ligres), str(protres), int_name, atoms)] = 1
    data.append(frame_data)

df = pd.DataFrame(data, index=pd.Index(index, name="Frame"))
df.columns = pd.MultiIndex.from_tuples(
    df.columns, names=["ligand", "protein", "interaction", "atoms"]
)
df = df.fillna(0).astype(int)
df

It gives the dataframe below, where atoms is the new header level containing the protein atoms involved in the corresponding interaction

ligand           LIG1.G                         ...                         
protein        TRP125.A   VAL200.A    THR209.A  ...     TYR38.A   VAL200.A   
interaction Hydrophobic VdWContact Hydrophobic  ... Hydrophobic VdWContact   
atoms               CZ3       HG11         CG2  ...         CD2        HB    
Frame                                           ...                          
0                     1          1           1  ...           0          0   
10                    1          0           0  ...           0          0   
...                 ...        ...         ...  ...         ...        ...   
230                   0          0           0  ...           1          0   
240                   0          0           0  ...           0          1 
jparcon commented 6 months ago

This is so cool, thank you very much!

Now, when calculating the percentage of poses where each interaction is present, e.g. hydrophobic or ionic interactions become repeated for each atomic interactor (something that in principle would be only desirable for hydrogen bonds). So I made a slight change to get an additional data frame with merged interactors for interactions different from H bonds. I'm posting it in case someone finds it useful too.

# get atomic separated dataframe, only for H bonds
data = []
index = []
for i, frame_ifp in fp.ifp.items():
    index.append(i)
    frame_data = {}
    for (ligres, protres), residue_ifp in frame_ifp.items():
        for int_name, metadata_tuple in residue_ifp.items():
            if int_name in ["HBAcceptor", "HBDonor"]:
                for metadata in metadata_tuple:
                    # extract atom name using atom index in metadata and protein mol
                    atoms = ",".join(
                        [
                            protein_mol.GetAtomWithIdx(x).GetMonomerInfo().GetName()
                            for x in metadata["parent_indices"]["protein"]
                        ]
                    )
                    frame_data[(str(protres), int_name, atoms.replace(" ", ""))] = 1
            else:
                frame_data[(str(protres), int_name, "X")] = 1
    data.append(frame_data)

df_merge = pd.DataFrame(data, index=pd.Index(index, name="Pose"))
df_merge.columns = pd.MultiIndex.from_tuples(
    df_merge.columns, names=["protein_res", "lig_interaction", "protein_ats"]
)
df_merge = df_merge.fillna(0).astype(int)
df_merge.sort_index(axis=1, level="protein_res", ascending=True, inplace=True)
df_merge.head(5)