BorgwardtLab / proteinshake

Protein structure datasets for machine learning.
https://proteinshake.ai
BSD 3-Clause "New" or "Revised" License
99 stars, 8 forks

Structure Similarity Task training #273

Open ahariri13 opened 4 months ago

ahariri13 commented 4 months ago

Hello! I'm still new to learning on proteins, and I was wondering how to train on the Structure Similarity Task efficiently when using the graph format for PyTorch Geometric.

To load the data, I am using the following lines:

from proteinshake import tasks as ps_tasks  # assumed import behind the ps_tasks alias
from torch.utils.data import Subset
from torch_geometric.loader import DataLoader

# Load the task and the dataset
datapath = './data/ec'
task = ps_tasks.StructureSimilarityTask(root=datapath)
dset = task.dataset

# Convert the protein 3D structures to epsilon-graphs (epsilon = 8 here)
def transform(data):
    # Keep only the protein ID as the label so that PyG can batch it
    data, protein_dict = data
    data.y = protein_dict['protein']['ID']
    return data

dset2 = dset.to_graph(eps=8.0).pyg(transform=transform)

batch_size = args.batch_size
train_loader = DataLoader(Subset(dset2, task.train_index), batch_size=batch_size, shuffle=True, num_workers=0)
val_loader = DataLoader(Subset(dset2, task.val_index), batch_size=batch_size, shuffle=False, num_workers=0)
test_loader = DataLoader(Subset(dset2, task.test_index), batch_size=batch_size, shuffle=False, num_workers=0)

My understanding is that we need to take two graph (protein) samples, embed them, and predict a regression value for their similarity. The PyG DataLoader batches all of the dictionaries together, so I decided to batch only the protein ID, and I removed the ['protein']['ID'] part from the task's target function in structure_similarity.py. As a result, my model looks as follows:

    def forward(self, batch):
        # `batch` holds two PyG batches, one per side of the protein pairs.
        # Embed each side separately with the same (shared) layers.
        pooled = []
        for sample in batch:
            x = sample.x
            edge_index = sample.edge_index

            x = self.x_embedding(x)
            x = self.conv1(x, edge_index)
            x = F.leaky_relu(x)
            x = self.bano1(x)
            # x = F.dropout(x, training=self.training, p=0.2)

            x = self.conv2(x, edge_index)
            x = F.leaky_relu(x)
            x = self.bano2(x)
            # x = F.dropout(x, training=self.training, p=0.2)

            x = self.conv3(x, edge_index)
            x = F.relu(x)
            x = self.bano3(x)
            # x = F.dropout(x, training=self.training, p=0.2)

            x = self.conv4(x, edge_index)

            # Sum-pool the node embeddings into one vector per graph.
            pooled.append(global_add_pool(x, sample.batch))

        # Combine the two sides by summation and regress the similarity.
        s1, s2 = pooled
        final = self.mlpRep(s1 + s2)
        return final
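The forward pass above embeds both proteins of every pair, so a protein that appears in many pairs is embedded many times. At evaluation time, when no gradients are needed, each protein could instead be embedded once and reused. The following is a minimal sketch of that idea in plain Python, not ProteinShake or PyG code: `score_pairs`, `toy_embed`, and `toy_head` are hypothetical stand-ins for the encoder and the MLP head, and the sketch assumes an embedding does not depend on which other graphs share its batch (for example, batch norm in eval mode).

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def score_pairs(
    pairs: List[Tuple[str, str]],
    embed: Callable[[str], Vector],
    head: Callable[[Vector], float],
) -> List[float]:
    """Score each pair, embedding every distinct protein only once."""
    cache: Dict[str, Vector] = {}

    def cached_embed(pid: str) -> Vector:
        # Compute the embedding on first use, then reuse it.
        if pid not in cache:
            cache[pid] = embed(pid)
        return cache[pid]

    results = []
    for a, b in pairs:
        ea, eb = cached_embed(a), cached_embed(b)
        # Same combination as the forward pass above: element-wise sum.
        results.append(head([x + y for x, y in zip(ea, eb)]))
    return results


# Hypothetical stand-ins for the graph encoder and the MLP head.
calls: List[str] = []


def toy_embed(pid: str) -> Vector:
    calls.append(pid)  # record how often the "encoder" runs
    return [float(len(pid)), 1.0]


def toy_head(v: Vector) -> float:
    return sum(v)


pairs = [("A", "BB"), ("A", "CCC"), ("BB", "CCC")]
scores = score_pairs(pairs, toy_embed, toy_head)
print(scores, len(calls))
```

In this toy run the encoder is called three times for three pairs instead of six. The saving grows with the number of pairs per protein, but the trick does not carry over directly to training, where gradients have to flow through the encoder.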

The evaluation function is below; there I have to use a for loop to collect the ground-truth similarity values.

@torch.no_grad()
def eval_epoch(model, loader):
    model.eval()

    y_true = []
    y_pred = []

    # Iterate over the loader passed in, not the global val_loader
    for batch in loader:
        size = len(batch[0].y)
        batch[0] = batch[0].to(device)
        batch[1] = batch[1].to(device)

        y_hat = model(batch)

        # Look up the ground-truth similarity for each pair of protein IDs
        truths = [task.targetBatch(batch[0].y[g], batch[1].y[g]) for g in range(size)]
        y_pred.append(y_hat)
        y_true.append(torch.tensor(truths, dtype=torch.float))

    y_true = torch.hstack(y_true).cpu().numpy()
    y_pred = torch.vstack(y_pred).cpu().numpy()
    scores = task.evaluate(y_true, y_pred)
    return scores
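If the per-pair target call in the loop above turns out to be a bottleneck, one option is to index the similarity values once and read them with a plain dictionary lookup per batch. The following is a minimal sketch in plain Python, not ProteinShake's API: `build_pair_table`, `lookup_targets`, and the toy scores are hypothetical, and the sketch assumes that the similarity is symmetric and keyed by protein ID.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def build_pair_table(
    scores: Iterable[Tuple[str, str, float]],
) -> Dict[FrozenSet[str], float]:
    """Index similarity scores by unordered ID pair (assumes symmetry)."""
    return {frozenset((a, b)): s for a, b, s in scores}


def lookup_targets(
    table: Dict[FrozenSet[str], float],
    ids_a: List[str],
    ids_b: List[str],
) -> List[float]:
    """Return one target per (ids_a[i], ids_b[i]) pair, in batch order."""
    return [table[frozenset((a, b))] for a, b in zip(ids_a, ids_b)]


# Toy scores, made up for illustration only.
toy_scores = [("P1", "P2", 0.8), ("P1", "P3", 0.1), ("P2", "P3", 0.4)]
table = build_pair_table(toy_scores)

# The order inside a pair does not matter because the keys are frozensets.
targets = lookup_targets(table, ["P2", "P3"], ["P1", "P2"])
print(targets)
```

The table is built once, outside the evaluation loop, so each batch only costs one dictionary read per pair.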

Of course, training is taking too long this way, and I would appreciate any tips on how to use the ProteinShake package more efficiently for this task. Thanks a lot in advance!

cgoliver commented 4 months ago

Dear @ahariri13, thank you for contacting us! Glad to hear you are using the tool.

I am very busy for the next week and haven't had a chance to look closely at your code, but I know @claying has run some experiments on this dataset and would have some suggestions.

Best, Carlos