Create match between data_clinical_patient.txt, data_clinical_sample.txt from acc_tcga_pan_can_atlas_2018

Favourj-bit commented 1 year ago

Hi @cannin ,

I am working on the notebook to match these two datasets on the patient ID. However, I noticed that in data_clinical_sample.txt the columns '#Patient Identifier' and 'Sample Identifier' contain the same values. Also, data_clinical_patient.txt has no sample ID column. I wanted to confirm whether I can merge the two datasets on '#Patient Identifier' alone. I have attached my Colab notebook to show my work: https://colab.research.google.com/drive/16irWREQOMOOSuv-aBWLwUERZ8_gqWu8V?usp=sharing

Favourj-bit commented 1 year ago

Another question: I noticed that the values in the OS_months column are floats. I do not really understand why.

inoue0426 commented 1 year ago

@Favourj-bit I'm not sure of the exact reason, but it looks like the code returns values like that. This is the explanation from the website: https://github.com/cBioPortal/cbioportal/blob/9af6061f6a0c46d7fb3b412ca45b38cc0690b14d/portal/src/main/resources/content/web_api.markdown?plain=1#L247

This is the code for survival analysis. https://github.com/cBioPortal/cbioportal/blob/9af6061f6a0c46d7fb3b412ca45b38cc0690b14d/core/src/main/resources/survival_no_plots.txt#L2

This is the library page. https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html

So, I think the month can also have a decimal point.

Favourj-bit commented 1 year ago

@inoue0426, thanks for this. I will check it out.

cannin commented 1 year ago

@Favourj-bit the float numbers are just fractions of months. 10.0 would be 10 months. 10.5 would be approximately 10 months and 15 days. No need to go too deeply into the cBioPortal code and no need to learn how to do survival analyses at this point. As for samples and patients: multiple samples can be taken from a single patient, but the patient has one single overall survival time.
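The arithmetic can be sketched as below. This is only an illustration: the helper name split_months and the 30-day month are assumptions, not something taken from cBioPortal.

```python
def split_months(os_months, days_per_month=30):
    """Split a fractional month count into (whole months, approximate days).

    Assumption: a month is treated as 30 days, which is only an approximation.
    """
    whole = int(os_months)
    days = round((os_months - whole) * days_per_month)
    return whole, days


print(split_months(10.0))  # (10, 0)
print(split_months(10.5))  # (10, 15)
```

For modelling, the float value can be used directly; the split is only there to make the unit easier to read.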

Favourj-bit commented 1 year ago

@cannin, thanks for the explanation. I want to merge the two datasets on the identifier column. I think it would be better to merge on the patient identifier column. However, I'm not sure whether I should drop the sample identifier column or merge on both the patient identifier and the sample identifier. I'm a little confused about this.

cannin commented 1 year ago

@Favourj-bit right now the goal is to get a basic working PyG model that can predict (even if very imperfectly). For now, you can merge so that all the samples have an OS_Survival value, even if it is duplicated.
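A minimal sketch of such a merge with pandas, assuming both tables share a patient ID column. The column names PATIENT_ID, SAMPLE_ID and OS_MONTHS and the toy rows are hypothetical placeholders; use whatever headers your loaded files actually have.

```python
import pandas as pd

# Hypothetical toy rows standing in for the two clinical files.
patients = pd.DataFrame({
    "PATIENT_ID": ["P1", "P2"],
    "OS_MONTHS": [10.5, 42.0],
})
samples = pd.DataFrame({
    "PATIENT_ID": ["P1", "P1", "P2"],
    "SAMPLE_ID": ["P1-01", "P1-02", "P2-01"],
})

# A left merge keeps every sample row and copies the patient's survival
# value onto it, so patients with several samples get duplicated values.
# validate="many_to_one" raises if a patient appears twice in `patients`.
merged = samples.merge(patients, on="PATIENT_ID", how="left",
                       validate="many_to_one")
print(merged)
```

Keeping the sample identifier in the merged table does no harm, and it may be what links each row to the expression data later.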

Favourj-bit commented 1 year ago

@cannin, could I drop columns in which all the values are missing? Two examples below. This one is from data_clinical_patient.txt:

Favourj-bit commented 1 year ago

I have also merged the data. However, I'm a bit confused about how to bring data_mrna_seq_v2_rsem.txt into the merged dataset.

Another question from our email conversations: you mentioned that the pathway commons data have identifiers that match the network data from Pathway Commons. However, in the file you specified for me to use, I can't tell which identifier I'm supposed to use. Here it is:

inoue0426 commented 1 year ago

@Favourj-bit

I think you can add the cBioPortal data, like NUDT18, 85.0144, 165.884, 309.2880, 178.931, 124.066, to the target node attributes.

inoue0426 commented 1 year ago

@Favourj-bit This is the procedure to train and evaluate the graph and model (screenshot attached, 2023-06-01).

To create the graph, the code is as below.

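import torch
from torch_geometric.data import Data

# Two undirected edges (0-1 and 1-2), each stored as a pair of directed edges.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
# One feature per node, for three nodes.
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)
print(data)
# Data(edge_index=[2, 4], x=[3, 1])  (field order may vary by PyG version)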

Here x comes from the cBioPortal patient data and edge_index comes from Pathway Commons.
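A rough sketch, in plain Python, of how gene symbols could be mapped to node indices so that the two sources line up. The interaction pairs and most of the expression values below are made-up placeholders, not real Pathway Commons or cBioPortal records.

```python
# Hypothetical inputs: gene-gene interactions and one expression value per gene.
interactions = [("TP53", "MDM2"), ("MDM2", "NUDT18")]
expression = {"TP53": 12.3, "MDM2": 4.5, "NUDT18": 85.0144}

# Give every gene a stable integer node index.
genes = sorted({gene for pair in interactions for gene in pair})
node_index = {gene: i for i, gene in enumerate(genes)}

# Store each undirected interaction as two directed edges,
# matching the layout of edge_index in the example above.
sources, targets = [], []
for a, b in interactions:
    sources += [node_index[a], node_index[b]]
    targets += [node_index[b], node_index[a]]
edge_index = [sources, targets]

# One feature row per node, in node-index order.
# Assumption: genes without an expression value get 0.0.
x = [[expression.get(gene, 0.0)] for gene in genes]

print(node_index)  # {'MDM2': 0, 'NUDT18': 1, 'TP53': 2}
print(edge_index)  # [[2, 0, 0, 1], [0, 2, 1, 0]]
```

These are plain lists; passing them through torch.tensor with dtype=torch.long for edge_index and dtype=torch.float for x should give the inputs that Data expects.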

For the train/test split, you can use sklearn. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

For running the model, the basic flow is as below:
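A small sketch of the split, assuming it is done on patient IDs rather than on sample rows, so that duplicated samples from one patient do not end up on both sides. The IDs are placeholders.

```python
from sklearn.model_selection import train_test_split

# Hypothetical patient identifiers.
patient_ids = [f"P{i}" for i in range(10)]

# Hold out 20% of the patients; random_state makes the split repeatable.
train_ids, test_ids = train_test_split(patient_ids, test_size=0.2,
                                       random_state=42)
print(len(train_ids), len(test_ids))  # 8 2
```

The resulting ID lists can then be used to select the matching rows of the merged table.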

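import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# Assumption: `dataset` was not defined in this snippet. The Cora dataset from
# the PyG introduction is used as a stand-in; replace it with your own dataset.
dataset = Planetoid(root='/tmp/Cora', name='Cora')

class GCN(torch.nn.Module):
    """Two-layer graph convolutional network for node classification."""

    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        # Dropout is only active while the model is in training mode.
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Train on the nodes selected by train_mask.
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate accuracy on the nodes selected by test_mask.
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')
# Example output: Accuracy: 0.8150 (the exact value varies between runs)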

Favourj-bit commented 1 year ago

@inoue0426 I have been able to create the graph structure from Pathway Commons. I was hoping you could take a look before I proceed with using it to build a model. I want to be certain I have the right results: https://colab.research.google.com/drive/1jY7Y_M6hU84jPtDDO-gRuamTzTxuEK6A?usp=sharing

Favourj-bit commented 1 year ago

@cannin this is the final result from matching the data: https://colab.research.google.com/drive/16irWREQOMOOSuv-aBWLwUERZ8_gqWu8V?usp=sharing

I want to confirm that I'm on the right path.

inoue0426 commented 1 year ago

Looks good to me!

cannin / gsoc_2023_pytorch_pathway_commons, issue #6