Closed Favourj-bit closed 1 year ago
Another question, I noticed that the values in OS_months column are in float datatype. I do not really understand this.
@Favourj-bit I'm not sure of the exact reason, but it looks like the code returns values of this kind. This is the explanation from the website: https://github.com/cBioPortal/cbioportal/blob/9af6061f6a0c46d7fb3b412ca45b38cc0690b14d/portal/src/main/resources/content/web_api.markdown?plain=1#L247
This is the code for survival analysis. https://github.com/cBioPortal/cbioportal/blob/9af6061f6a0c46d7fb3b412ca45b38cc0690b14d/core/src/main/resources/survival_no_plots.txt#L2
This is the library page. https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html
So, I think the month can also have a decimal point.
@inoue0426 , thanks for this. I will check it out
@Favourj-bit the float numbers are just fractions of months. 10.0 would be 10 months. 10.5 would be approximately 10 months and 15 days. No need to go too deeply into the cBioPortal code and no need to learn how to do survival analyses at this point. As for samples and patients: multiple samples can be taken from a single patient, but the patient has one single overall survival time.
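As a small illustration of the arithmetic above, here is a minimal sketch that splits a fractional month value into whole months and approximate days. The function name `split_months` and the average month length of 30.44 days (365.25 / 12) are assumptions made for this example, not something taken from the cBioPortal code.

```python
def split_months(os_months, days_per_month=30.44):
    """Split a fractional month count into (whole months, approximate days).

    days_per_month is an assumed average month length (365.25 / 12).
    """
    whole = int(os_months)
    days = round((os_months - whole) * days_per_month)
    return whole, days


print(split_months(10.0))  # (10, 0)
print(split_months(10.5))  # (10, 15)
```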
@cannin, thanks for the explanation. I want to merge the two datasets on the identifier column. I think it would be better to merge on the patient identifier column. However, I'm not sure whether I should drop the sample identifier column or merge using both the patient identifier and the sample identifier. I'm a little confused about this.
@Favourj-bit right now the goal is to get a basic working PyG model that can predict (even if very imperfectly). For now, you can merge so that all the samples have an OS_Survival value, even if it is duplicated.
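A minimal sketch of that kind of merge with pandas, assuming hypothetical column names (`PATIENT_ID`, `SAMPLE_ID`, `OS_MONTHS`) and made-up values; the real cBioPortal files may use different headers. Each sample row picks up its patient's survival value, so the value repeats when a patient has several samples.

```python
import pandas as pd

# Hypothetical patient-level table: one row per patient.
patients = pd.DataFrame({
    "PATIENT_ID": ["P1", "P2"],
    "OS_MONTHS": [10.5, 24.0],
})

# Hypothetical sample-level table: a patient may have several samples.
samples = pd.DataFrame({
    "PATIENT_ID": ["P1", "P1", "P2"],
    "SAMPLE_ID": ["P1-01", "P1-02", "P2-01"],
})

# Left merge keeps every sample; validate checks that patients are unique.
merged = samples.merge(patients, on="PATIENT_ID", how="left",
                       validate="many_to_one")
print(merged)
```

Keeping the sample identifier column in the merged table makes it possible to join sample-level data later.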
@cannin, could I drop columns in which all the values are missing? Two examples below: This one is from the data_clinical_patient.txt
I have also merged the data. However, I'm a bit confused about how to bring the data_mrna_seq_v2_rsem.txt into the merged dataset.
Another question I have from our email conversations: you mentioned that the pathway commons data have identifiers that match the network data from Pathway Commons. However, in the one you specified for me to use, I can't work out which identifier I'm supposed to use. Here it is:
@Favourj-bit
I think you can add the cBio data, like NUDT18, 85.0144, 165.884, 309.2880, 178.931, 124.066, to the target node attributes.
@Favourj-bit This is the procedure to train and evaluate the graph and model.
To create the graph, the code is as below.
import torch
from torch_geometric.data import Data

# Each column of edge_index is one directed edge (source, target).
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
# One feature per node, three nodes.
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)
>>> Data(edge_index=[2, 4], x=[3, 1])
The x is from cBioPortal's patient data and the edge_index is from Pathway Commons.
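A minimal sketch of how gene-level interactions and expression values could be turned into that layout, in plain Python. The gene pairs and expression numbers below are made up for illustration and stand in for the Pathway Commons and cBioPortal data; the resulting lists could then be passed to `torch.tensor` as in the snippet above.

```python
# Hypothetical interaction pairs (gene symbols), standing in for network edges.
interactions = [("TP53", "MDM2"), ("MDM2", "MDM4"), ("TP53", "ATM")]

# Hypothetical expression values for one sample, standing in for mRNA data.
expression = {"TP53": 85.0, "MDM2": 165.9, "MDM4": 309.3, "ATM": 178.9}

# Give every gene a stable integer node index.
genes = sorted({gene for pair in interactions for gene in pair})
node_index = {gene: i for i, gene in enumerate(genes)}

# Store each undirected interaction as two directed edges.
sources, targets = [], []
for a, b in interactions:
    sources += [node_index[a], node_index[b]]
    targets += [node_index[b], node_index[a]]
edge_index = [sources, targets]

# One feature (the expression value) per node, ordered by node index.
x = [[expression[gene]] for gene in genes]

print(node_index)
print(edge_index)
print(x)
```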
For the train test split, you can use sklearn. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
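A minimal sketch of such a split with scikit-learn, assuming a hypothetical list of sample identifiers; the 80/20 ratio and the fixed `random_state` are choices made for this example only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical sample identifiers standing in for the merged dataset rows.
sample_ids = [f"S{i}" for i in range(10)]

# Hold out 20% of the samples for testing; fix the seed for reproducibility.
train_ids, test_ids = train_test_split(sample_ids, test_size=0.2,
                                       random_state=42)

print(len(train_ids), len(test_ids))  # 8 2
```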
For running the model, the basic flow is as below:
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# `dataset` was not defined in the snippet; Cora, as used in the PyG
# introduction, is assumed here.
dataset = Planetoid(root='/tmp/Cora', name='Cora')

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Two graph convolution layers: features -> 16 hidden -> classes.
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Train on the nodes selected by train_mask.
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate on the nodes selected by test_mask.
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')
>>> Accuracy: 0.8150
@inoue0426 I have been able to create the graph structure from the Pathway Commons data. I was hoping you could give this a look before I proceed with using it to build a model. I want to be certain I have the right results: https://colab.research.google.com/drive/1jY7Y_M6hU84jPtDDO-gRuamTzTxuEK6A?usp=sharing
@cannin this is the final result obtained from matching the data: https://colab.research.google.com/drive/16irWREQOMOOSuv-aBWLwUERZ8_gqWu8V?usp=sharing
I want to confirm that I'm on the right path.
Looks good to me!
Hi @cannin ,
I am working on the notebook to match these two datasets based on the patient_id. However, I noticed that in the data_clinical_sample.txt data, the columns '#Patient Identifier' and 'Sample Identifier' contain the same values. Also, there is no sample ID column in the data_clinical_patient.txt data. I wanted to confirm whether I can proceed with merging these two datasets on '#Patient Identifier' alone. I have attached my Colab notebook to show my work: https://colab.research.google.com/drive/16irWREQOMOOSuv-aBWLwUERZ8_gqWu8V?usp=sharing