bzhanglab / COSMO

COSMO: COrrection of Sample Mislabeling by Omics

Interpretation of results #3

Open pacificma opened 2 years ago

pacificma commented 2 years ago

Dear Authors,

I am running COSMO on data sets with protein and RNA data. In the final results table, the clinical entry for one sample is '-1'. How should this be interpreted? Does it mean that no clinical profile matched this sample? I would guess this happens more often as the number of clinical categories included in the analysis increases. Also, what criterion does the method use to produce this summary in the clinical column?

Best, Weiping

pacificma commented 2 years ago

For example, suppose we consider only gender as the clinical predictor (binary).

1. If only one sample has clinical '-1', it means the estimated gender differs from the labeled gender for that sample, while all the other samples look fine.

2. If multiple samples have clinical '-1', it means their gender estimates differ from the labels, but the algorithm cannot infer from the combination of clinical and omics data whether they were swapped?

3. If the clinical entry for a sample shows a different sample number, it means the omics data and the clinical data were swapped at the same time?

Am I correct about these scenarios?

Another question: if we want to check more than two omics data types, I guess we can only use two at a time and run COSMO multiple times. Is there a systematic way to aggregate the results across multiple omics data types?

pacificma commented 2 years ago

I have another question. Suppose we see results like those illustrated in the demo:

sample | Clinical | Data1 | Data2
-- | -- | -- | --
Testing_8 | 8 | 8 | 8
Testing_9 | 9 | 8 | 9

For the mismatched sample in the second row, does that mean Data1 of the 8th sample matches Data2 and clinical of the 9th sample, or that Data1 of the 9th sample matches the other data of the 8th sample? My take is the latter. Is that correct?

soonjye commented 2 years ago

> For example, suppose we consider only gender as the clinical predictor (binary).
>
> 1. If only one sample has clinical '-1', it means the estimated gender differs from the labeled gender for that sample, while all the other samples look fine.
>
> 2. If multiple samples have clinical '-1', it means their gender estimates differ from the labels, but the algorithm cannot infer from the combination of clinical and omics data whether they were swapped?
>
> 3. If the clinical entry for a sample shows a different sample number, it means the omics data and the clinical data were swapped at the same time?
>
> Am I correct about these scenarios?
>
> Another question: if we want to check more than two omics data types, I guess we can only use two at a time and run COSMO multiple times. Is there a systematic way to aggregate the results across multiple omics data types?

Yes, you are correct on all three scenarios. Right now, COSMO can run on only two omics data types at a time, and it does not aggregate results from different pairs of omics data.
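As a reading aid, the three confirmed scenarios can be written as a small check over one row of the final results table. This is only a sketch of the interpretation discussed in this thread, not COSMO code: `classify_row` is a hypothetical helper, the row containing `-1` is made up for illustration, and the table values are assumed to be sample indices.

```python
def classify_row(sample_idx, clinical, data1, data2):
    """Describe one final-table row using the interpretation from this thread.

    Hypothetical helper; COSMO does not provide this function.
    """
    notes = []
    if clinical == -1:
        # Scenarios 1 and 2: predicted attributes disagree with the label,
        # and no other sample was identified as a swap partner.
        notes.append("clinical: predicted attributes differ from the label; no swap partner inferred")
    elif clinical != sample_idx:
        # Scenario 3: the clinical record matches a different sample.
        notes.append(f"clinical: record labeled {sample_idx} matches sample {clinical}")
    for name, matched in (("Data1", data1), ("Data2", data2)):
        if matched != sample_idx:
            notes.append(f"{name}: profile labeled {sample_idx} matches sample {matched}")
    return notes or ["consistent"]


rows = [
    (8, 8, 8, 8),      # Testing_8 from the demo table
    (9, 9, 8, 9),      # Testing_9 from the demo table
    (10, -1, 10, 10),  # hypothetical row with an unresolved clinical mismatch
]
for row in rows:
    print(row[0], classify_row(*row))
```

Since COSMO does not aggregate across omics pairs, a helper like this would have to be applied separately to the final table of each pairwise run.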

soonjye commented 2 years ago

> I have another question. Suppose we see results like those illustrated in the demo:
>
> sample | Clinical | Data1 | Data2
> -- | -- | -- | --
> Testing_8 | 8 | 8 | 8
> Testing_9 | 9 | 8 | 9
>
> For the mismatched sample in the second row, does that mean Data1 of the 8th sample matches Data2 and clinical of the 9th sample, or that Data1 of the 9th sample matches the other data of the 8th sample? My take is the latter. Is that correct?

The latter is correct. Looking at the table, it could be a duplication, where Data1 of sample 8 was duplicated and replaced Data1 of sample 9.
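One way to spot this pattern in a larger final table is to look for a matched index that appears under more than one sample label in the same column. The following is a minimal sketch under that reading; `find_shared_matches` is a hypothetical helper, and the rows are the two demo rows quoted above.

```python
from collections import defaultdict


def find_shared_matches(rows, column):
    """Return {matched_index: [sample labels]} for indices assigned to several samples.

    Hypothetical helper for reading a final results table; not part of COSMO.
    """
    groups = defaultdict(list)
    for row in rows:
        if row[column] == -1:
            continue  # -1 marks an unresolved clinical mismatch, not a match
        groups[row[column]].append(row["sample"])
    return {matched: labels for matched, labels in groups.items() if len(labels) > 1}


demo = [
    {"sample": "Testing_8", "Clinical": 8, "Data1": 8, "Data2": 8},
    {"sample": "Testing_9", "Clinical": 9, "Data1": 8, "Data2": 9},
]

shared = find_shared_matches(demo, "Data1")
print(shared)  # {8: ['Testing_8', 'Testing_9']}
```

A shared index is only a hint of duplication. The table alone does not say how the duplicate arose.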

pacificma commented 2 years ago

Thank you!

I believe the omics mislabeling results come from method 1 only. However, when I look at the final results table, the best match is not the same as in the table produced by method 1. Did you apply any additional adjustment to the method 1 results?

Final results table

sample | Clinical | Data1 | Data2
-- | -- | -- | --
#22 | 22 | 22 | 46
#71 | 71 | 25 | 71
#72 | 72 | 72 | 46

Method 1 table

d1 | d1_label | d2 | d2_label | d1rank | d2rank | distance | correlation
-- | -- | -- | -- | -- | -- | -- | --
22 | #22 | 72 | #72 | 69 | 65 | 134 | -0.209155948
25 | #25 | 71 | #71 | 16 | 11 | 27 | 0.162347354
71 | #71 | 25 | #25 | 2 | 1 | 3 | 0.780665347
72 | #72 | 22 | #22 | 1 | 3 | 4 | 0.218780229

soonjye commented 2 years ago

> Thank you!
>
> I believe the omics mislabeling results come from method 1 only. However, when I look at the final results table, the best match is not the same as in the table produced by method 1. Did you apply any additional adjustment to the method 1 results?
>
> Final results table
>
> sample | Clinical | Data1 | Data2
> -- | -- | -- | --
> #22 | 22 | 22 | 46
> #71 | 71 | 25 | 71
> #72 | 72 | 72 | 46
>
> Method 1 table
>
> d1 | d1_label | d2 | d2_label | d1rank | d2rank | distance | correlation
> -- | -- | -- | -- | -- | -- | -- | --
> 22 | #22 | 72 | #72 | 69 | 65 | 134 | -0.209155948
> 25 | #25 | 71 | #71 | 16 | 11 | 27 | 0.162347354
> 71 | #71 | 25 | #25 | 2 | 1 | 3 | 0.780665347
> 72 | #72 | 22 | #22 | 1 | 3 | 4 | 0.218780229

Right. There are two methods in the algorithm, each from a different winning team. The final results table uses predictions from both.
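For readers comparing the two tables, one observation from the four Method 1 rows posted in this thread is that `distance` equals `d1rank + d2rank` in every row. Whether COSMO defines the column that way is an assumption, not something confirmed here. The sketch below checks that pattern and sorts the posted rows by `distance` to show which pairings Method 1 alone scores as closest.

```python
# Rows copied from the Method 1 table posted above:
# (d1, d2, d1rank, d2rank, distance, correlation)
method1_rows = [
    (22, 72, 69, 65, 134, -0.209155948),
    (25, 71, 16, 11, 27, 0.162347354),
    (71, 25, 2, 1, 3, 0.780665347),
    (72, 22, 1, 3, 4, 0.218780229),
]

# Observation only: in these rows the distance column is the sum of both ranks.
rank_sum_holds = all(
    d1rank + d2rank == distance
    for _, _, d1rank, d2rank, distance, _ in method1_rows
)
print("distance == d1rank + d2rank in all posted rows:", rank_sum_holds)

# Smaller distance means both profiles rank each other highly.
ranked = sorted(method1_rows, key=lambda row: row[4])
for d1, d2, _, _, distance, correlation in ranked:
    print(f"d1 #{d1} vs d2 #{d2}: distance={distance}, correlation={correlation:.3f}")
```

Because the final results table also draws on the second method, its matches need not follow this Method 1 ordering.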