Bin-Chen-Lab / spider

MIT License

overlap between seen protein and unseen protein names & NA in max_inferred_coef of confidence file #4

Closed LiuCanidk closed 1 week ago

LiuCanidk commented 1 month ago

I found something weird in the prediction result of SPIDER.

    library(data.table)
    library(tibble)
    seen=fread('all_seen_proteins_predicted.csv', nThread = 4)
    seen=column_to_rownames(seen, 'V1')
    seen[1:5,1:5]
    unseen=fread('all_unseen_proteins_predicted.csv', nThread = 4)
    unseen=column_to_rownames(unseen, 'V1')
    unseen[1:5,1:5]
    intersect(colnames(seen), colnames(unseen))

The output

     [1] "CD14"   "CD163"  "CD226"  "CD24"   "CD274"  "CD33"   "CD40"   "CD4"    "CD5"
    [10] "CD81"   "CD9"    "KLRG1"  "CD44"   "EGFR"   "CD58"   "CD59"   "CD177"  "CD55"
    [19] "CD22"   "CD63"   "CD36"   "MERTK"  "GABRB3" "CD109"  "CD164"  "CD200"  "CCR10"
    [28] "CD68"   "CD47"

How should I handle the seen and unseen protein data when they overlap? Should I keep one copy, or average the two? I did not find any relevant guidance in the downstream tutorial.

Also, when I filter the unseen proteins by the confidence coefficient provided by SPIDER, I find some NA values in the confidence file. I can of course just remove them, but I am wondering why they appear when the SPIDER prediction ran without any error.

    confidence=read.csv('confidence_score_all_unseen_proteins.csv', row.names = 1)
    sum(is.na(confidence$max_inferred_coef))

[1] 9

Any discussion would be greatly appreciated!

Ruoqiao2020 commented 1 month ago

For the duplicated protein names, you can remove these proteins from the unseen protein data file, and just use their data from the seen protein data file.
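
A minimal sketch of that suggestion in base R. It uses small toy data frames in place of the two prediction files (cells in rows, proteins in columns, as in the snippet in the question); the toy values and the names `dup`, `unseen_only`, and `combined` are illustrative assumptions, not SPIDER output or part of its API.

```r
# Toy stand-ins for all_seen_proteins_predicted.csv and
# all_unseen_proteins_predicted.csv (rows = cells, columns = proteins).
seen <- data.frame(CD14 = c(0.1, 0.2), CD4 = c(1.0, 1.1), CD3 = c(0.5, 0.6),
                   row.names = c("cell1", "cell2"))
unseen <- data.frame(CD14 = c(0.3, 0.4), CD4 = c(0.9, 1.2), CD99 = c(2.0, 2.1),
                     row.names = c("cell1", "cell2"))

# Proteins that appear in both prediction tables.
dup <- intersect(colnames(seen), colnames(unseen))

# Drop the duplicated proteins from the unseen predictions only.
unseen_only <- unseen[, !(colnames(unseen) %in% dup), drop = FALSE]

# Combine the two tables, aligning cells by row name
# (assumes both files contain the same cells).
combined <- cbind(seen, unseen_only[rownames(seen), , drop = FALSE])
print(colnames(combined))
```

With real data, `seen` and `unseen` would be the tables loaded in the question; the overlapping proteins then keep their values from the seen-protein file.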

About the NA values: the most likely reason is that some genes in your transcriptomic data contain all-zero or NA values, in which case certain calculation steps in SPIDER may produce NA results. You can remove the NA predictions before downstream analysis.
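
One way to act on this, sketched in base R with toy data: drop the proteins whose `max_inferred_coef` is NA, and check the input expression matrix for genes that are all zero or contain NA. The toy values, the matrix `rna`, and its orientation (cells in rows, genes in columns) are assumptions for illustration; adapt them to however your transcriptomic data is stored.

```r
# Toy stand-in for confidence_score_all_unseen_proteins.csv.
confidence <- data.frame(max_inferred_coef = c(0.82, NA, 0.41),
                         row.names = c("CD99", "CD7", "CD2"))

# Keep only proteins with a usable confidence value.
confidence_ok <- confidence[!is.na(confidence$max_inferred_coef), , drop = FALSE]

# Toy RNA matrix (rows = cells, columns = genes); orientation is an assumption.
rna <- matrix(c(1, 0, 2,
                0, 0, NA,
                3, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("cell", 1:3),
                              c("GENE_A", "GENE_B", "GENE_C")))

# Flag genes with no non-zero values, or with any NA values.
all_zero <- colSums(rna != 0, na.rm = TRUE) == 0
has_na <- colSums(is.na(rna)) > 0
suspect_genes <- names(which(all_zero | has_na))
print(suspect_genes)
```

If `suspect_genes` is non-empty for your own data, that would be consistent with the explanation above, though it does not prove those genes are what produced the NA values.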

LiuCanidk commented 1 month ago

Thanks for your reply!