Swarbricklab-code / BrCa_cell_atlas

Data processing and analysis related code associated with the study "A single-cell and spatially resolved atlas of human breast cancers".
107 stars 47 forks source link

scSubtype related code,Is it possible to call directly? #16

Open zhangjl-work opened 2 years ago

zhangjl-work commented 2 years ago
hi, I have a batch of breast cancer single cell data, can I directly call the Highest_calls.R script?  

I call this code directly, and use the NatGen_Supplementary_table_S4.csv file, and the results don't seem right, it seems to divide the cells in each sample (after extracting the cancer cells) into the four subtypes, and the dozen or so samples are So, do you know why,Is this result reasonable?  At present, it is not consistent with the clinical results. But I don't know where is the problem。

Looking forward to your reply!

dlroden commented 2 years ago

Hi, thanks for your interest in our work. Could you please provide more detail to allow us to understand your question more clearly? At a minimum, a reproducible example of the input data, the specific code that you are running and the output would be needed to help further. Thanks

zhangjl-work commented 2 years ago

hi,thank you for your reply!

I use your trained file NatGen_Supplementary_table_S4.csv,as input file。 and use my own rds file。the following is my code.

library(Seurat)

read in scsubtype gene signatures

sigdat <- read.csv("NatGen_Supplementary_table_S4.csv") temp_allgenes <- c(as.vector(sigdat[,"Basal_SC"]), as.vector(sigdat[,"Her2E_SC"]), as.vector(sigdat[,"LumA_SC"]), as.vector(sigdat[,"LumB_SC"])) temp_allgenes <- unique(temp_allgenes[!temp_allgenes == ""])

读取数据

allfile <- list.files("/PROJ2/development/zhangjunling/project/breast_cancer/scSubtype/data/", pattern = 'rds') rds_list <- list() for (i in allfile) { print (i) file_path <- paste0("./data/",i) Mydata <- readRDS(file_path)

Read in the single cell RDS object as 'Mydata'

#Mydata <- readRDS("./B_epi_filter_T.rds")
Mydata <- ScaleData(Mydata, features=temp_allgenes)
***@***.******@***.***)
#calculate mean scsubtype scores
outdat <- matrix(0,
                 nrow=ncol(sigdat),
                 ncol=ncol(tocalc),
                 dimnames=list(colnames(sigdat),
                               colnames(tocalc)))
for(i in 1:ncol(sigdat)){
  row <- as.character(sigdat[,i])
  row<-unique(row[row != ""])
  genes<-which(rownames(tocalc) %in% row)
  temp<-apply(tocalc[genes,],2,function(x){mean(as.numeric(x),na.rm=TRUE)})
  outdat[i,]<-as.numeric(temp)
}
final<-outdat[which(rowSums(outdat,na.rm=TRUE)!=0),]
final<-as.data.frame(final)
is.num <- sapply(final, is.numeric);final[is.num] <- lapply(final[is.num], round, 4)
finalm<-as.matrix(final)
##Scaling scores function before calling the highest Call
center_sweep <- function(x, row.w = rep(1, nrow(x))/nrow(x)) {
  get_average <- function(v) sum(v * row.w)/sum(row.w)
  average <- apply(x, 2, get_average)
  sweep(x, 2, average)
}
##Obtaining the highest call
finalmt<-as.data.frame(t(finalm))
#print (head(finalmt))
finalm.sweep.t<-center_sweep(finalmt)
Finalnames<-colnames(finalm.sweep.t)[max.col(finalm.sweep.t,ties.method="first")]
finalm.sweep.t$SCSubtypeCall <- Finalnames
result <- as.data.frame(table(Finalnames))
print (result)

}

Below is my output:

[1] "G_epi_filter_LN1.rds" Centering and scaling data matrix

Finalnames Freq 1 Basal_SC 1184 2 Her2E_SC 1116 3 LumA_SC 1464 4 LumB_SC 1559 [1] "G_epi_filter_LN2.rds" Centering and scaling data matrix

Finalnames Freq 1 Basal_SC 906 2 Her2E_SC 848 3 LumA_SC 1073 4 LumB_SC 1111 [1] "G_epi_filter_T.rds" Centering and scaling data matrix

Finalnames Freq 1 Basal_SC 754 2 Her2E_SC 766 3 LumA_SC 923 4 LumB_SC 936

Seems to split my data equally across the four subtypes,why is that?

2022年6月27日 下午6:54,Daniel Roden @.***> 写道:

Hi, thanks for your interest in our work. Could you please provide more detail to allow us to understand your question more clearly? At a minimum, a reproducible example of the input data, the specific code that you are running and the output would be needed to help further. Thanks

— Reply to this email directly, view it on GitHub https://github.com/Swarbricklab-code/BrCa_cell_atlas/issues/16#issuecomment-1167201758, or unsubscribe https://github.com/notifications/unsubscribe-auth/AP747ND4PRHSX4YJIGDINZDVRGB7HANCNFSM5X6TTC4Q. You are receiving this because you authored the thread.

RegnerM2015 commented 2 years ago

Hi @zhangjl-work,

You actually have to integrate your data with their training data first. Then you apply the scSubtyping script to the rescaled integrated expression data (your data + their training data).

Applying the scSubtyping script directly to your data alone will result in a random or even distribution of subtype calls.

zhangjl-work commented 2 years ago

Thank you very much for your reply, but there is one more thing I don't understand.

Question 1: Do I have to have my own training set? If I don't have my own training set data, can I use the training set data from your article?

Question 2: What is the file format of the training set? The format of the TNBCmerged_Training_SwarbrickInferCNV.txt file, can you take a few lines to see it?

Question 3: What does the SinglecellMolecularSubtypesignaturesonlytumor_SEPTEMBER2019.tx file in Calculatingscoresandplotting.R do? Can you take a few lines and take a look?

Looking forward to your reply, thanks!

2022年6月28日 上午1:08,Matt Regner @.***> 写道:

Hi @zhangjl-work https://github.com/zhangjl-work,

You actually have to integrate your data with their training data first. Then you apply the scSubtyping script to the rescaled integrated expression data (your data + their training data).

Applying the scSubtyping script directly to your data alone will result in a random or even distribution of subtype calls.

— Reply to this email directly, view it on GitHub https://github.com/Swarbricklab-code/BrCa_cell_atlas/issues/16#issuecomment-1167585655, or unsubscribe https://github.com/notifications/unsubscribe-auth/AP747NCKZO4XTUB4CTWGPATVRHNXXANCNFSM5X6TTC4Q. You are receiving this because you were mentioned.

RegnerM2015 commented 2 years ago

Hi @zhangjl-work,

Question 1: Yes, use the training set data from the article.

Question 2: The format of the training set should be four Seurat objects (saved as RDS files). There should be one rds file for each subtype (LumA, LumB, Her2, and Basal). I am not familiar with the file TNBCmerged_Training_SwarbrickInferCNV.txt, perhaps @dlroden can clarify.

Question 3: I think SinglecellMolecularSubtypesignaturesonlytumor_SEPTEMBER2019.txt holds the gene signatures for each molecular subtype and Calculatingscoresandplotting.R computes the signature enrichment scores for these gene signatures.

I hope @dlroden and @sunnyzwu can clarify these.

zhangjl-work commented 2 years ago

Thank you very much for your reply, I will communicate with the person you recommended! Thanks again!

shufanzhang commented 2 years ago

Hi @zhangjl-work, Is your problem solved? I also want to know the format of the input data for scSubtype and where to get the training set file.