zeavin-ferguson opened this issue 4 years ago (status: Open)
I found the list of ImSig genes in the paper. NK cells are not represented in my brain datasets, and only 4 of the NK cell markers are in my blood datasets. The rest of the cell types are fairly well represented in my datasets, but the coverage varies by cell type.
I also noticed that I get a different error if I restrict the number of rows. The error I get if I only use 500 genes is: Error in fastCor(t(exp)) : invalid nSplit: 0
Hi @zeavin-ferguson, you should also be able to access the signature by typing sig after loading the package.
Again, the error "Error in fastCor(t(exp)) : invalid nSplit: 0" is probably due to either poor overlap between the signature genes and your dataset or duplicate gene names in the dataset.
I should have just accounted for these when I made the package. Unfortunately, I don't have the bandwidth to do it now :(
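Both suspected causes can be checked before calling the package. The following is a minimal sketch, assuming the expression data has gene symbols as row names and that sig has columns named gene and cell (an assumption; the column names in your copy may differ). The toy objects and gene names are made up for illustration and are not the real ImSig gene list.

```r
# Toy stand-ins (hypothetical data, for illustration only)
exp <- matrix(1:12, nrow = 4,
              dimnames = list(c("g1", "g3", "g7", "g8"),
                              c("s1", "s2", "s3")))
sig <- data.frame(gene = c("g1", "g2", "g3", "g4"),
                  cell = c("T cells", "T cells", "B cells", "NK cells"),
                  stringsAsFactors = FALSE)

# 1. Duplicate gene names in the expression data (empty if there are none)
dup_genes <- unique(rownames(exp)[duplicated(rownames(exp))])

# 2. Overlap with the signature, counted per cell type
sig$present <- sig$gene %in% rownames(exp)
found <- tapply(sig$present, sig$cell, sum)
total <- tapply(sig$present, sig$cell, length)
overlap <- data.frame(cell  = names(total),
                      found = as.vector(found),
                      total = as.vector(total))
print(dup_genes)
print(overlap)
```

A cell type whose found count is 0 or 1 is a likely source of trouble in any per-cell-type calculation.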
Hi @ajitjohnson, that makes sense then that the error "Error in fastCor(t(exp)) : invalid nSplit: 0" happens when I restrict the rows of my dataset.
But I checked the overlap and it is pretty good for the whole dataset. However, when I use the whole dataset I am getting the error: Error in exp[as.character(g), ] : incorrect number of dimensions
Any idea what that one is about? I attached my dataset to the first message.
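One common way to hit this message in R, not confirmed as the cause in this thread: the object being indexed has already lost its second dimension. Subsetting a matrix down to a single row or column returns a plain vector unless drop = FALSE is given, and indexing that vector with [rows, ] then fails. A small sketch with made-up data:

```r
# Toy matrix (hypothetical data): 2 genes x 3 samples
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))

# Selecting a single row drops the dimensions: the result is a named vector
one_row <- m["g1", ]
is.null(dim(one_row))

# Indexing that vector as if it were two-dimensional raises the error
msg <- tryCatch(one_row["s1", ], error = function(e) conditionMessage(e))
print(msg)

# drop = FALSE keeps the result as a 1 x 3 matrix
kept <- m["g1", , drop = FALSE]
dim(kept)
```

If this is what is happening, checking dim(exp) just before the failing call would show NULL instead of two numbers.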
Hi @zeavin-ferguson I just took a look at your data.
The expression set should not be scaled data. For correlation to work appropriately, it needs to be in natural scale without log transformation (e.g. FPKM/TPM).
Although this might not be the issue that you are facing.
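A quick sanity check along these lines is sketched below. It assumes the data were transformed as log2(x + 1), which is a common convention but only an assumption here; if a different base or pseudocount was used, the back-transformation must be changed to match. Z-scored (scaled) data cannot be recovered this way, so the original FPKM/TPM table would be needed instead. The helper names and the toy values are hypothetical.

```r
# Hypothetical helper: negative values suggest centred/scaled data
looks_scaled <- function(x) any(x < 0, na.rm = TRUE)

# Hypothetical helper: undo log2(x + 1), assuming that was the transformation
unlog2 <- function(x) 2^x - 1

# Toy natural-scale values (made up for illustration)
tpm <- matrix(c(0, 3, 15,
                1, 7, 31), nrow = 2, byrow = TRUE,
              dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
logged <- log2(tpm + 1)
scaled <- t(scale(t(logged)))   # row-wise z-scores

looks_scaled(logged)            # FALSE: log2(x + 1) of non-negative data
looks_scaled(scaled)            # TRUE: z-scores are centred on zero
all.equal(unlog2(logged), tpm)  # TRUE: back-transformation recovers the input
```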
Alternatively, you could also simply look at the mean/median expression of all the signature genes without the correlation step.
Here is what you could do.
# Mean Expression Function
imsig <- function(exp, sig){
  # Keep only the genes shared between the expression data and the signature
  exp <- exp[row.names(exp) %in% sig$gene, , drop = FALSE]
  sig <- sig[sig$gene %in% row.names(exp), ]
  # Loop to calculate the average expression of each cell type
  cc <- data.frame(matrix(nrow = ncol(exp)))
  cc <- cc[, -1]  # empty data frame with one row per sample
  for (i in unique(sig$cell)){
    s <- sig[sig$cell %in% i, ]
    # drop = FALSE keeps a one-gene subset two-dimensional for colMeans
    e <- exp[as.character(s$gene), , drop = FALSE]
    e_avg <- data.frame(colMeans(e, na.rm = TRUE))
    colnames(e_avg) <- i
    cc <- cbind(cc, e_avg)
  }
  return(cc)
}

# Plotting Function: one bar plot per cell type, arranged in a grid
plot_abundance <- function(proportion){
  library(ggplot2)
  library(gridExtra)
  cell <- proportion
  cell$samples <- row.names(cell)
  # Fix the factor levels so samples keep their original order on the x axis
  cell$samples <- factor(cell$samples, levels = cell$samples)
  plots <- lapply(1:(ncol(cell) - 1), function(x)
    ggplot(cell, aes(x = samples, y = cell[, x])) +
      geom_bar(stat = "identity") + theme_classic() +
      theme(axis.title.x = element_blank(),
            axis.text.x = element_text(angle = 90, hjust = 1),
            axis.title.y = element_blank()) +
      ggtitle(colnames(cell)[x]))
  do.call(grid.arrange, plots)
}

# Load the dataset
exp <- read.table('exp_data_for_imsig.txt', header = TRUE, row.names = 1, sep = '\t')
# sig is available after loading the package; the functions above expect
# its first two columns to be named gene and cell
colnames(sig)[1:2] <- c("gene", "cell")

# Run the function
proportion <- imsig(exp, sig)
plot_abundance(proportion)
I also noticed that you have two cell types with only one gene in the signature, so I recommend removing them like this:
sig <- sig[!sig$cell %in% c('NK cells', 'Plasma cells'),]
sig <- droplevels(sig)
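Rather than hard-coding the two names, the cell types to drop can be derived from the data. This is a minimal sketch, assuming sig has columns gene and cell and that the expression data has gene symbols as row names. The threshold min_genes is a hypothetical parameter, not something the package defines, and the toy signature below is made up rather than taken from ImSig.

```r
# Toy signature and gene list (hypothetical, for illustration only)
sig <- data.frame(gene = c("g1", "g2", "g3", "g4", "g5", "g6"),
                  cell = c("T cells", "T cells", "B cells", "B cells",
                           "NK cells", "Plasma cells"),
                  stringsAsFactors = FALSE)
exp_genes <- c("g1", "g2", "g3", "g4", "g5", "g6", "g9")  # stands in for row.names(exp)

min_genes <- 2  # hypothetical cutoff: minimum signature genes per cell type

# Count the signature genes actually present in the data, per cell type
sig_present <- sig[sig$gene %in% exp_genes, ]
n_per_cell <- table(sig_present$cell)
sparse_cells <- names(n_per_cell)[n_per_cell < min_genes]

# Remove the sparsely represented cell types and any unused factor levels
sig_filtered <- droplevels(sig_present[!sig_present$cell %in% sparse_cells, ])
print(sparse_cells)
print(sig_filtered)
```

Counting after the overlap step matters: a cell type can have many genes in the full signature but only one that is present in a given dataset.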
Thank you @ajitjohnson!! The functions work great - I just had to change V1 and V2 for sig to gene and cell, respectively. I will also try using FPKM or TPM normalized data and using the correlation step to see if that was the issue. I appreciate your help!
I am trying to run imsig on some mouse gene expression datasets I have. I get this error:
Error in exp[as.character(g), ] : incorrect number of dimensions
Although these are mouse datasets, the gene names are in HGNC format. I ran them through the multi-symbol checker here: https://www.genenames.org/tools/multi-symbol-checker/ to ensure that my genes are represented in the HGNC database, and 88.6% match approved symbols. There are no duplicate gene names and no missing values. Maybe I do not have enough overlap with the ImSig genes? If so, how can I check the overlap between the expression data and ImSig? I do not see where the signature genes are stored. I attached the expression dataset I am trying to run imsig on in case that is helpful: exp_data_for_imsig.txt