Closed albertop210 closed 1 year ago
Hi Alberto,
I've uploaded files that I use for hg38 transcript lengths and hg19 transcript lengths
When you have that you can adapt some of this code to correct count data for median transcript length and normalize to 1,000,000 per sample
## Load transcript lengths for hg38 and get median length
transcripts <- fread('GRCh38_transcript_lengths.txt')
colnames(transcripts) <- c("ensg", "Gene", "transcript", "start", "stop")
transcript_lengths <- transcripts %>%
dplyr::mutate(len = stop - start) %>%
dplyr::group_by(Gene) %>%
dplyr::summarise(median_len = median(len))
## Function to collapse duplicate gene names
collapse_genes <- function(data){
dat_new <- data %>% dplyr::group_by(Gene) %>% dplyr::summarise_all(sum) %>% dplyr::ungroup()
return(dat_new)
}
## Convert counts to TPM
# require gene symbol column to be named "Gene"
# require GrCh38 gene symbols
count_to_TPM <- function(data, transcript_lengths){
dat_merged <- data %>% dplyr::inner_join(., transcript_lengths, by = 'Gene') %>% select('Gene', 'median_len', everything())
Gene <- dat_merged$Gene
# Normalize counts by median transcript length in kb then divide by colSums and multiply by 1,000,000
dat_TPM <- cbind(Gene,
(select(dat_merged, -c('Gene', 'median_len')) / (dat_merged$median_len/1000)) %>%
sweep(., 2, colSums(.), '/') * 1000000)
return(dat_TPM)
}
After this you can load in your dataframe with genes as rows and samples as column; Gene is the first column
AMLdata %>% collapse_genes() %>% filter_genes() %>% count_to_TPM(transcript_lengths)
AMLdata %>%
write.table(., file = 'AMLdata_TPM.txt', sep = '\t', row.names = FALSE)
hope that helps! Andy
Thanks a lot for your fast and kind response; in my code I used featureCounts' transcript lengths, since my analysis pipeline was based on Rsubread. TPM normalization produced slightly different results and thus did CIBERSORTx deconvolution. Using your function seems to produce quite better results after the deconvolution process and more reliable cell fractions values. Thanks again for your help, Alberto
Hi Andy, we tried to run CIBERSORTx to deconvolute our bulk RNAseq data. In your vignette you suggest to use TPM-normalized matrix; how did you performed the normalization on your counts to obtain TPM? Thanks in advance,
Alberto