andygxzeng / AMLHierarchies

Analysis notebooks and data for AML hierarchies paper

TPM normalization before CIBERSORTx deconvolution #2

Closed · albertop210 closed this issue 1 year ago

albertop210 commented 1 year ago

Hi Andy, we tried to run CIBERSORTx to deconvolute our bulk RNA-seq data. In your vignette you suggest using a TPM-normalized matrix; how did you perform the normalization on your counts to obtain TPM? Thanks in advance,

Alberto

andygxzeng commented 1 year ago

Hi Alberto,

I've uploaded files that I use for hg38 transcript lengths and hg19 transcript lengths

Once you have those, you can adapt some of this code to correct the count data for median transcript length and normalize each sample to 1,000,000.
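
In formula terms (just restating what the code below does), for gene $g$ in sample $j$ with counts $c_{g,j}$ and median transcript length $\ell_g$ in kilobases:

$$\mathrm{TPM}_{g,j} = 10^{6} \times \frac{c_{g,j}/\ell_g}{\sum_i c_{i,j}/\ell_i}$$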

## Load required packages
library(data.table)   # for fread()
library(dplyr)        # for the pipe and data wrangling verbs

## Load transcript lengths for hg38 and get the median length per gene
transcripts <- fread('GRCh38_transcript_lengths.txt')
colnames(transcripts) <- c("ensg", "Gene", "transcript", "start", "stop")
transcript_lengths <- transcripts %>% 
  dplyr::mutate(len = stop - start) %>% 
  dplyr::group_by(Gene) %>% 
  dplyr::summarise(median_len = median(len))

## Function to collapse duplicate gene names
collapse_genes <- function(data){
  dat_new <- data %>% dplyr::group_by(Gene) %>% dplyr::summarise_all(sum) %>% dplyr::ungroup()
  return(dat_new)
}

## Convert counts to TPM
# requires the gene symbol column to be named "Gene"
# requires GRCh38 gene symbols
count_to_TPM <- function(data, transcript_lengths){
  dat_merged <- data %>% dplyr::inner_join(., transcript_lengths, by = 'Gene') %>% dplyr::select('Gene', 'median_len', everything())

  Gene <- dat_merged$Gene

  # Normalize counts by median transcript length in kb then divide by colSums and multiply by 1,000,000
  dat_TPM <- cbind(Gene, 
        (select(dat_merged, -c('Gene', 'median_len')) / (dat_merged$median_len/1000)) %>% 
          sweep(., 2, colSums(.), '/') * 1000000)

  return(dat_TPM)
}
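
For instance, on a tiny made-up table (gene names, counts, and lengths below are placeholders purely for illustration), the call would look like this:

## Toy illustration with made-up numbers (not real data)
toy_counts  <- data.frame(Gene = c("TP53", "GAPDH"), S1 = c(100, 300), S2 = c(50, 150))
toy_lengths <- data.frame(Gene = c("TP53", "GAPDH"), median_len = c(2000, 1500))
count_to_TPM(toy_counts, toy_lengths)   # numeric columns of the result each sum to 1,000,000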

After this you can load in your dataframe with genes as rows and samples as columns; Gene should be the first column.

AMLdata_TPM <- AMLdata %>% collapse_genes() %>% filter_genes() %>% count_to_TPM(transcript_lengths)

AMLdata_TPM %>% 
  write.table(., file = 'AMLdata_TPM.txt', sep = '\t', row.names = FALSE)
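
(One note: `filter_genes()` isn't defined in the snippet above; a minimal stand-in, assuming it should simply drop genes with zero counts in every sample, could look like the sketch below, or you can skip that step.)

## Minimal stand-in for filter_genes(): drop genes with no counts in any sample
filter_genes <- function(data){
  keep <- rowSums(dplyr::select(data, -Gene)) > 0
  data[keep, ]
}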

Hope that helps!

Andy

albertop210 commented 1 year ago

Thanks a lot for your fast and kind response. In my code I had used featureCounts' transcript lengths, since my analysis pipeline was based on Rsubread; that TPM normalization produced slightly different results, and so did the subsequent CIBERSORTx deconvolution. Using your function seems to produce noticeably better results after the deconvolution process and more reliable cell fraction values. Thanks again for your help, Alberto
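
For context, the version based on featureCounts gene lengths was roughly along these lines (a sketch, assuming `fc` holds the list returned by Rsubread::featureCounts(); not my exact pipeline code):

## Sketch: TPM from featureCounts output (fc$counts is the gene x sample count matrix,
## fc$annotation$Length the featureCounts gene lengths)
rate <- fc$counts / (fc$annotation$Length / 1000)   # counts per kilobase
tpm  <- sweep(rate, 2, colSums(rate), '/') * 1e6    # rescale each sample to sum to 1,000,000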