BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Need TPM values for TCGA RNA-seq data. Can TCGAbiolinks help me? #307

Open modarzi opened 5 years ago

modarzi commented 5 years ago

I have downloaded SARC RNA-seq data with the HTSeq - Counts workflow type through TCGAbiolinks, but I need the SARC RNA-seq data as Transcripts Per Million (TPM) values. Can TCGAbiolinks download TCGA RNA-seq data as TPM values? Or does the package have a function for converting a data set with TPM values to FPKM or FPKM-UQ? I would appreciate it if anybody could share their comments. Best regards

y1zhou commented 5 years ago

You can transform FPKM values to TPM values using this:

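# Convert a vector of FPKM values (one sample) to TPM.
# Computed in log space; equivalent to x / sum(x) * 1e6.
FPKMtoTPM <- function(x) {
  return(exp(log(x) - log(sum(x)) + log(1e6)))
}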
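
As a quick sanity check, here is a minimal sketch showing that the log-space expression gives the same result as the plain arithmetic form `x / sum(x) * 1e6`, and that the resulting TPM values for one sample sum to one million. The FPKM values and gene names are made up for illustration; they are not taken from TCGA.

```r
FPKMtoTPM <- function(x) {
  exp(log(x) - log(sum(x)) + log(1e6))
}

# Made-up FPKM values for a single sample (illustration only)
fpkm <- c(geneA = 10, geneB = 0, geneC = 30, geneD = 60)

tpm <- FPKMtoTPM(fpkm)

# Same quantity written without logs
tpm_direct <- fpkm / sum(fpkm) * 1e6

# Should agree up to floating-point tolerance
all.equal(tpm, tpm_direct)

# TPM values of one sample add up to 1e6 (up to rounding)
sum(tpm)
```

Note that a zero FPKM stays zero (`log(0)` is `-Inf` and `exp(-Inf)` is `0`), but a single `NA` in the vector would make `sum(x)` and therefore every output value `NA`.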

I use it like this:

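library(dplyr)    # provides %>% and mutate_if()
library(stringr)  # provides str_glue()

# `proj` must already hold the project name used in the file name
df <- data.table::fread(
    str_glue("/path/to/TCGA/FPKM/{proj}.FPKM.csv")
  ) %>%
    mutate_if(is.numeric, FPKMtoTPM)  # rescale every numeric (sample) column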

In my case, the CSV file is a matrix with Ensembl genes as rows and samples (patients) as columns.
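
For readers who would rather not depend on dplyr, the same per-column conversion can be sketched in base R. The table below is a made-up stand-in for the genes-by-samples layout described above; the gene IDs, sample names, and values are hypothetical.

```r
FPKMtoTPM <- function(x) {
  exp(log(x) - log(sum(x)) + log(1e6))
}

# Hypothetical FPKM table: one row per gene, one column per sample
fpkm <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  sample1 = c(5, 15, 30),
  sample2 = c(2, 0, 8)
)

# Convert only the numeric (sample) columns; leave the gene IDs untouched
num_cols <- vapply(fpkm, is.numeric, logical(1))
tpm <- fpkm
tpm[num_cols] <- lapply(fpkm[num_cols], FPKMtoTPM)

# Each sample column should now sum to 1e6 (up to rounding)
colSums(tpm[num_cols])
```

The conversion is applied per sample (per column), because the normalising constant `sum(x)` has to be computed over all genes within one sample.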

modarzi commented 5 years ago

@y1zhou Hi, thank you for your solution. I have two questions:
  1. Before using FPKM or FPKM-UQ data, I generally transform my data with log2(mydata + 1). Should I also apply log2() to the output of your function (df)?
  2. I want to use this function as one of the pre-processing steps in my analysis. I would appreciate it if you could share an academic reference (paper) for this function (FPKMtoTPM).

Best regards,

y1zhou commented 5 years ago
  1. I guess you can still perform a log transformation, but you'd lose the consistent column sum (1e6) for the TPMs.
  2. The paper was mentioned in this blog post, which is also where I found the code snippet.
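
To illustrate the first point, here is a small sketch with made-up TPM values for two samples. Both columns sum to 1e6 before the `log2(x + 1)` transformation, but their sums differ afterwards, so the log-transformed values no longer share a common total.

```r
# Made-up TPM values: rows are genes, columns are samples (illustration only)
tpm <- cbind(
  sample1 = c(1e5, 3e5, 6e5),
  sample2 = c(5e5, 5e5, 0)
)

# The log2(x + 1) transformation mentioned in the question
log_tpm <- log2(tpm + 1)

colSums(tpm)      # both columns sum to 1e6
colSums(log_tpm)  # the sums now differ between samples
```

Whether that matters depends on the downstream analysis; the sketch only shows that the constant-sum property does not survive the transformation.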