BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

a way to merge mutation data to gene expression data? #591

Closed 14109022 closed 12 months ago

14109022 commented 1 year ago

Hi,

I'm really new to RNA-seq/Bioinformatics/TCGAbiolinks so I do apologise if this seems like a silly question! I am trying to analyse gene expression data of the TCGA-LUAD project, and more specifically trying to see gene expression differences in WT-TP53 and Mutant-TP53 TCGA-LUAD patients.

My initial approach was to get the Transcriptome Profiling RNA-seq data and also the Simple Nucleotide Variation data, and use one of the IDs to match up patients with and without TP53 mutations. However, I have not been able to find any common identifiers between the objects.

Is there an alternative method for what I am trying to do, or is this not possible at all?

Thank you so much in advance

tiagochst commented 12 months ago

Hi,

You should use the sample information (first 16 characters of the TCGA barcode) to match the two data sets. An example is shown below. Just check which types of mutation you want to consider.

library(TCGAbiolinks)

# 1. Somatic mutation calls (open-access MAF) for TCGA-LUAD
query_maf <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Simple Nucleotide Variation",
  access = "open",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query_maf)
maf <- GDCprepare(query_maf)

# Keep the TP53 mutations and check which mutation types are present
mutations_tp53 <- maf |> dplyr::filter(Hugo_Symbol == "TP53")
table(mutations_tp53$VARIANT_CLASS)
table(mutations_tp53$IMPACT)

# The MAF uses full aliquot barcodes; the first 16 characters identify the sample
head(maf$Tumor_Sample_Barcode)

# 2. Gene expression (STAR counts) for the same project
query_exp <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)
GDCdownload(query = query_exp, files.per.chunk = 30)
data <- GDCprepare(query = query_exp)

# 3. Add mutation information to the SummarizedExperiment (SE):
#    TRUE if the sample has at least one TP53 mutation in the MAF
data$mutation_tp53 <- data$sample %in% substr(mutations_tp53$Tumor_Sample_Barcode, 1, 16)
table(data$mutation_tp53)
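
To make the matching step easier to follow without downloading anything, here is a minimal sketch on toy data. The barcodes and the `variant_class` vector are made up for illustration, and the variable names (`maf_barcodes`, `expr_samples`, `tp53_status`) are hypothetical. It assumes you want to ignore silent mutations, which is only one possible choice; adapt the filter to the mutation types you care about.

```r
# Toy barcodes, invented for illustration (not real TCGA records).
# MAF files carry full aliquot-level barcodes (28 characters).
maf_barcodes <- c(
  "TCGA-AA-0001-01A-11D-A000-08",
  "TCGA-AA-0002-01A-21D-A000-08"
)

# Expression data is labelled at the sample level (16 characters).
expr_samples <- c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A", "TCGA-AA-0003-01A")

# Hypothetical annotation, one entry per mutation, mimicking the
# Variant_Classification column of a MAF file.
variant_class <- c("Missense_Mutation", "Silent")

# Assumption: silent mutations are not counted as "mutant".
keep <- variant_class != "Silent"

# Truncate the aliquot barcodes to the 16-character sample identifier.
mutated_samples <- unique(substr(maf_barcodes[keep], 1, 16))

# Label every expression sample by membership in the mutated set.
tp53_status <- ifelse(expr_samples %in% mutated_samples, "Mutant", "WT")
names(tp53_status) <- expr_samples
print(tp53_status)
```

Note that with this approach any sample that does not appear in the mutation table is labelled WT, including samples that may simply have no mutation calls available, so it can be worth checking which expression samples are present in the MAF at all.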

14109022 commented 12 months ago

This way makes much more sense. Thank you so much, I really really appreciate it!