BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Matching Clinical and RNA Samples #525

Open sylvia-science opened 2 years ago

sylvia-science commented 2 years ago

Hello,

I've downloaded the RNA data and I'd like to match the clinical data to it. However, I'm unsure about how to match the datasets using unique IDs. I see that they both have bcr_patient_barcode variables, but there is no overlap between the datasets unless I modify the RNA barcode slightly.

Here is the code I'm using to download the two datasets. At the end, where I print the bcr_patient_barcode variables, you can see that they have different formats. However, I noticed that if I remove the last part after the dash in the RNA bcr_patient_barcode variable, I get almost complete overlap, so I assume this is what I need to be doing. Can someone explain whether this is correct and why this is the case?

Thank you!

```r
library(TCGAbiolinks)

tcgalist <- c("TCGA-BRCA")

###############
# Define query that contains samples of interest, aligned against hg19 (using legacy = TRUE) ===============
query_mRNA.hg19 <- GDCquery(project = tcgalist,
                            data.category = "Gene expression",
                            data.type = "Gene expression quantification",
                            platform = "Illumina HiSeq",
                            file.type = "results",
                            experimental.strategy = "RNA-Seq",
                            sample.type = c("Primary Tumor"),
                            legacy = TRUE)

query_clinical <- GDCquery(project = tcgalist, data.category = "Clinical", file.type = "xml")

# Download all TCGA gene expression samples using query
GDCdownload(query_mRNA.hg19, method = "client")
GDCdownload(query_clinical, method = "client")

# Prepare data
data.hg19.mRNA <- GDCprepare(query_mRNA.hg19, save = FALSE)
data.clinical <- GDCprepare_clinic(query_clinical, clinical.info = "patient")

# Check content
ncol(data.hg19.mRNA) # 1095
nrow(data.clinical) # 1174

data.clinical$bcr_patient_barcode[1:5]
# "TCGA-3C-AAAU" "TCGA-3C-AALI" "TCGA-3C-AALJ" "TCGA-3C-AALK" "TCGA-4H-AAAK"
data.hg19.mRNA$bcr_patient_barcode[1:5]
# "TCGA-A8-A08S-01A" "TCGA-S3-AA11-01A" "TCGA-C8-A1HL-01A" "TCGA-BH-A42T-01A" "TCGA-A8-A09T-01A"
```

tiagochst commented 2 years ago

Information about the TCGA barcode can be found at https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/. This should help you match the data and metadata.
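As a rough illustration of the barcode layout described on that page, the sketch below splits one of the sample-level barcodes from the post into its dash-separated fields using base R only. The variable names (`fields`, `patient_id`, `sample_code`, `vial`) are made up for this example and are not part of TCGAbiolinks.

```r
# Split a TCGA barcode into its dash-separated fields (base R, no packages).
barcode <- "TCGA-A8-A08S-01A"  # sample-level barcode taken from the post above
fields <- strsplit(barcode, "-", fixed = TRUE)[[1]]

# Fields 1-3 are project, tissue source site, and participant.
patient_id <- paste(fields[1:3], collapse = "-")

# Field 4 holds a two-digit sample type code followed by a vial letter.
sample_code <- substr(fields[4], 1, 2)  # "01" is the code for a primary solid tumor
vial <- substr(fields[4], 3, 3)

print(c(patient = patient_id, sample_type = sample_code, vial = vial))
```

The fourth field is the part that has to be dropped to get back to the patient-level identifier used in the clinical table.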

TCGA-3C-AAAU is the patient barcode; TCGA-3C-AAAU-01 is the primary tumor sample from that patient.
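One possible way to do the matching, sketched here with small toy vectors rather than the real `data.hg19.mRNA` and `data.clinical` objects: keep the first 12 characters of each RNA barcode (the patient portion) and look those up in the clinical table. The objects `rna_barcodes`, `clinical`, and `matched` are hypothetical stand-ins, and the values are illustrative only.

```r
# Toy stand-ins for the real objects; values are illustrative only.
rna_barcodes <- c("TCGA-A8-A08S-01A", "TCGA-S3-AA11-01A", "TCGA-3C-AAAU-01A")
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-3C-AAAU", "TCGA-A8-A08S", "TCGA-4H-AAAK"),
  stringsAsFactors = FALSE
)

# The patient barcode (TCGA-XX-XXXX) is the first 12 characters of a longer barcode.
rna_patient <- substr(rna_barcodes, 1, 12)

# Row of `clinical` for each RNA sample; NA where no clinical record exists.
idx <- match(rna_patient, clinical$bcr_patient_barcode)

matched <- data.frame(sample = rna_barcodes,
                      patient = rna_patient,
                      clinical_row = idx,
                      stringsAsFactors = FALSE)
print(matched)
```

Using `match()` keeps the result in the same order as the RNA samples, which is convenient when the clinical columns are later attached to the expression data. Samples with `NA` in `clinical_row` have no clinical record in the table, and a patient may appear more than once if several samples were taken from them.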
