Open sylvia-science opened 2 years ago
The information about the TCGA barcode can be found at https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/. This should help you match the data and metadata.
TCGA-3C-AAAU is the patient barcode; TCGA-3C-AAAU-01 is the primary tumor sample from patient TCGA-3C-AAAU.
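To make the barcode layout concrete, here is a minimal base R sketch that splits a sample-level barcode into its leading fields. The helper name `split_tcga_barcode` is hypothetical (it is not part of TCGAbiolinks), and the code assumes the barcodes follow the dash-separated layout described on the GDC page linked above.

```r
# Hypothetical helper (not a TCGAbiolinks function): split TCGA barcodes
# into their leading fields.
# Assumed layout: TCGA-<tissue source site>-<participant>-<sample type><vial>-...
split_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)
  data.frame(
    barcode = barcode,
    # The first three fields together identify the patient
    patient = vapply(parts, function(p) paste(p[1:3], collapse = "-"), character(1)),
    # The first two characters of the fourth field are the sample type code
    sample_type = vapply(parts, function(p) substr(p[4], 1, 2), character(1)),
    # The third character of the fourth field is the vial letter
    vial = vapply(parts, function(p) substr(p[4], 3, 3), character(1)),
    stringsAsFactors = FALSE
  )
}

rna_barcodes <- c("TCGA-A8-A08S-01A", "TCGA-S3-AA11-01A")
parsed <- split_tcga_barcode(rna_barcodes)
print(parsed)
```

Sample type code `01` corresponds to a primary solid tumor, which is why every barcode returned by a `sample.type = "Primary Tumor"` query carries a `-01` suffix followed by a vial letter.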
On Wed, Jul 6, 2022 at 11:30 AM Sylvia @.***> wrote:
Hello,
I've downloaded the RNA data and I'd like to match the clinical data to it. However, I'm unsure about how to match the datasets using unique IDs. I see that they both have bcr_patient_barcode variables, but there is no overlap between the datasets unless I modify the RNA barcode slightly.
Here is the code I'm using to download the two datasets. At the end, where I print the bcr_patient_barcode variables, you can see that they have different formats. However, I noticed that if I remove the last part after the final dash in the RNA bcr_patient_barcode variable, I get almost complete overlap, so I assume this is what I need to do. Can someone explain whether this is correct, and why?
Thank you!
```r
library(TCGAbiolinks)

tcgalist <- c("TCGA-BRCA")

###############
# Define query that contains samples of interest, aligned against hg19
# (using legacy = TRUE)
###############
query_mRNA.hg19 <- GDCquery(project = tcgalist,
                            data.category = "Gene expression",
                            data.type = "Gene expression quantification",
                            platform = "Illumina HiSeq",
                            file.type = "results",
                            experimental.strategy = "RNA-Seq",
                            sample.type = c("Primary Tumor"),
                            legacy = TRUE)

query_clinical <- GDCquery(project = tcgalist,
                           data.category = "Clinical",
                           file.type = "xml")

# Download all TCGA gene expression samples using query
GDCdownload(query_mRNA.hg19, method = "client")
GDCdownload(query_clinical, method = "client")

# Prepare data
data.hg19.mRNA <- GDCprepare(query_mRNA.hg19, save = FALSE)
data.clinical <- GDCprepare_clinic(query_clinical, clinical.info = "patient")

# Check content
ncol(data.hg19.mRNA) # 1095
nrow(data.clinical)  # 1174

data.clinical$bcr_patient_barcode[1:5]
# "TCGA-3C-AAAU" "TCGA-3C-AALI" "TCGA-3C-AALJ" "TCGA-3C-AALK" "TCGA-4H-AAAK"
data.hg19.mRNA$bcr_patient_barcode[1:5]
# "TCGA-A8-A08S-01A" "TCGA-S3-AA11-01A" "TCGA-C8-A1HL-01A" "TCGA-BH-A42T-01A" "TCGA-A8-A09T-01A"
```
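The trimming step described in the question could look roughly like the sketch below. It runs on small stand-in vectors rather than the downloaded objects: `rna_barcodes` and `clinical` are illustrative names, and their contents are toy values assembled from barcodes mentioned in this thread, not real query output. With the real data, the same steps would presumably apply to `data.hg19.mRNA$bcr_patient_barcode` and `data.clinical`.

```r
# Toy stand-in data (assumption: not the real download, just barcodes
# recombined from this thread for illustration).
rna_barcodes <- c("TCGA-A8-A08S-01A", "TCGA-S3-AA11-01A", "TCGA-3C-AAAU-01A")
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-3C-AAAU", "TCGA-3C-AALI", "TCGA-A8-A08S"),
  stringsAsFactors = FALSE
)

# The patient barcode is the first 12 characters (TCGA-XX-XXXX);
# everything after that describes the sample, vial, and so on.
rna_patient <- substr(rna_barcodes, 1, 12)

# For each RNA sample, find the row of its patient in the clinical table.
# match() returns NA where no clinical record exists.
idx <- match(rna_patient, clinical$bcr_patient_barcode)
matched_clinical <- clinical[idx, , drop = FALSE]

# RNA samples that have no clinical record
unmatched <- rna_barcodes[is.na(idx)]
print(unmatched)
```

Because several samples can belong to the same patient, matching from samples to patients (as above) repeats clinical rows where needed. Note that `match()` only returns the first hit, so checking `anyDuplicated(clinical$bcr_patient_barcode)` beforehand is a cheap sanity check.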