BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Duplicated cases when querying TCGA-CHOL mutation data #520

Open Dan-H-Jacobson opened 2 years ago

Dan-H-Jacobson commented 2 years ago

Hi there,

I want to extract some mutation data and have been following this documentation: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/new. However, even with the TCGA-CHOL example from the documentation, GDCquery() reports multiple files per case, so GDCprepare(query) does not work. When I looked closer, the case names seemed to be missing.

library(TCGAbiolinks)
packageVersion('TCGAbiolinks')
[1] ‘2.25.0’

Here is the code and the output from GDCquery():

query <- GDCquery(
   project = "TCGA-CHOL", 
   data.category = "Simple Nucleotide Variation", 
   access = "open", 
   legacy = FALSE, 
   data.type = "Masked Somatic Mutation", 
   workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking")
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-CHOL
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------

And just to compare cases from the query results:


table(getResults(query)$cases)

51 

Is there any way of resolving this? Thanks in advance!

tiagochst commented 2 years ago

It is working for me: https://rpubs.com/tiagochst/TCGAbiolinks_issue_520. The "duplicated cases" message is just a warning, since both the matched tumor and normal samples were used to produce each MAF file.
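
For anyone who wants to check whether the warning points to a real problem, here is a minimal base-R sketch. It runs on a toy data frame standing in for getResults(query). The barcodes and file names are made up, and the assumption that the cases column holds comma-separated tumor and normal aliquot barcodes may not hold for every TCGAbiolinks version.

```r
# Toy stand-in for getResults(query); barcodes and file names are hypothetical.
# Assumption: `cases` holds comma-separated aliquot barcodes (tumor,normal).
results <- data.frame(
  file_name = c("a.maf.gz", "b.maf.gz", "c.maf.gz"),
  cases = c(
    "TCGA-XX-0001-01A-11D-0000-09,TCGA-XX-0001-10A-01D-0000-09",
    "TCGA-XX-0002-01A-11D-0000-09,TCGA-XX-0002-11A-11D-0000-09",
    ""
  ),
  stringsAsFactors = FALSE
)

# Rows whose case field is empty or NA (the symptom reported in this issue).
missing_cases <- is.na(results$cases) | results$cases == ""

# Number of barcodes listed per file; 2 would mean a tumor/normal pair.
barcodes <- strsplit(results$cases, ",", fixed = TRUE)
n_barcodes <- lengths(barcodes)

# Patient-level ID: the first 12 characters of the first barcode of each file.
patient <- vapply(
  barcodes,
  function(b) if (length(b) > 0) substr(b[1], 1, 12) else NA_character_,
  character(1)
)

# Patients that have more than one file, ignoring rows without a barcode.
dup_patients <- unique(patient[duplicated(patient) & !is.na(patient)])

print(data.frame(file_name = results$file_name, patient, n_barcodes, missing_cases))
print(dup_patients)
```

With a real query, replacing the toy data frame with results <- getResults(query) should give the same kind of summary. If no patient has more than one file and no case field is empty, the warning is probably harmless, and GDCdownload(query) followed by GDCprepare(query) would be the usual next step.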