BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Samples Missing from MAF Import #113

Closed by DarioS 7 years ago

DarioS commented 7 years ago

A small number of samples are missing. For example,

> mutations <- GDCquery_Maf("SKCM", pipelines = "muse")
> grep("TCGA-EE-A2GK", unique(unlist(mutations[, "Tumor_Sample_Barcode"])))
integer(0)

On the GDC Portal homepage, pasting TCGA-EE-A2GK into the search box shows some files, one of which is a MuSE-processed mutation file, so this sample does have such files available for download.

There are some warnings during the data preparation stage, which might give a hint about the cause.

Warning messages:
1: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two. 
2: In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 1)

tiagochst commented 7 years ago

If you open the MuSE MAF file, you cannot find TCGA-EE-A2GK in it.

[screenshot, 2017-07-07 at 11:05:31 am]

But you can find it in the mutect2 MAF file.

[screenshot, 2017-07-07 at 11:08:08 am]

I'm not sure how GDC maps the MAF files to patients, but the file should not appear in the list of files for that patient. I think it is best to email the GDC support team.

DarioS commented 7 years ago

Enquiries with the Genomic Data Commons revealed that the samples are missing because of MuSE filtering criteria; the omission is intentional and will not be changed.
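
For anyone who wants to list which patients are absent from one pipeline's MAF but present in another, here is a minimal sketch in base R. It uses two toy tables in place of real downloads: the barcodes other than TCGA-EE-A2GK are made up, the sample/aliquot suffixes are placeholders, and the helper name patient_ids is hypothetical. With real data, the two tables would presumably come from two GDCquery_Maf calls with different pipelines values.

```r
# Toy stand-ins for two MAF tables. Only the Tumor_Sample_Barcode column matters here.
# All barcode suffixes, and the TCGA-AA-* patients, are invented for illustration.
muse <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-AA-0001-01A-11D-A000-08",
                           "TCGA-AA-0002-06A-11D-A000-08"),
  stringsAsFactors = FALSE
)
mutect2 <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-AA-0001-01A-11D-A000-08",
                           "TCGA-AA-0002-06A-11D-A000-08",
                           "TCGA-EE-A2GK-06A-11D-A000-08"),
  stringsAsFactors = FALSE
)

# The first 12 characters of a TCGA barcode identify the patient
# (project, tissue source site, participant).
patient_ids <- function(maf) {
  unique(substr(maf$Tumor_Sample_Barcode, 1, 12))
}

# Patients with at least one mutect2 record but no MuSE record.
missing_in_muse <- setdiff(patient_ids(mutect2), patient_ids(muse))
print(missing_in_muse)
```

Comparing on the 12-character patient prefix rather than the full barcode avoids false mismatches when the aliquot portion differs between files. A patient that shows up in this list is not necessarily a download error; as noted above, it can be the result of intentional MuSE filtering.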