BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

different sample size between TCGA portal and TCGAbiolinks package #605

Open tyasird opened 9 months ago

tyasird commented 9 months ago

I was looking for mutation data in the TCGA data portal and through TCGAbiolinks, and I noticed that the sample sizes are not the same.

For instance, for TCGA-OV the TCGA data portal shows 419 cases, whereas TCGAbiolinks shows 462 samples. The file count is the same in both: 482.

So why are they different?

This is my query in the TCGA data portal:

cases.project.project_id in ["TCGA-OV"] and files.analysis.workflow_type in ["Aliquot Ensemble Somatic Variant Merging and Masking"] and files.data_category in ["Simple Nucleotide Variation"] and files.data_type in ["Masked Somatic Mutation"]

This is the same query in the TCGAbiolinks package:

# Load the packages used below
library(TCGAbiolinks)
library(maftools)

# Query
query <- GDCquery(
  project = "TCGA-OV",
  data.category = "Simple Nucleotide Variation",
  access = "open",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)

# Download and read
GDCdownload(query)
maf <- GDCprepare(query)
mafr <- maftools::read.maf(maf)
mutations <- maftools::getSampleSummary(mafr)  # per-sample variant counts
# Number of samples reported in the maftools summary table
print(as.numeric(mafr@summary[mafr@summary$ID == "Samples"]$summary))

tiagochst commented 9 months ago

Some of the MAF files are empty. Here is one example: https://portal.gdc.cancer.gov/files/73cec020-9d79-4189-8ae7-b6be0c867371

[image: Screenshot 2023-10-12 at 12.35.40 PM.png]

tyasird commented 8 months ago

@tiagochst I still don't understand. Aren't the counts supposed to be higher in the TCGA portal? Why are they higher in the TCGAbiolinks results? Or, to put it another way, how can I get the same sample size as in the TCGA portal?

tiagochst commented 8 months ago

Hi,

Yes, the file count is the same for both: 482. But where does the figure of 462 samples come from?

TCGAbiolinks shows 407 patients, while GDC shows 419 cases. The 407 comes from: unique(substr(maf$Tumor_Sample_Barcode,1,12)) %>% length. The 419 comes from the GDC portal:

[image: Screenshot 2023-10-13 at 10.50.33 AM.png]

The difference should be the cases that have files but no SNVs, i.e. the ones whose MAF files are empty.
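To see how the different numbers can arise, here is a minimal sketch on a toy table rather than the real TCGA-OV data. The barcodes, the `portal_cases` vector, and the `count_levels()` helper are all made up for illustration. It assumes that a "sample" count is the number of distinct full `Tumor_Sample_Barcode` values, while a patient (case) count uses only the first 12 characters of the barcode, as in the `substr()` call above.

```r
# Toy stand-in for the data frame returned by GDCprepare(); barcodes are invented.
maf <- data.frame(
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-01D-0001-09",
    "TCGA-AA-0001-01A-01D-0001-09",  # second mutation, same aliquot
    "TCGA-AA-0001-02A-01D-0002-09",  # second aliquot, same patient
    "TCGA-BB-0002-01A-01D-0003-09"
  ),
  Hugo_Symbol = c("TP53", "BRCA1", "TP53", "TP53"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count distinct aliquot-level barcodes and distinct patients.
count_levels <- function(maf) {
  barcodes <- unique(maf$Tumor_Sample_Barcode)
  c(
    samples  = length(barcodes),
    patients = length(unique(substr(barcodes, 1, 12)))
  )
}

counts <- count_levels(maf)
print(counts)

# Hypothetical list of case IDs as shown by the portal; a case whose MAF file
# is empty contributes no rows to `maf`, so it is missing from the patient count.
portal_cases <- c("TCGA-AA-0001", "TCGA-BB-0002", "TCGA-CC-0003")
missing_cases <- setdiff(portal_cases, substr(maf$Tumor_Sample_Barcode, 1, 12))
print(missing_cases)
```

On the real data, the same `setdiff()` against the case IDs exported from the portal would list the cases that have a file but no mutations in `maf`.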

tyasird commented 8 months ago

@tiagochst

Thanks for your answer. I use maftools for that: the object returned by read.maf contains a summary table. I opened that table, and for TCGA-OV it shows 462 samples. I am sharing the screenshot with you. This is the code I run after the TCGA query:

GDCdownload(query)
maf <- GDCprepare(query)
mafr = maftools::read.maf(maf)
print(as.numeric(mafr@summary[mafr@summary$ID=="Samples"]$summary))

[image: screenshot of the maftools summary table]