Open tyasird opened 9 months ago
Some of the MAF files are empty. Here is one example: https://portal.gdc.cancer.gov/files/73cec020-9d79-4189-8ae7-b6be0c867371
[image: Screenshot 2023-10-12 at 12.35.40 PM.png]
@tiagochst I still don't understand. Aren't the counts supposed to be higher in the TCGA portal? Why are they higher in the TCGAbiolinks results? Or, to put the question another way, how can I get the same sample size as in the TCGA portal?
Hi,
Yes, the file count is the same for both: 482. Could you tell me where the 462 samples come from?
TCGAbiolinks shows 407 patients while GDC shows 419 cases. The 407 comes from:
unique(substr(maf$Tumor_Sample_Barcode,1,12)) %>% length
and the 419 comes from the GDC portal:
[image: Screenshot 2023-10-13 at 10.50.33 AM.png]
The difference should be the cases that have files but no SNVs.
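One way to check this is to compare the case IDs listed on the portal with the patient IDs present in the MAF. The following is a minimal base R sketch, assuming the portal case list has been exported into a character vector. `portal_cases` and `maf_barcodes` are made-up toy values, not real TCGA-OV data.

```r
# Toy inputs (hypothetical): case IDs as listed by the portal, and
# Tumor_Sample_Barcode values as they would appear in the prepared MAF.
portal_cases <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003", "TCGA-AA-0004")
maf_barcodes <- c(
  "TCGA-AA-0001-01A-01D-0001-09",
  "TCGA-AA-0002-01A-01D-0002-09",
  "TCGA-AA-0002-01A-01W-0003-09"
)

# Patient (case) IDs are the first 12 characters of the barcode.
maf_patients <- unique(substr(maf_barcodes, 1, 12))

# Cases the portal lists that contribute no rows to the MAF.
cases_without_variants <- setdiff(portal_cases, maf_patients)
print(cases_without_variants)
```

With real data, `maf_barcodes` would be `maf$Tumor_Sample_Barcode` from the `GDCprepare` output, and the length of the `setdiff` result should account for the gap between the two counts if the explanation above holds.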
@tiagochst
Thanks for your answer. I use maftools for that: the object returned by read.maf contains a summary table. I just opened that table, and for TCGA-OV it shows 462 samples. I am sharing the screenshot with you. Also, this is the code I run on the TCGA query:
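# `query` is the GDCquery() result for the TCGA-OV query described in this thread
GDCdownload(query)
maf <- GDCprepare(query)
mafr <- maftools::read.maf(maf)
# Number of samples reported in the maftools summary table
print(as.numeric(mafr@summary[mafr@summary$ID == "Samples"]$summary))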
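As far as I can tell, the "Samples" entry in the maftools summary counts distinct `Tumor_Sample_Barcode` values. These are full aliquot-level barcodes, so one patient can contribute more than one. The base R sketch below shows how the same barcodes give different counts depending on how many characters are kept. The barcodes are invented for illustration, and the 12/15-character cut points assume the standard TCGA barcode layout.

```r
# Toy barcodes (made up for illustration, not real TCGA-OV data).
barcodes <- c(
  "TCGA-AA-0001-01A-01D-0001-09",
  "TCGA-AA-0001-01A-01W-0002-09",  # same patient and sample, second aliquot
  "TCGA-AA-0002-01A-01D-0003-09",
  "TCGA-AA-0002-02A-01D-0004-09",  # same patient, different sample type
  "TCGA-AA-0003-01A-01D-0005-09"
)

# Full barcodes: one per aliquot.
n_aliquots <- length(unique(barcodes))
# First 15 characters: patient plus sample type code.
n_samples <- length(unique(substr(barcodes, 1, 15)))
# First 12 characters: patient (case) identifier.
n_patients <- length(unique(substr(barcodes, 1, 12)))

print(c(aliquots = n_aliquots, samples = n_samples, patients = n_patients))
```

Applied to `maf$Tumor_Sample_Barcode`, the same three lines show at which level the count from maftools and the case count from the portal start to diverge.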
I was retrieving the mutation data from the TCGA portal using TCGAbiolinks and realized that the sample sizes are not the same.
For instance, for TCGA-OV the TCGA data portal shows 419 cases, whereas TCGAbiolinks shows 462 samples. The file count is the same for both: 482.
So why are they different?
This is my query in the TCGA data portal:
cases.project.project_id in ["TCGA-OV"] and files.analysis.workflow_type in ["Aliquot Ensemble Somatic Variant Merging and Masking"] and files.data_category in ["Simple Nucleotide Variation"] and files.data_type in ["Masked Somatic Mutation"]
This is the same query in the TCGAbiolinks package: