BioinformaticsFMRP / TCGAbiolinks

http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Sample type comes in pairs for SNV and CNV in TCGA-BRCA #518

Closed: pilargmarch closed this issue 9 months ago

pilargmarch commented 2 years ago

Hi!

I'm querying all TCGA-BRCA samples for SNV (Simple Nucleotide Variation) and CNV (Copy Number Variation) with the following code:

query.snv <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Simple Nucleotide Variation",
  experimental.strategy = "WXS",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking",
  data.type = "Masked Somatic Mutation",
  data.format = "MAF"
)

query.cnv <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Copy Number Variation",
  data.type = "Gene Level Copy Number"
)

Everything seems to run fine. However, I don't understand why the sample types are listed in pairs.

For SNV, table((getResults(query.snv))$sample_type) gives the following output (in a table):

And for CNV, table((getResults(query.cnv))$sample_type) yields:

I can (kind of?) see why this would be the case for SNV, since mutation info comes from tumor-normal aliquot pairs. I went to the GDC Data Portal and downloaded a single .MAF file from a random case, and only the Tumor_Seq_Alleles columns seem to contain data (notice how the Match_Norm_Seq_Alleles columns are empty). I'm not sure why that is.

[Screenshot from 2022-05-31 00-21-49]

[Screenshot from 2022-05-31 00-36-14]
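
For anyone trying to reproduce this, here is a minimal sketch of how the two aliquots in a tumor-normal pair can be told apart from their TCGA barcodes. It assumes the standard barcode layout, in which the fourth hyphen-separated field starts with a two-digit sample type code (01 to 09 for tumor, 10 to 19 for normal, 20 to 29 for control). The barcodes below are made-up placeholders, and `sample_class` is a hypothetical helper written for illustration, not a TCGAbiolinks function.

```r
# Hypothetical helper (not part of TCGAbiolinks): classify TCGA barcodes
# as "tumor", "normal", or "control" from the two-digit sample type code.
sample_class <- function(barcode) {
  # Split each barcode on "-"; the 4th field looks like "01A",
  # where "01" is the sample type code and "A" is the vial.
  fields <- strsplit(barcode, "-", fixed = TRUE)
  code <- vapply(
    fields,
    function(f) as.integer(substr(f[4], 1, 2)),
    integer(1)
  )
  # 01-09: tumor, 10-19: normal, 20-29: control
  ifelse(code <= 9, "tumor", ifelse(code <= 19, "normal", "control"))
}

# Placeholder barcodes for illustration only (not real aliquots)
tumor_barcode  <- "TCGA-XX-0000-01A-11D-0000-09"
normal_barcode <- "TCGA-XX-0000-10A-01D-0000-09"

classes <- sample_class(c(tumor_barcode, normal_barcode))
print(classes)
```

In a GDC MAF file, the same check could presumably be applied to the Tumor_Sample_Barcode and Matched_Norm_Sample_Barcode columns to see which aliquot of the pair each one refers to.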

The same happens with CNV data. On the GDC Data Portal, there is one file per case, and each file lists two associated entries: one from tumor tissue and one from normal tissue. However, the .TSV file itself contains no references to tumor or normal samples.

What does this mean? Is the mutation data for the tumor tissue, the normal tissue or both?

Many thanks!