TCGA mutational status information: which GDC database version did you use?

ChristianRohde commented 1 year ago

Hi,

This seems to be a great resource. This is a question rather than an issue report. May I ask how and when exactly you compiled the information for the TCGA dataset? In your publication you write "Clinical and mutational data were extracted from the GDC Data Portal for TCGA (https://portal.gdc.cancer.gov/projects/TCGA-LAML)". I ask because I fail to reproduce the mutational status of the samples with the current database versions. I downloaded files manually from GDC in 2022 and also tried a fresh download with the TCGAbiolinks R package. So far I have not figured out what is wrong with my approach. My aim is to establish a workflow I can trust for new cohorts from other cancer entities or Beat-AML.

Here are some more details. Interestingly, I still have a TCGA LAML MAF file which I downloaded years ago. Based on the timestamp on my computer, it should be from 2015. It has the same filename as the one listed here (https://docs.google.com/spreadsheets/d/18SS7g6P8QCRL-2uDKS0uvVt_2O9YbtfMcy_ep_SWCus/edit#gid=2) under "original URL to MAF". As expected, it contains 197 tumor IDs. The comments in the Google Docs file mention that 3 IDs have zero mutation calls. When I process the file with maftools, I get the expected mutational spectrum of the TCGA LAML cohort. I also raised an issue on the TCGAbiolinks GitHub page: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/585. So far I have not received a response.

I also compared the information I have (2015 LAML MAF, GDC, TCGAbiolinks) with the supplemental table 41591_2022_1819_MOESM3_ESM.xlsx from your publication. You include the 3 samples without mutations, but skip others. On the other hand, I fail to retrieve many more samples when I use TCGAbiolinks or the MAFs downloaded from GDC:

Screenshot 2023-06-28 at 15 02 35
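A minimal sketch of this kind of sample-list comparison, in case it helps others reproduce it. It assumes each source yields a list of TCGA barcodes and that truncating a barcode to its first 12 characters gives the patient-level ID; the function names and the toy barcodes below are hypothetical and are not taken from any of the files discussed here.

```python
def patient_id(barcode: str) -> str:
    """Truncate a TCGA barcode to its 12-character patient-level ID."""
    return barcode.strip().upper()[:12]


def missing_per_source(sources: dict[str, list[str]]) -> dict[str, set[str]]:
    """For each source, list patient IDs seen in some other source but not in it."""
    ids = {name: {patient_id(b) for b in barcodes}
           for name, barcodes in sources.items()}
    union = set().union(*ids.values())
    return {name: union - found for name, found in ids.items()}


# Toy barcodes for illustration only, not real cohort data.
sources = {
    "maf_2015": ["TCGA-AB-0001-03A-01W", "TCGA-AB-0002-03A-01W"],
    "gdc_2022": ["TCGA-AB-0001-03B-01W"],
}
missing = missing_per_source(sources)
print(missing["gdc_2022"])  # patients absent from the newer download
```

Normalising to the patient level first avoids counting a patient as missing only because the aliquot part of the barcode differs between sources.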

Next, I compared the mutational status of NPM1. The mutation status you share is in line with the mutation information from my old 2015 LAML MAF (red = MUT, blue = WT, white = NA). In contrast, while TCGAbiolinks and the manual GDC download overlap perfectly, both differ greatly from the mutation status in your table and in the 2015 LAML MAF.

Screenshot 2023-06-28 at 15 05 19
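The per-gene comparison can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code behind the plots: it assumes tab-separated MAF text with the standard `Hugo_Symbol` and `Tumor_Sample_Barcode` columns, skips `#` comment lines at the top of the file, and calls a patient WT only if the patient is in an explicitly supplied list of profiled patients. The helper names and toy rows are hypothetical.

```python
import csv


def gene_status(maf_text, gene, profiled_patients):
    """Return {patient: 'MUT' | 'WT'} for one gene from tab-separated MAF text."""
    # Drop blank lines and '#' comment lines that can precede the header row.
    lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    mutated = {row["Tumor_Sample_Barcode"][:12]
               for row in reader if row["Hugo_Symbol"] == gene}
    return {p: ("MUT" if p in mutated else "WT") for p in profiled_patients}


def discordant(status_a, status_b):
    """List (patient, status_a, status_b) where two sources disagree; absent = 'NA'."""
    patients = sorted(set(status_a) | set(status_b))
    pairs = [(p, status_a.get(p, "NA"), status_b.get(p, "NA")) for p in patients]
    return [t for t in pairs if t[1] != t[2]]


# Toy MAF snippets for illustration only.
old_maf = ("Hugo_Symbol\tTumor_Sample_Barcode\n"
           "NPM1\tTCGA-AB-0001-03A\n"
           "FLT3\tTCGA-AB-0002-03A\n")
new_maf = ("#version example\n"
           "Hugo_Symbol\tTumor_Sample_Barcode\n"
           "FLT3\tTCGA-AB-0002-03B\n")
patients = ["TCGA-AB-0001", "TCGA-AB-0002"]

diff = discordant(gene_status(old_maf, "NPM1", patients),
                  gene_status(new_maf, "NPM1", patients))
print(diff)
```

Passing the profiled patients explicitly keeps "no row in the MAF" (WT) separate from "not in this source at all" (NA), which is the distinction the colour coding above relies on.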

Do you have any idea what the explanation could be? Please tell me if you spot a stupid mistake.

Thank you, Christian

andygxzeng commented 1 year ago

Hi Christian,

Thanks for reaching out and flagging this. You're right that there is a discrepancy between different TCGA annotations from different years, and my actual methods are a bit more nuanced than described in the methods section of the paper.

The mutational data that I used for TCGA was downloaded on Nov 10 2020 from cBioPortal, using the Pan-Cancer Atlas annotations from 2018 for mutation status, and incorporating VAF information from the 2013 study when possible.

Below is the link to the Pan-Cancer Atlas mutation annotations (updated in 2018). If you click download beside the title, it will save the entire directory with clinical and mutational annotations. The file I used in 2020 was called data_mutations_mskcc.txt, but it now appears to be called just data_mutations.txt. https://www.cbioportal.org/study/summary?id=laml_tcga_pan_can_atlas_2018

Because I did not find VAF information in the Pan-Cancer Atlas mutation data, I used VAF information from the original 2013 NEJM paper, which I also downloaded from cBioPortal. Clicking download beside the title will give you the same kind of directory; for this I used data_mutations_mskcc.txt, which now appears to be called just data_mutations.txt. https://www.cbioportal.org/study/summary?id=laml_tcga_pub

Using these two files, the 2018 Pan-Cancer Atlas mutation annotations and the 2013 publication mutation annotations, I kept Pan-Cancer Atlas mutations predicted to be deleterious by SIFT and, when possible, included VAF information from the 2013 annotations. Here is the notebook outlining how I processed the data from the two files on Nov 10, 2020. https://github.com/andygxzeng/AMLHierarchies/blob/main/Data/Fig2_Cohort_Deconvolution/tcga_mutations.ipynb
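For orientation, the two steps described above (SIFT filter, then VAF lookup) might look roughly like the sketch below. This is not the code from the notebook. It assumes both files have been parsed into dictionaries with MAF-style columns (`Tumor_Sample_Barcode`, `Hugo_Symbol`, `HGVSp_Short`, `SIFT`, `t_alt_count`, `t_ref_count`), that VAF is computed as alt / (alt + ref), and that mutations are matched on patient, gene, and protein change. The matching key, the helper names, and the toy rows are all assumptions made for illustration.

```python
def is_sift_deleterious(sift_value):
    """True for annotations such as 'deleterious(0.01)' (prefix match, an assumption)."""
    return sift_value.strip().lower().startswith("deleterious")


def vaf(row):
    """Variant allele fraction from read counts, or None if depth is zero."""
    alt, ref = int(row["t_alt_count"]), int(row["t_ref_count"])
    depth = alt + ref
    return alt / depth if depth else None


def mutation_key(row):
    """Hypothetical matching key: patient ID, gene, protein change."""
    return (row["Tumor_Sample_Barcode"][:12], row["Hugo_Symbol"], row["HGVSp_Short"])


def annotate(pancan_rows, pub_rows):
    """Keep SIFT-deleterious Pan-Cancer Atlas rows and attach 2013 VAF where available."""
    vaf_by_key = {mutation_key(r): vaf(r) for r in pub_rows}
    return [{**r, "VAF": vaf_by_key.get(mutation_key(r))}
            for r in pancan_rows if is_sift_deleterious(r.get("SIFT", ""))]


# Toy rows for illustration only.
pancan = [
    {"Tumor_Sample_Barcode": "TCGA-AB-0001-03", "Hugo_Symbol": "DNMT3A",
     "HGVSp_Short": "p.R882H", "SIFT": "deleterious(0)"},
    {"Tumor_Sample_Barcode": "TCGA-AB-0001-03", "Hugo_Symbol": "TET2",
     "HGVSp_Short": "p.L34F", "SIFT": "tolerated(0.3)"},
    {"Tumor_Sample_Barcode": "TCGA-AB-0002-03", "Hugo_Symbol": "IDH2",
     "HGVSp_Short": "p.R140Q", "SIFT": "deleterious(0.01)"},
]
pub_2013 = [
    {"Tumor_Sample_Barcode": "TCGA-AB-0001-03", "Hugo_Symbol": "DNMT3A",
     "HGVSp_Short": "p.R882H", "t_alt_count": "40", "t_ref_count": "60"},
]
kept = annotate(pancan, pub_2013)
print([(r["Hugo_Symbol"], r["VAF"]) for r in kept])
```

Rows without a SIFT value are dropped in this sketch; how such calls are handled in the actual notebook is not covered here, so please check the notebook itself for that.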

I hope this helps to clarify some of the confusion. The variability in how these mutations are annotated across the years, likely depending on the alignment and variant calling pipelines, can be frustrating to deal with. The approach I took was based on the most up-to-date information that I could find pertaining to these samples. I think the annotations for BEAT-AML are much more straightforward, as they were generated more recently and have not been subject to as many iterations of annotation.

Let me know if there is anything else I can help you with and thanks for reaching out!

Andy

andygxzeng commented 1 year ago

Hi Christian,

As an additional note, I am only providing mutational information for the 173 samples that were profiled by RNA-seq, in order to link genomics to cellular hierarchies. If you are looking for a comprehensive genomic analysis, I would use the Pan-Cancer Atlas annotations, which cover the full 197. BEAT-AML also came out with v2, which is expanded and will have more genomic information than provided in my supplement; in addition, it includes relapsed AMLs etc., which I excluded. Hope that helps!

Andy
