AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Proposed Analysis: GISTIC vs. focal-cn-file-preparation comparison #560

Open cbethell opened 4 years ago

cbethell commented 4 years ago

What are the scientific goals of the analysis?

The scientific goal of this analysis is to gather the evidence needed to assess whether the way we handle copy number data in this project is reasonable, and whether it is the best approach available to us.

To do this, we will compare our calls to GISTIC's calls. More specifically, we want to compare the GISTIC gene-level status calls (and, later, cytoband status calls) with the focal CN calls we prepare in the focal-cn-file-preparation module. GISTIC is widely used, and although it relies on recurrence and may not be appropriate for every histology in our cohort, we want to see whether our focal-cn-file-preparation analysis gives the same answer that GISTIC would. Once we've collected this evidence, we will get an expert's opinion on how to interpret it.
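As a rough illustration of the comparison described above, the sketch below counts how often two sets of gene-level calls agree. It assumes both sources have already been reduced to a mapping from (gene_symbol, Kids_First_Biospecimen_ID) to a status string; the `compare_calls` helper, the `GENE_A`/`GENE_B` symbols, and the `loss`/`neutral` labels are hypothetical and are not taken from either module.

```python
from typing import Dict, Optional, Tuple

# Key: (gene_symbol, Kids_First_Biospecimen_ID); value: status string.
# This layout is an assumption made for illustration only.
Calls = Dict[Tuple[str, str], str]


def compare_calls(gistic: Calls, focal: Calls) -> Dict[str, Optional[float]]:
    """Summarize agreement over gene/sample pairs present in both sources."""
    shared = gistic.keys() & focal.keys()
    n_agree = sum(1 for key in shared if gistic[key] == focal[key])
    fraction = n_agree / len(shared) if shared else None
    return {
        "n_shared": len(shared),
        "n_agree": n_agree,
        "fraction_agree": fraction,
    }


# Toy data with placeholder identifiers (hypothetical, not real calls).
gistic_calls = {
    ("TBXAS1", "BS_xxxxxx"): "gain",
    ("GENE_A", "BS_xxxxxx"): "loss",
}
focal_calls = {
    ("TBXAS1", "BS_xxxxxx"): "gain",
    ("GENE_A", "BS_xxxxxx"): "neutral",
    ("GENE_B", "BS_xxxxxx"): "gain",
}

summary = compare_calls(gistic_calls, focal_calls)
print(summary)
```

Pairs that appear in only one source are left out of the agreement fraction here; whether they should instead count as disagreements is an open choice for the actual analysis.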

What methods do you plan to use to accomplish the scientific goals?

I plan to execute this analysis in multiple steps/notebooks:

These steps were broadly attempted in open PR #559. This PR will now be adapted to implement step 1 of the plan above and a separate PR will be filed to address step 2.

What input data are required for this analysis?

The input data required for this includes:

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

- Step 1: ~1 day
- Step 2: ~2 days

(rough estimates)

Who will complete the analysis (please add a GitHub handle here if relevant)?

@cbethell

What relevant scientific literature relates to this analysis?

GISTIC's docs

jaclyn-taroni commented 4 years ago

@cbethell can you say a little bit more about what the output you mention in the quote below will look like:

The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.

Either a sketch or an example markdown table would work.

cbethell commented 4 years ago

@jaclyn-taroni does the following update to the original comment now suffice?

    1. The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:

      | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
      |-------------|---------------------------|--------|----------------|
      | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |

      where gene_symbol values are retrieved from the amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak values are matched between the amp/del files and the all_lesions file.
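
A minimal sketch of how the proposed table could be assembled, assuming the GISTIC files have already been parsed into two plain mappings: peak name to gene symbols (from the amp/del gene files) and peak name to per-sample integer calls (from the all_lesions file). The `tidy_gistic` and `status_for` helpers, the `BS_yyyyyy` placeholder ID, and the `loss`/`neutral` labels are hypothetical; only the column names and the `gain` example come from the table above.

```python
from typing import Dict, List, NamedTuple


class GeneCall(NamedTuple):
    """One row of the proposed tidy table."""
    gene_symbol: str
    Kids_First_Biospecimen_ID: str
    status: str
    detection_peak: str


def status_for(peak: str, call: int) -> str:
    # Assumption: a call of 0 means the sample does not carry the lesion,
    # and any nonzero call takes its direction from the peak name.
    if call == 0:
        return "neutral"
    return "gain" if peak.startswith("Amplification") else "loss"


def tidy_gistic(
    peak_genes: Dict[str, List[str]],
    peak_calls: Dict[str, Dict[str, int]],
) -> List[GeneCall]:
    """Join genes and per-sample calls on their shared peak name."""
    rows = []
    for peak, genes in peak_genes.items():
        # Peaks with no matching entry in the calls mapping yield no rows.
        for sample_id, call in peak_calls.get(peak, {}).items():
            for gene in genes:
                rows.append(
                    GeneCall(gene, sample_id, status_for(peak, call), peak)
                )
    return rows


# Toy inputs standing in for the parsed GISTIC files (placeholder IDs).
peak_genes = {"Amplification Peak 3": ["TBXAS1"]}
peak_calls = {"Amplification Peak 3": {"BS_xxxxxx": 1, "BS_yyyyyy": 0}}

rows = tidy_gistic(peak_genes, peak_calls)
for row in rows:
    print(row)
```

Joining on the peak name mirrors the matching of detection_peak values between the amp/del files and the all_lesions file described above; parsing the raw files into these mappings is left out of the sketch.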

jaclyn-taroni commented 4 years ago

Yep, looks good to me. Thanks for updating!