Proposed Analysis: GISTIC vs. focal-cn-file-preparation comparison

cbethell commented 4 years ago

What are the scientific goals of the analysis?

The scientific goal of this analysis is to gather the evidence needed to assess whether the way we handle copy number data in this project is reasonable, and whether it is the best approach available to us.

To do this, we compare our calls to GISTIC's calls. More specifically, we want to compare the GISTIC gene-level status calls (and, later, cytoband status calls) with the focal CN calls we prepare in the focal-cn-file-preparation module. GISTIC is widely used, and although it relies on recurrence and may not be appropriate for every histology in our cohort, we want to see whether our focal-cn-file-preparation analysis gives the same answer that GISTIC would. Once we've collected this evidence, we will get an expert's opinion on how to interpret it.

What methods do you plan to use to accomplish the scientific goals?

I plan to execute this analysis in multiple steps/notebooks:

1. The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:
  
   | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
   |-------------|---------------------------|--------|----------------|
   | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |

   where gene_symbol values are retrieved from the amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak values are matched between the amp/del files and the all_lesions file.

   This notebook will also produce a separate table with GISTIC's cytoband data for comparison to our cytoband status calls once they are generated (#497). The table will have the following columns:

   | cytoband | Kids_First_Biospecimen_ID | status |
   |----------|---------------------------|--------|

2. The second notebook will take the tidy GISTIC data and, for an individual histology (the focus will be LGAT for now), count the number of samples found in a particular amplification/deletion peak for each corresponding gene (that is, each gene found in the same amplification/deletion peak according to the amp_genes.conf_90.txt and del_genes.conf_90.txt files). The output of the notebook will look similar to this sketch outlined by @jaclyn-taroni:

   [image: sketch of the expected output]
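As a rough illustration of step 1, the tidying could be sketched as below. This is a minimal Python sketch, not the notebook's actual implementation. It assumes that all_lesions.conf_90.txt is tab-delimited with a `Unique Name` column holding the peak name, that it has one column per sample in which `0` means no call, and that it also contains rows ending in `CN values` which should be skipped; please check these against GISTIC's docs. The `tidy_gistic` function, the `peak_genes` mapping (peak name to genes, as would be parsed from the amp/del gene files), the `loss` label, and the `BS_sample*` and `GENE_*` names are all hypothetical.

```python
import csv
import io


def tidy_gistic(lesions_tsv, peak_genes):
    """Build one row per (gene, sample) that has a nonzero call in a GISTIC peak."""
    reader = csv.DictReader(io.StringIO(lesions_tsv), delimiter="\t")
    # Assumption: sample columns are the ones whose header starts with "BS_".
    sample_ids = [name for name in reader.fieldnames if name.startswith("BS_")]
    rows = []
    for lesion in reader:
        peak = lesion["Unique Name"].strip()
        # Assumption: rows ending in "CN values" hold raw values, not calls.
        if peak.endswith("CN values"):
            continue
        if peak.startswith("Amplification"):
            status = "gain"
        elif peak.startswith("Deletion"):
            status = "loss"  # hypothetical label; the issue only shows "gain"
        else:
            continue
        for sample_id in sample_ids:
            # Assumption: "0" (or an empty cell) means no call for this sample.
            if (lesion[sample_id] or "").strip() in ("", "0"):
                continue
            for gene in peak_genes.get(peak, []):
                rows.append({
                    "gene_symbol": gene,
                    "Kids_First_Biospecimen_ID": sample_id,
                    "status": status,
                    "detection_peak": peak,
                })
    return rows


# Toy stand-in for all_lesions.conf_90.txt (made-up values, two samples).
lesions_tsv = (
    "Unique Name\tBS_sample1\tBS_sample2\n"
    "Amplification Peak 3\t1\t0\n"
    "Deletion Peak 1\t0\t2\n"
    "Amplification Peak 3 - CN values\t0.8\t0.0\n"
)
# Toy stand-in for the peak-to-gene mapping from the amp/del gene files.
peak_genes = {
    "Amplification Peak 3": ["TBXAS1"],
    "Deletion Peak 1": ["GENE_A", "GENE_B"],
}

tidy_rows = tidy_gistic(lesions_tsv, peak_genes)
for row in tidy_rows:
    print(row)
```

Keeping one row per gene and sample means the result can be joined to the focal CN file on gene_symbol and Kids_First_Biospecimen_ID.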

These steps were broadly attempted in open PR #559. That PR will now be adapted to implement step 1 of the plan above, and a separate PR will be filed to address step 2.
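The counting in step 2 could look roughly like the following Python sketch. It assumes the tidy table from step 1 is available as a list of rows with the four columns described above, and that the biospecimen IDs for the histology of interest are known. The `count_samples_per_gene` function and all sample and gene values other than TBXAS1 are hypothetical.

```python
from collections import defaultdict


def count_samples_per_gene(tidy_rows, histology_samples):
    """Count distinct samples per (detection_peak, gene_symbol, status)."""
    seen = defaultdict(set)
    for row in tidy_rows:
        sample_id = row["Kids_First_Biospecimen_ID"]
        # Keep only samples that belong to the histology being examined.
        if sample_id not in histology_samples:
            continue
        key = (row["detection_peak"], row["gene_symbol"], row["status"])
        seen[key].add(sample_id)
    return {key: len(sample_ids) for key, sample_ids in seen.items()}


# Toy tidy rows (made-up values) in the step 1 output format.
tidy_rows = [
    {"gene_symbol": "TBXAS1", "Kids_First_Biospecimen_ID": "BS_sample1",
     "status": "gain", "detection_peak": "Amplification Peak 3"},
    {"gene_symbol": "TBXAS1", "Kids_First_Biospecimen_ID": "BS_sample2",
     "status": "gain", "detection_peak": "Amplification Peak 3"},
    {"gene_symbol": "TBXAS1", "Kids_First_Biospecimen_ID": "BS_sample3",
     "status": "gain", "detection_peak": "Amplification Peak 3"},
    {"gene_symbol": "GENE_A", "Kids_First_Biospecimen_ID": "BS_sample2",
     "status": "loss", "detection_peak": "Deletion Peak 1"},
]
# Hypothetical set of biospecimen IDs for one histology (e.g. LGAT).
lgat_samples = {"BS_sample1", "BS_sample2"}

counts = count_samples_per_gene(tidy_rows, lgat_samples)
for key, n_samples in counts.items():
    print(key, n_samples)
```

Using a set per key avoids double counting a sample that appears more than once for the same gene and peak.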

What input data are required for this analysis?

The input data required for this analysis are:

- analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz
- analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic.zip
- analyses/run-gistic/results/pbta-cnv-consensus-gistic.zip
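Since the GISTIC results are zipped, one way to pull out the three files used in step 1 is sketched below in Python. The directory layout inside the archives is not stated here, so the sketch matches on file basename only. The `read_gistic_members` helper and the `toy_dir/` layout in the demo are hypothetical.

```python
import io
import zipfile

GISTIC_FILES = (
    "all_lesions.conf_90.txt",
    "amp_genes.conf_90.txt",
    "del_genes.conf_90.txt",
)


def read_gistic_members(zip_source):
    """Return {basename: text} for the GISTIC files found anywhere in the zip."""
    found = {}
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            # Match on basename because the internal directory is an unknown.
            basename = member.rsplit("/", 1)[-1]
            if basename in GISTIC_FILES:
                found[basename] = archive.read(member).decode("utf-8")
    return found


# Demo on an in-memory archive with a made-up layout and made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("toy_dir/all_lesions.conf_90.txt", "Unique Name\n")
    archive.writestr("toy_dir/amp_genes.conf_90.txt", "cytoband\n")
    archive.writestr("toy_dir/other_file.txt", "ignored\n")
buffer.seek(0)

contents = read_gistic_members(buffer)
print(sorted(contents))
```

The same function accepts a file path in place of the in-memory buffer, for example one of the zip files listed above.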

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

Step 1: ~1 day; Step 2: ~2 days (rough estimates)

Who will complete the analysis (please add a GitHub handle here if relevant)?

@cbethell

What relevant scientific literature relates to this analysis?

GISTIC's docs

jaclyn-taroni commented 4 years ago

@cbethell can you say a little bit more about what the output you mention in the quote below will look like:

> The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.

Either a sketch or an example markdown table would work.

cbethell commented 4 years ago

> @cbethell can you say a little bit more about what the output you mention in the quote below will look like:
>
> > The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.
>
> Either a sketch or an example markdown table would work.

@jaclyn-taroni does the following update to the original comment suffice?

> The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:
>
> | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
> |-------------|---------------------------|--------|----------------|
> | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |
>
> where gene_symbol values are retrieved from the amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak values are matched between the amp/del files and the all_lesions file.


jaclyn-taroni commented 4 years ago

Yep, looks good to me. Thanks for updating!

AlexsLemonade / OpenPBTA-analysis