Open cbethell opened 4 years ago
@cbethell can you say a little bit more about what the output you mention in the quote below will look like:
> The first notebook will tidy the GISTIC data files (`all_lesions.conf_90.txt`, `amp_genes.conf_90.txt`, and `del_genes.conf_90.txt`). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the `focal-cn-file-preparation/results` files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.
Either a sketch or an example markdown table would work.
@jaclyn-taroni does the following update to the original comment now suffice?
> The first notebook will tidy the GISTIC data files (`all_lesions.conf_90.txt`, `amp_genes.conf_90.txt`, and `del_genes.conf_90.txt`). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the `focal-cn-file-preparation/results` files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:
>
> | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
> |---|---|---|---|
> | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |
>
> where `gene_symbol` values are retrieved from the `amp_genes.conf_90.txt`/`del_genes.conf_90.txt` files, `Kids_First_Biospecimen_ID` and `status` values are retrieved from the `all_lesions.conf_90.txt` file, and `detection_peak` values are matched between the `amp`/`del` files and the `all_lesions` file.
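A minimal sketch of how such a table could be assembled, written in Python purely for illustration (the notebooks themselves may well use a different language). It assumes the three GISTIC files have already been parsed into two toy structures, `peak_genes` and `lesion_calls`. Those names, both helper functions, the `loss` and `neutral` labels, the `GENE_A`/`GENE_B` symbols, and the rule that any positive call counts as an event are assumptions made for the example.

```python
# Toy stand-ins for already-parsed GISTIC output (parsing itself is omitted).

# Detection peak -> gene symbols, as would come from amp_genes/del_genes.
peak_genes = {
    "Amplification Peak 3": ["TBXAS1"],
    "Deletion Peak 1": ["GENE_A", "GENE_B"],  # hypothetical symbols
}

# Detection peak -> {biospecimen ID -> call}, as would come from all_lesions.
# Assumption: 0 means "no event" and any positive value means "event".
lesion_calls = {
    "Amplification Peak 3": {"BS_xxxxxx": 1, "BS_yyyyyy": 0},
    "Deletion Peak 1": {"BS_xxxxxx": 0, "BS_yyyyyy": 2},
}


def peak_status(peak, call):
    """Translate a per-peak call into a status label (labels are assumed)."""
    if call <= 0:
        return "neutral"
    return "gain" if peak.startswith("Amplification") else "loss"


def tidy_gistic(peak_genes, lesion_calls):
    """Return one row per gene, biospecimen, and detection peak."""
    rows = []
    for peak, genes in peak_genes.items():
        # Peaks are matched by name between the two inputs.
        for biospecimen, call in lesion_calls.get(peak, {}).items():
            for gene in genes:
                rows.append({
                    "gene_symbol": gene,
                    "Kids_First_Biospecimen_ID": biospecimen,
                    "status": peak_status(peak, call),
                    "detection_peak": peak,
                })
    return rows


tidy_rows = tidy_gistic(peak_genes, lesion_calls)
```

Each element of `tidy_rows` is a dictionary with the four proposed columns, so the list could be written out with `csv.DictWriter` or loaded into a data frame.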
Yep, looks good to me. Thanks for updating!
What are the scientific goals of the analysis?
The scientific goal of this analysis is to gather the evidence needed to assess whether the way we handle copy number data in this project is reasonable and is the best approach available to us.
To do this, we compare our calls to GISTIC's calls. More specifically, we want to compare the GISTIC gene-level status calls (and cytoband status calls later) with the focal CN calls we prepare in the `focal-cn-file-preparation` module. GISTIC is widely used, and although it relies on recurrence and may not be appropriate for every histology in our cohort, we want to see whether our `focal-cn-file-preparation` analysis gives us the same answer that GISTIC would. Once we've collected this evidence, we will get an expert's opinion on its interpretation.
What methods do you plan to use to accomplish the scientific goals?
I plan to execute this analysis in multiple steps/notebooks:
1. The first notebook will tidy the GISTIC data files (`all_lesions.conf_90.txt`, `amp_genes.conf_90.txt`, and `del_genes.conf_90.txt`). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the `focal-cn-file-preparation/results` files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:

   | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
   |---|---|---|---|
   | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |

   where `gene_symbol` values are retrieved from the `amp_genes.conf_90.txt`/`del_genes.conf_90.txt` files, `Kids_First_Biospecimen_ID` and `status` values are retrieved from the `all_lesions.conf_90.txt` file, and `detection_peak` values are matched between the `amp`/`del` files and the `all_lesions` file.

   This notebook will also produce a separate table with GISTIC's cytoband data for comparison to our cytoband status calls once they are generated (#497). The table will have the following columns:

2. […] `amp_genes.conf_90.txt` and `del_genes.conf_90.txt` files). The output of the notebook will look similar to this sketch outlined by @jaclyn-taroni:

These steps were broadly attempted in open PR #559. This PR will now be adapted to implement step 1 of the plan above, and a separate PR will be filed to address step 2.
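One way the step 2 comparison could work is sketched below in Python, using toy rows in place of the real tables. It is only an illustration: the focal CN column names (`biospecimen_id`, `gene_symbol`, `status`), the example values, and the `compare_calls` helper are assumptions, and the real notebook may summarize agreement differently.

```python
from collections import Counter

# Toy rows standing in for the tidied GISTIC table from step 1.
gistic_rows = [
    {"gene_symbol": "TBXAS1", "Kids_First_Biospecimen_ID": "BS_xxxxxx", "status": "gain"},
    {"gene_symbol": "TBXAS1", "Kids_First_Biospecimen_ID": "BS_yyyyyy", "status": "neutral"},
]

# Toy rows standing in for the focal CN file; column names are assumed.
focal_rows = [
    {"gene_symbol": "TBXAS1", "biospecimen_id": "BS_xxxxxx", "status": "gain"},
    {"gene_symbol": "TBXAS1", "biospecimen_id": "BS_yyyyyy", "status": "loss"},
    {"gene_symbol": "GENE_A", "biospecimen_id": "BS_yyyyyy", "status": "loss"},
]


def compare_calls(gistic_rows, focal_rows):
    """Count agreement between the two call sets per gene and biospecimen."""
    # Index the focal calls by (gene, biospecimen) for constant-time lookup.
    focal = {
        (row["gene_symbol"], row["biospecimen_id"]): row["status"]
        for row in focal_rows
    }
    counts = Counter()
    for row in gistic_rows:
        key = (row["gene_symbol"], row["Kids_First_Biospecimen_ID"])
        if key not in focal:
            counts["missing_in_focal"] += 1
        elif focal[key] == row["status"]:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    return counts


agreement = compare_calls(gistic_rows, focal_rows)
```

The lookup runs from the GISTIC rows into the focal calls, so gene and biospecimen pairs that appear only in the focal table (the hypothetical `GENE_A` row here) are not counted.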
What input data are required for this analysis?
The input data required for this analysis include:
analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz
analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic.zip
analyses/run-gistic/results/pbta-cnv-consensus-gistic.zip
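Because the GISTIC results are zip archives, a single file can be read without unpacking the whole archive. The Python sketch below shows one way to do that with the standard library. The archive's internal directory layout is not stated here, so the hypothetical helper `read_zip_member_rows` matches on the end of the member name, and the demo archive and its contents are made up for the example.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def read_zip_member_rows(zip_path, member_suffix):
    """Return tab-delimited rows from the first member ending in member_suffix."""
    with zipfile.ZipFile(zip_path) as archive:
        matches = [name for name in archive.namelist() if name.endswith(member_suffix)]
        if not matches:
            raise FileNotFoundError(f"no archive member ends with {member_suffix!r}")
        with archive.open(matches[0]) as handle:
            # Wrap the binary stream so csv can read it as text.
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(csv.reader(text, delimiter="\t"))


# Demo on a throwaway archive with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = Path(tmp_dir) / "demo-gistic.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("some_dir/amp_genes.conf_90.txt", "col_a\tcol_b\nx\ty\n")
    demo_rows = read_zip_member_rows(demo_zip, "amp_genes.conf_90.txt")
```

The gzipped focal CN table could be opened in a similar way with `gzip.open(path, "rt")` and passed to `csv.reader`.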
How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?
Step 1: ~1 day
Step 2: ~2 days
(rough estimates)
Who will complete the analysis (please add a GitHub handle here if relevant)?
@cbethell
What relevant scientific literature relates to this analysis?
GISTIC's docs