AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Planned Analysis: Integrated CNV and SV analyses and chromothripsis #27

Closed jharenza closed 4 years ago

jharenza commented 5 years ago

We have generated CNV output from Control-FREEC and CNVkit, but we are seeking individuals to determine consensus focal calls and/or identify additional algorithms we can run to increase confidence in focal CNV calls from the WGS dataset.

cgreene commented 5 years ago

After https://github.com/AlexsLemonade/OpenPBTA-manuscript/pull/15 is approved and merged, can you write up the CNV methods and file a PR into that subsection so that we can link folks to the current version of the processing code?

It may change in the future, but then we'll have an accurate manuscript-ready description of what was done.

jharenza commented 5 years ago

This machine learning publication may help us with CN true positives:

jharenza commented 5 years ago

Yes - will work on getting this filled in by the harmonization team.

gonzolgarcia commented 5 years ago

Integrated CNV and SV analyses and chromothripsis.

The proposed analyses broadly address the prevalence and functional impact of structural variation across brain tumors. It is important to note that copy number variations are essentially a subset of structural variants; as such, CNV and SV calls are highly overlapping and complementary and should be studied together. I am effectively proposing to merge issues #27 and #28.

To integrate CNV calls and SV calls, we focus on breakpoint co-localization. More details are in the manuscript: https://www.biorxiv.org/content/10.1101/572248v3
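As a rough illustration of breakpoint co-localization (not the method from the linked manuscript), the sketch below reports CNV segment boundaries that have an SV breakpoint nearby on the same chromosome. The function name `colocalized_breakpoints` and the 10 kb window are assumptions chosen for the example.

```python
from bisect import bisect_left
from collections import defaultdict


def colocalized_breakpoints(cnv_breaks, sv_breaks, window=10_000):
    """Return CNV breakpoints that have an SV breakpoint within `window` bp.

    Both inputs are iterables of (chromosome, position) tuples.
    The window size is an illustrative assumption, not a recommended value.
    """
    # Index SV breakpoint positions by chromosome, sorted for binary search.
    sv_by_chrom = defaultdict(list)
    for chrom, pos in sv_breaks:
        sv_by_chrom[chrom].append(pos)
    for positions in sv_by_chrom.values():
        positions.sort()

    matched = []
    for chrom, pos in cnv_breaks:
        positions = sv_by_chrom.get(chrom, [])
        # First SV breakpoint at or after the left edge of the window.
        i = bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            matched.append((chrom, pos))
    return matched


cnv = [("chr1", 1_000_000), ("chr1", 5_000_000), ("chr2", 1_000_000)]
sv = [("chr1", 1_004_000), ("chr3", 1_000_000)]
print(colocalized_breakpoints(cnv, sv))
```

Only the first CNV breakpoint has an SV breakpoint within 10 kb on the same chromosome, so it is the only one reported.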

Chromothripsis is a catastrophic one-time event involving multiple breakpoints and rearrangements of localized regions in the genome, as opposed to chromoplexy, which involves gradually acquired structural variations. Chromothripsis can be identified by a pattern of oscillating copy number states and concomitant structural variants that allow walking through the newly formed chromosome. In practical terms, it can be identified as regions with an abnormally high number of CNVs and SVs. Several methods are available, all of which have limitations: ShatterSeek (https://github.com/parklab/ShatterSeek), Shatterproof (https://metacpan.org/release/SGOVIND/Shatterproof-0.13), and an unnamed method (https://www.biorxiv.org/content/10.1101/572248v3) that focuses on regions in which SV density is 2 standard deviations above the average of each sample.
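The density rule mentioned last (SV density 2 standard deviations above the sample average) could be sketched as follows. This is a simplified illustration, not ShatterSeek, Shatterproof, or the preprint's implementation; the fixed 1 Mb bins, the function name `sv_hotspot_bins`, and the use of the population standard deviation are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def sv_hotspot_bins(breakpoints, chrom_lengths, bin_size=1_000_000, n_sd=2.0):
    """Flag bins whose breakpoint count exceeds mean + n_sd * SD for one sample.

    breakpoints: iterable of (chromosome, position) tuples for ONE sample.
    chrom_lengths: dict of chromosome -> length in bp, so empty bins are
    included when computing the sample-wide mean and standard deviation.
    """
    counts = Counter((chrom, pos // bin_size) for chrom, pos in breakpoints)
    all_bins = [
        (chrom, b)
        for chrom, length in chrom_lengths.items()
        for b in range((length + bin_size - 1) // bin_size)
    ]
    per_bin = [counts.get(bin_id, 0) for bin_id in all_bins]
    threshold = mean(per_bin) + n_sd * pstdev(per_bin)
    return sorted(b for b in all_bins if counts.get(b, 0) > threshold)


# Toy sample: eight breakpoints clustered in one 1 Mb bin, two scattered.
toy_breaks = [("chr1", 3_000_000 + i * 1_000) for i in range(8)]
toy_breaks += [("chr1", 500_000), ("chr1", 7_500_000)]
print(sv_hotspot_bins(toy_breaks, {"chr1": 10_000_000}))
```

A flagged bin is only a candidate region. Calling chromothripsis would still require checking for oscillating copy number states, as described above.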

The input formats for developing downstream analyses are:

CNV segmentation data: SampleId, chromosome, Start, End, num_probes (deprecated, from SNP array format), Segment_Mean (log T/N)

Allele-specific CNV (optional; defines regions of LOH and allelic imbalance): SampleId, chromosome, Start, End, BAF_mean, Call (LOH or AI)

SV calls file content (already filtered by Somatic Score; no need to be annotated): SampleId, Chromosome-origin, Start-origin, End-origin, Chromosome-destination, Start-destination, End-destination, sv_type: DEL, DUP, TRA and INV (often divided into head-to-head and tail-to-tail)
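A minimal sketch of how these inputs might be checked before analysis, assuming tab-delimited files with a header row that uses exactly the column names listed above. The file layout and the helper names `read_table` and `unexpected_sv_types` are assumptions, not part of the project's code.

```python
import csv
import io

# Column names follow the lists above; the exact spelling and the
# tab-delimited layout are assumptions for illustration.
CNV_COLUMNS = ["SampleId", "chromosome", "Start", "End", "num_probes", "Segment_Mean"]
SV_COLUMNS = [
    "SampleId",
    "Chromosome-origin", "Start-origin", "End-origin",
    "Chromosome-destination", "Start-destination", "End-destination",
    "sv_type",
]
SV_TYPES = {"DEL", "DUP", "TRA", "INV"}


def read_table(handle, required_columns):
    """Read a tab-delimited table with a header row and check its columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in required_columns if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def unexpected_sv_types(sv_rows):
    """Return any sv_type values outside DEL, DUP, TRA and INV."""
    return sorted({row["sv_type"] for row in sv_rows} - SV_TYPES)


example = io.StringIO(
    "\t".join(SV_COLUMNS) + "\n"
    + "\t".join(["S1", "chr1", "100", "101", "chr1", "900", "901", "DEL"]) + "\n"
    + "\t".join(["S1", "chr2", "500", "501", "chr7", "300", "301", "BND"]) + "\n"
)
sv_rows = read_table(example, SV_COLUMNS)
print(unexpected_sv_types(sv_rows))
```

In this toy input the second row uses `BND`, which is not one of the four listed types, so it is reported for manual review.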

Some proposed readouts and output analyses

Structural variation:

1) A measure of chromosomal instability (CIN) burden (density of breakpoints per Mb; similar to tumor mutational burden, TMB) and a plot by tumor type representing CIN burden (this could be compared to TMB).
2) Recurrently altered genes (perhaps integrated in an OncoPrint with SNVs?). For the oncoprint categories:
3) Focus on novel findings. If a new recurrently altered gene arises, we will analyze it in depth.

Chromothripsis:

4) A barplot of chromothripsis prevalence by tumor subtype.
5) A few Circos plots with examples of chromothripsis.
6) Association of chromothripsis with other somatic alterations (e.g., TP53 status).

Survival analyses (probably addressed in issue #18):

7) Multivariate analyses including clinical variables as well as overall TMB, chromosomal instability burden, and chromothripsis.
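The CIN burden readout (breakpoints per Mb) could be computed along these lines. This sketch assumes that each SV contributes two breakpoints (origin and destination) and that the caller supplies the size of the assessed genome in Mb. Both choices, and the function name `cin_burden`, are assumptions rather than a definition agreed in this thread.

```python
from collections import defaultdict


def cin_burden(sv_rows, genome_size_mb):
    """Return breakpoints per Mb for each sample.

    sv_rows: iterable of dicts with at least a "SampleId" key, one per SV.
    genome_size_mb: size of the assessed genome in Mb (supplied by the caller).
    Each SV is counted as two breakpoints, which is an assumption.
    """
    breakpoints = defaultdict(int)
    for row in sv_rows:
        breakpoints[row["SampleId"]] += 2
    return {sample: n / genome_size_mb for sample, n in breakpoints.items()}


toy_svs = [{"SampleId": "A"}, {"SampleId": "A"}, {"SampleId": "A"}, {"SampleId": "B"}]
burden = cin_burden(toy_svs, genome_size_mb=3000)
print(burden)
```

Samples with no SV calls do not appear in the result, so a real analysis would need to add them with a burden of zero before plotting by tumor type.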

jharenza commented 5 years ago

merged #27 and #28 here per @gonzolgarcia's request

gonzolgarcia commented 5 years ago

Issue with lumpy data

While trying to filter somatic SVs from the table, I realized that the evidence columns "Tumor" and "Normal" are switched.

In addition, there is no somatic score, and I haven't found many guidelines for somatic filtering of tumor/normal LUMPY results. I will be considering this: https://github.com/arq5x/lumpy-sv/issues/268

jharenza commented 5 years ago

Thanks, @gonzolgarcia! You are right, the T/N columns are swapped - we will fix this in the V5 release coming next week.

guru-yang commented 5 years ago

The Yang Lab will perform analysis on chromothripsis.

gonzolgarcia commented 5 years ago

Note that there are two callers for CNV (CNVkit & Control-FREEC) and SV (Manta & LUMPY). These datasets still require some further processing and filtering.

jaclyn-taroni commented 5 years ago

@gonzolgarcia are you planning to generate SV consensus calls?

gonzolgarcia commented 5 years ago

Before getting a consensus, LUMPY requires somatic filtering. It would be nice to have this added to the next release.

jharenza commented 5 years ago

@guru-yang - do you have any experience with somatic filtering of LUMPY SVs? The comment referred to here suggests the following:

1. Run SVTyper - docker

2. Filter for somatic calls:
   a) keep non-reference SVs in the tumor;
   b) keep SVs which have no alternate depth (AO==0) in normal;
   c) keep SVs with sufficient depth in the normal (RO>~7)
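The filtering rules in step 2 can be sketched as a simple predicate over per-sample genotype fields. This assumes the SVTyper-genotyped calls have already been parsed into dictionaries holding the `GT`, `AO`, and `RO` fields. The function name `is_somatic_sv` and the exact cutoff of 7 (taken from the approximate "RO>~7" above) are assumptions that would need benchmarking.

```python
def is_somatic_sv(tumor, normal, min_normal_ro=7):
    """Apply the three suggested somatic filters to one SV call.

    tumor, normal: dicts with SVTyper-style fields
      GT (genotype string), AO (alternate depth), RO (reference depth).
    min_normal_ro: illustrative cutoff for "sufficient depth" in the normal.
    """
    # a) keep non-reference SVs in the tumor
    tumor_non_ref = tumor["GT"] not in {"0/0", "./."}
    # b) keep SVs with no alternate depth in the normal
    normal_no_alt = normal["AO"] == 0
    # c) keep SVs with sufficient reference depth in the normal
    normal_covered = normal["RO"] > min_normal_ro
    return tumor_non_ref and normal_no_alt and normal_covered


calls = [
    ({"GT": "0/1", "AO": 12, "RO": 20}, {"GT": "0/0", "AO": 0, "RO": 30}),  # somatic
    ({"GT": "0/1", "AO": 9, "RO": 18}, {"GT": "0/1", "AO": 4, "RO": 25}),   # in normal
    ({"GT": "1/1", "AO": 15, "RO": 0}, {"GT": "0/0", "AO": 0, "RO": 3}),    # low depth
]
print([is_somatic_sv(t, n) for t, n in calls])
```

The second call is rejected because the normal carries alternate reads, and the third because the normal has too little reference depth to rule out a germline event.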

guru-yang commented 5 years ago

We haven't used LUMPY at all. The filtering steps sound reasonable. Based on my experience, Manta alone might be good enough for SV calling.

gonzolgarcia commented 5 years ago

You're probably right. Manta alone + CNVkit should be enough for ShatterSeek?

guru-yang commented 5 years ago

Should be enough.

jharenza commented 5 years ago

Great! @guru-yang and @gonzolgarcia - you can plan to use Manta + CNVkit for ShatterSeek, and then we can work on a filtered LUMPY data file for release in the next few weeks for general recurrent SV analysis.

jharenza commented 5 years ago

@guru-yang and @gonzolgarcia as an update, we are going to remove LUMPY from the release. SVTyper processing takes very long per sample (>10 hours) and will require some benchmarking for filtering, which we have de-prioritized in favor of benchmarking copy number. You have both said Manta is fine, so we will drop LUMPY. We will have a data release with new CN results coming next week (#146), so please let us know if you need help with creating PRs!

guru-yang commented 5 years ago

@jharenza Thanks for the update. I am wondering how to get sample metadata. We are able to get gender, age at diagnosis, and tumor type from the Kids First data portal. In order to perform survival analysis, age at last follow-up would be needed. Do you know how to get that information? Is there any other information available for the patients or their parents, such as parental smoking or alcohol consumption?

cgreene commented 5 years ago

@guru-yang : have you examined the metadata available in the files associated with this project? Once you do, could you file a new issue noting anything that's missing that you'd need for your analysis? Thanks!

guru-yang commented 5 years ago

@cgreene I am able to find overall survival in pedcbioportal. Thanks.

jaclyn-taroni commented 5 years ago

Hi @guru-yang - overall survival, gender, age at diagnosis, and tumor type are all available in the pbta-histologies.tsv file, which is part of the data files obtained by running the download-data.sh script.

We need people to use that file when putting together their analyses because that ensures that different contributors that are working independently are using the same information across their analyses (e.g., the same overall survival values). If there are additional fields you would like to see in the pbta-histologies.tsv file, please file a new issue requesting that information. Thank you!

jharenza commented 5 years ago

@guru-yang as @jaclyn-taroni mentioned, the survival is in the provided histologies file in the data download. It is better to use this file, as we have further categorized tumors and provided additional data not in the KF portal. We do not have age at last followup in the file currently, but it can be added in the release due next week. Can you please file an issue for that? We have no parental information available, but if there are other things you would like to see from patients, you can also ask in an issue and I can check whether we have the info available.

guru-yang commented 5 years ago

@jaclyn-taroni @jharenza I see. Thanks a lot. What about smoking and alcohol usage for the probands? I don't expect smokers in a pediatric cohort. Just curious.

cgreene commented 5 years ago

@guru-yang : please file a new github issue with requests for metadata so that we can keep this issue, currently titled "Planned Analysis: Integrated CNV and SV analyses and chromothripsis" on that topic. Thanks!

jharenza commented 5 years ago

Hi @gonzolgarcia and @guru-yang! When do you think you will be able to file a pull request with either of your analyses? Thanks!

guru-yang commented 5 years ago

@jharenza We have made some progress. Is there a regular conference call or similar to share results among the group? Or is everything done through GitHub?

jaclyn-taroni commented 5 years ago

Hi @guru-yang, great to hear! We encourage you to file pull requests adding the code used to generate results as you have them. The analysis does not need to be complete before getting added to the repository. We have a pull request template with a section for summarizing results to facilitate discussion. You can join the Cancer Data Science Slack #open-pbta channel (more information here) if you have questions about the pull request model that are better answered in real-time.

cgreene commented 5 years ago

I will echo @jaclyn-taroni and @jharenza : please file pull requests adding code as you are writing it. It is much harder to integrate a large amount of code after it is entirely written. Thanks!

guru-yang commented 5 years ago

@jharenza @jaclyn-taroni @cgreene Will try to do that soon. I am traveling this week. One quick question: we have seen quite a few patients with more than one tumor sequenced. When working on variants, is there a particular strategy to handle these tumors, such as randomly picking one?

jashapiro commented 5 years ago

As of the v7 release, we now provide lists of independent specimens (one tumor per individual) that we would like analyses to use. These are randomly selected, as you suggest, but this allows everyone to use consistent sets. See the bottom of the Data Formats section of the README for descriptions of those files.
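For illustration only, random selection of one tumor per individual with a fixed seed might look like the sketch below. Analyses should use the released independent specimen lists rather than regenerate them. The function name `pick_independent_specimens`, the seed value, and the identifiers are hypothetical.

```python
import random
from collections import defaultdict


def pick_independent_specimens(specimens, seed=2019):
    """Pick one specimen per participant, reproducibly.

    specimens: iterable of (participant_id, specimen_id) tuples.
    seed: fixed so that every run yields the same selection
          (the value itself is arbitrary).
    """
    by_participant = defaultdict(list)
    for participant, specimen in specimens:
        by_participant[participant].append(specimen)

    rng = random.Random(seed)
    # Sort participants and specimens so the result does not depend on input order.
    return {
        participant: rng.choice(sorted(candidates))
        for participant, candidates in sorted(by_participant.items())
    }


toy_specimens = [("P1", "T1a"), ("P1", "T1b"), ("P2", "T2a"), ("P3", "T3b"), ("P3", "T3a")]
selected = pick_independent_specimens(toy_specimens)
print(selected)
```

Fixing the seed and sorting the inputs is what lets independent contributors arrive at the same set, which is the point of distributing a shared list.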

guru-yang commented 5 years ago

I noticed that in some samples the CNV calls from the two algorithms are quite different. I wonder what the plan is going forward. It seems to me that generating a consensus CNV call is not easy.

jaclyn-taroni commented 5 years ago

Hi @guru-yang - have you taken a look at the copy number consensus issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128?

gonzolgarcia commented 5 years ago

Hello everyone, I wanted to apologize for my lack of contribution to this issue, which I initially proposed. Unfortunately, the requirements of my new position at Mount Sinai have left me with very little bandwidth. For the time being, I cannot guarantee that I will be contributing steadily to this issue. However, I'd be happy to provide support if still needed, as I am working on developing new tools for the integrated analysis of CNVs and structural variations. Best regards to everyone.

jaclyn-taroni commented 4 years ago

I filed two more focused issues based on what analyses are in progress vs. those that are not currently accounted for: #393 and #394

Closing this.