Guide on prioritization of outliers

geneticsmcgill commented 4 years ago

Hello there,

Is there a guide for prioritizing outliers, e.g. focusing on STRs with a Z-score >10? How do we filter out all the background noise?

egor-dolzhenko commented 4 years ago

Thank you for the question! Unfortunately, there is no guide on prioritizing outliers yet. We will add a task to our development roadmap to create such a guide and build additional tools for filtering EHdn results.

A search for pathogenic repeats is a very complex task and the current version of EHdn aims to address the initial step of generating a list of candidate expansions. According to my observations, loci with top Z-scores tend to correspond to real expansion events (which, of course, might have no functional consequences). Thus it seems reasonable to prioritize EHdn findings by proximity to exons (or other functional genomic regions) and by the shape of the count distribution (pathogenic repeats might be less likely to correspond to count distributions with very long tails in healthy controls).

We are also working on enabling EHdn to annotate precise reference coordinates of repeats. This will make it possible to perform follow-up analyses with targeted methods and apply more aggressive filters.

Perhaps @hrafehi, @mfbennett, or @Phillip-a-richmond might be able to offer additional advice. They all have extensive experience with these types of analyses.

-Egor

hrafehi commented 4 years ago

Hi there,

I have tested some different ways of prioritizing the results from the outlier scripts with some success.

Assuming you are looking for potentially pathogenic expansions in a disease cohort, I would suggest first annotating the output data, removing all intergenic STRs, and focusing your analysis on STRs that intersect genes, especially at 5' UTRs, exons or introns.

Additional filtering that I find useful is to filter by motif length (e.g. focus first on motifs of 2-6 bp), and also to filter for motifs already associated with disease (CAG, AAGGG, CCG, AAAGT, etc.). This will remove a lot of background noise.

Another way to filter is to run the outlier script by motif rather than by locus. This will tell you if any specific motif is enriched in individual samples. For example, if CAGs are enriched in sample X, you can go back to the locus-specific script and filter for STRs with that motif in sample X.
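The filters described above (genic regions only, motif length 2-6 bp, known disease motifs first) could be sketched roughly as follows. This is only an illustration, not part of EHdn: the column names (`motif`, `region`, `top_case_zscore`) and the ANNOVAR-style region labels are assumptions about how your annotated table looks, so adjust them to your own file. Motifs are compared after reducing them to a canonical rotation, in case the table reports a rotation or the reverse complement of the motif you have in mind.

```python
# Hypothetical sketch: column names and region labels are assumptions,
# not a documented EHdn format.
KNOWN_DISEASE_MOTIFS = {"CAG", "AAGGG", "CCG", "AAAGT"}  # examples from this thread
GENIC_REGIONS = {"exonic", "intronic", "UTR5"}  # assumed ANNOVAR-style labels

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical_motif(motif):
    """Return the alphabetically smallest rotation of a motif or its reverse complement."""
    revcomp = "".join(_COMPLEMENT.get(base, "N") for base in reversed(motif))
    rotations = [s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s))]
    return min(rotations)


KNOWN_CANONICAL = {canonical_motif(m) for m in KNOWN_DISEASE_MOTIFS}


def prioritize(rows, min_len=2, max_len=6):
    """Keep genic loci with short motifs; list known disease motifs first."""
    kept = []
    for row in rows:
        if row["region"] not in GENIC_REGIONS:
            continue  # drop intergenic (and any other non-listed) regions
        if not (min_len <= len(row["motif"]) <= max_len):
            continue  # drop motifs outside the chosen length range
        is_known = canonical_motif(row["motif"]) in KNOWN_CANONICAL
        kept.append(dict(row, known_motif=is_known))
    # Known motifs first, then highest z-score first.
    kept.sort(key=lambda r: (not r["known_motif"], -float(r["top_case_zscore"])))
    return kept
```

The rows can come from `csv.DictReader(handle, delimiter="\t")` on the annotated table. Sorting instead of hard-filtering on known motifs keeps novel motifs in the list, just lower down.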

Even after filtering, your short list will still be pretty long. The next step is to take your short list and create input files for use with other tools, such as ExpansionHunter or exSTRa, to computationally confirm your findings. These tools will confirm whether any sample is indeed an outlier for that STR. ExpansionHunter will estimate the allele size, and exSTRa will detect outliers and whether the STR is heterozygous or homozygous. We describe this method in the following papers:

https://www.biorxiv.org/content/10.1101/851675v2.full
https://www.ncbi.nlm.nih.gov/pubmed/31230722
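As a rough illustration of the "create input files" step, here is a minimal sketch that turns a short list into an ExpansionHunter-style variant catalog. The input keys (`contig`, `start`, `end`, `motif`) and the function name are assumptions for the example. Note also that EHdn regions are approximate, while ExpansionHunter expects the precise reference coordinates of the repeat, so the coordinates would need to be refined by hand before the catalog is usable.

```python
import json


def to_variant_catalog(loci):
    """Build variant catalog entries from shortlisted loci.

    Each locus is a dict with the assumed keys contig, start, end and motif.
    Coordinates are copied through unchanged and are NOT refined here.
    """
    catalog = []
    for locus in loci:
        contig, motif = locus["contig"], locus["motif"]
        start, end = int(locus["start"]), int(locus["end"])
        catalog.append({
            "LocusId": f"{contig}_{start}_{end}_{motif}",  # arbitrary unique label
            "LocusStructure": f"({motif})*",
            "ReferenceRegion": f"{contig}:{start}-{end}",
            "VariantType": "Repeat",
        })
    return catalog


if __name__ == "__main__":
    shortlist = [{"contig": "chr1", "start": "1000", "end": "1060", "motif": "CAG"}]
    print(json.dumps(to_variant_catalog(shortlist), indent=2))
```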

I hope that helps!

Haloom

geneticsmcgill commented 4 years ago

Thank you both for the help! Do you think it is important to filter on whether controls show an STR there? (i.e. if the control set of ~500 has no STRs in that locus, but 2 cases have an expansion). Would it likely be noise?

Phillip-a-richmond commented 4 years ago

One thing to add here: in our simulation results, a z-score cutoff of ~10 identifies nearly all expansions longer than 150 bp (see attached figure), which corresponds to a ranking in the top 10. The simulation results can be found in the most recent version of our paper on bioRxiv (https://www.biorxiv.org/content/10.1101/863035v2).

See supplemental tables 3, 4 and 7 for simulations of pathogenic events, and supplemental tables 5 and 6 for simulations of nonpathogenic events with potential pathogenicity. Each of these tables lists the z-score for the expanded allele (these values were used to make the plot I show below).

Regarding filtering on control samples: at many pathogenic loci, EHdn picks up no expansions in the controls. So if anything, a lack of expansion in controls combined with presence in the cases is a good sign of a potentially pathogenic expansion.

[Figure: SizevsZscorePval_EHDN_locus]
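For anyone who wants to apply the cutoff mentioned above, a minimal sketch is shown below. The column name `top_case_zscore` and the function name are assumptions, and the defaults (z-score of 10, top 10 loci) simply mirror the numbers in this comment rather than any recommended setting.

```python
def top_candidates(rows, z_cutoff=10.0, top_n=10):
    """Return up to top_n loci with z-score >= z_cutoff, highest z-score first."""
    passing = [r for r in rows if float(r["top_case_zscore"]) >= z_cutoff]
    passing.sort(key=lambda r: float(r["top_case_zscore"]), reverse=True)
    return passing[:top_n]
```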

Phillip-a-richmond commented 4 years ago

@geneticsmcgill, this new documentation may help answer some of your questions regarding thresholds and mechanisms to prioritize the locus output.

https://github.com/Illumina/ExpansionHunterDenovo/blob/master/documentation/09_Repeat_prioritization_strategies.md

Your input on whether this is helpful would be much appreciated.

Best, Phil

statsgirl1 commented 4 years ago

@Phillip-a-richmond Thank you for this! Sorry to jump in. The documentation is definitely helpful. I've been playing around with this program more lately and wonder if there is going to be a more automated way to go from candidate lists to a JSON variant file for ExpansionHunter input?

Additionally, I think that dividing the counts column into case and control could be useful for identifying intermediate-sized (borderline pathogenic) expansions in cases.
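In the meantime, the split can be done outside EHdn. The sketch below assumes the counts field is a comma-separated list of `sample:count` pairs and that the manifest is tab-separated with the sample name, its case/control status and a file path on each line; both layouts are assumptions to check against your own files. Samples missing from the counts field are simply not listed.

```python
def read_manifest(lines):
    """Map sample name -> 'case' or 'control' from tab-separated manifest lines."""
    status = {}
    for line in lines:
        if not line.strip():
            continue
        sample, group, _path = line.rstrip("\n").split("\t")
        status[sample] = group
    return status


def parse_counts(field):
    """Parse 'sampleA:12.5,sampleB:3.1' into {'sampleA': 12.5, 'sampleB': 3.1}."""
    counts = {}
    for item in field.split(","):
        if not item:
            continue
        sample, value = item.rsplit(":", 1)
        counts[sample] = float(value)
    return counts


def split_counts(field, status):
    """Split one counts field into per-group dictionaries of sample -> count."""
    groups = {"case": {}, "control": {}}
    for sample, value in parse_counts(field).items():
        groups[status[sample]][sample] = value
    return groups


def max_control_count(field, status):
    """Largest count seen in any control (0.0 if no control is listed)."""
    return max(split_counts(field, status)["control"].values(), default=0.0)
```

A locus where `max_control_count` is 0.0 but several cases carry counts matches the "absent in controls, present in cases" pattern discussed earlier in this thread.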

egor-dolzhenko commented 4 years ago

@statsgirl1 Thanks for reaching out! We are working on a tool to transform repeat regions output by ExpansionHunter Denovo into input files for ExpansionHunter. Would you be interested in trying out an early version of the tool? If yes, we could send it to you as soon as it's ready.

statsgirl1 commented 4 years ago

Yes for sure, that would be of great interest!

francesca-lucas commented 3 years ago

Thank you for the wonderful tools you have developed! I just stumbled upon this thread, and I had exactly the same question/goal as statsgirl1: I would like to transform the ExpansionHunter Denovo output into variant catalog files for ExpansionHunter. @egor-dolzhenko, would it be possible for me to try out this transformation tool as well?

I don't mind at all if it is not finished, anything you have got so far would be most helpful. :)

egor-dolzhenko commented 3 years ago

@fransje-lucas Thank you for using our tools! And yes, happy to share what we have. Could you please send me an email (edolzhenko@illumina.com)?

francesca-lucas commented 3 years ago

@egor-dolzhenko Wonderful, I have sent you an email. Thank you so much!
