gspirito opened this issue 2 years ago (status: Open)
Hey @gspirito. Thanks for trying EHdn and I hope that you find it useful!
I have been thinking about this question, and your approach of testing for an increased burden of repeat expansions using the motif-based rather than the locus-based analysis sounds reasonable. However, based on what you described, I wouldn't take this as anything more than suggestive evidence. You might also consider trying some other (complementary) approaches, which may provide additional supporting evidence.
One idea would be to run a PCA and see if this sample is an outlier compared to the rest of your cohort. You could convert the motif normalised paired-IRR counts to a matrix to do this. However, if you go down this route, you may be better served running ExpansionHunter with a genome-wide catalog. (@egor-dolzhenko may have some additional thoughts on this.)
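As a sketch of the matrix-building step: the snippet below turns per-sample motif counts into a zero-filled sample-by-motif matrix suitable for PCA. The input layout (`sample -> {motif: normalised count}`) and the toy numbers are assumptions; adapt them to however you export the EHdn motif counts.

```python
# Hedged sketch: build a sample-by-motif matrix from per-sample
# normalised paired-IRR counts, ready to feed into a PCA.

def build_matrix(counts_by_sample):
    """Return (samples, motifs, matrix); missing motifs are zero-filled."""
    samples = sorted(counts_by_sample)
    motifs = sorted({m for c in counts_by_sample.values() for m in c})
    matrix = [[counts_by_sample[s].get(m, 0.0) for m in motifs]
              for s in samples]
    return samples, motifs, matrix

# Toy example with invented counts:
counts = {
    "sample1": {"AAG": 12.5, "CAG": 3.0},
    "sample2": {"CAG": 2.8},
    "sample3": {"AAG": 11.9, "GGC": 1.2},
}
samples, motifs, matrix = build_matrix(counts)
# matrix rows follow `samples`, columns follow `motifs`
```

The resulting `matrix` can then be passed to any PCA implementation (e.g. `sklearn.decomposition.PCA`) to look for outlier samples.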
Hi @mfbennett , thank you for the reply, I will try to do some PCAs with the motif normalised paired-IRR counts.
Regarding the analysis with ExpansionHunter and a catalog I have a few questions:
The default catalog has 31 loci (~/ExpansionHunter/variant_catalog/grch38/variant_catalog.json); is there a way to obtain a bigger catalog with many more loci? For example, can I convert this bed file https://s3.amazonaws.com/gangstr/hg38/genomewide/hg38_ver13.bed.gz to .json and use it as the input catalog?
ExpansionHunter gives me the number of spanning and in-repeat reads for each locus; is there also a way to get the normalized counts?
Thank you
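For what it's worth, a BED-to-catalog conversion along these lines is possible in principle. The sketch below assumes GangSTR-style columns (chrom, start, end, motif length, motif) and the documented ExpansionHunter catalog fields (`LocusId`, `LocusStructure`, `ReferenceRegion`, `VariantType`); coordinate conventions differ between tools (GangSTR BEDs are 1-based, ExpansionHunter `ReferenceRegion`s are 0-based half-open), so verify the offset against a few known loci before relying on it.

```python
# Hedged sketch: convert GangSTR-style BED lines into ExpansionHunter
# variant-catalog JSON entries. Column order and the 1-based -> 0-based
# coordinate shift are assumptions; check them against your file.
import json

def bed_line_to_catalog_entry(line):
    chrom, start, end, _period, motif = line.split()[:5]
    return {
        "LocusId": f"{chrom}_{start}_{motif}",
        "LocusStructure": f"({motif})*",
        # assumed 1-based BED start -> 0-based half-open region
        "ReferenceRegion": f"{chrom}:{int(start) - 1}-{end}",
        "VariantType": "Repeat",
    }

bed = ["chr4 3074877 3074933 3 CAG"]  # invented example line
catalog = [bed_line_to_catalog_entry(l) for l in bed]
print(json.dumps(catalog, indent=2))
```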
Hi @gspirito. You can get a genome-wide STR catalog for ExpansionHunter here: https://github.com/Illumina/RepeatCatalogs/releases/tag/v1.0.0. This catalog contains repeats with similar properties to known pathogenic repeats (polymorphism, complexity of the sequence surrounding the repeat, etc.)
You could normalize the read counts by dividing each count by the locus depth (which ExpansionHunter reports) and then multiplying by the target depth. For example, if the number of in-repeat reads is 20 and the locus depth is 32x, the corresponding count normalized to 40x depth is 20 * 40 / 32 = 25. (Note that this very simplistic normalization procedure is best used when the depths are pretty similar in all the samples.)
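The depth normalization described above is simple enough to express as a one-line helper; this sketch just restates the worked example (20 in-repeat reads at 32x, normalized to 40x).

```python
# Depth normalization as described above: scale an in-repeat read
# count from the observed locus depth to a common target depth.
def normalize_count(irr_count, locus_depth, target_depth=40.0):
    return irr_count * target_depth / locus_depth

print(normalize_count(20, 32))  # 20 * 40 / 32 = 25.0
```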
Hi @mfbennett, thank you for your advice. I followed this and used the norm_num_paired_irrs from the 1kGP samples and my own dataset for PCA to find outliers. However, I found that the PCA separates the datasets from each other rather than capturing real signal. Is it necessary to remove batch effects as part of the PCA? I used norm_num_paired_IRRs; do I still need to normalize by coverage depth as egor-dolzhenko described?
Hi, I wanted to ask some questions about the motif-based outlier analysis.
I have a cohort of 40 individuals (WGS), and I suspect that one of them may have an increased burden of repeat expansions compared to the other samples. Since I am not looking for expansions at specific loci, I did an outlier motif-based analysis, labeling all samples as "case" in the manifest file.
As a result, one sample has 44 repeat motifs with Z-score > 3, while all other samples have between 0 and 5 such motifs. Would it make sense to use this result as suggestive evidence for a generally increased burden of repeat expansions in that sample? What would be a suitable Z-score cutoff value?
Thank you in advance.
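The per-sample tally described in this question (count how many motifs exceed a z-score cutoff in each sample) can be sketched with the standard library alone. The input layout and the toy numbers below are invented; the real EHdn motif table will differ.

```python
# Hedged sketch: per-motif z-scores across samples, then a per-sample
# count of motifs exceeding the cutoff.
from statistics import mean, stdev

def count_outlier_motifs(matrix, samples, cutoff=3.0):
    """matrix[i][j] is the normalised count of motif j in sample i."""
    hits = {s: 0 for s in samples}
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        mu, sd = mean(col), stdev(col)
        if sd == 0:  # constant motif: z-score undefined, skip
            continue
        for i, s in enumerate(samples):
            if (matrix[i][j] - mu) / sd > cutoff:
                hits[s] += 1
    return hits

# Toy cohort: 12 samples, 2 motifs; the last sample carries an
# extreme count for the first motif, the second motif is constant.
samples = [f"s{i}" for i in range(12)]
matrix = [[1.0, 2.0] for _ in range(11)] + [[20.0, 2.0]]
print(count_outlier_motifs(matrix, samples))  # only s11 exceeds z > 3
```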