Illumina / ExpansionHunterDenovo

A suite of tools for detecting expansions of short tandem repeats
Other
78 stars 25 forks source link

Inaccurate copy number #45

Open oleraj opened 2 years ago

oleraj commented 2 years ago

Hi,

We ran the example dataset with StrA and StrB and then compared the results for case-control to the original reads and it appears the copy number estimate is maybe a little underestimated. Perhaps you can provide some insight into what we're seeing?

Here are the results we got from the case-control analysis:

head example_dataset.casecontrol_locus.tsv
contig  start   end motif   pvalue  bonf_pvalue counts
StrA    1723    2123    AGC 0.03854993587177091 0.07709987174354183 sample1:8.461518151815183,sample2:33.28581629341123,sample3:30.048237956676367,sample4:5.298941798941799,sample5:9.576095617529882
StrB    1684    2154    CCG 0.07864960352514261 0.15729920705028522 sample1:22.211485148514853,sample2:22.88399870172022,sample3:31.08438409311348,sample4:29.674074074074078,sample5:20.21620185922975,sample6:21.3226879574185,sample7:20.358141089936474

Here's a screenshot from IGV: image

If you notice for sample3 in the results above it says there are 30 copies of the AGC motif on StrA. However in IGV, the trimmed repeat portion is 130bp (130S20M CIGAR for left read in the pair), which would correspond to > 40 copies. I would assume the tool should be able to use the maximum length from any given read to determine the copy number. Is that not how it works?

Thanks,

Andrew

egor-dolzhenko commented 2 years ago

Hello Andrew,

Thank you for the question. EHdn performs a relatively simple analysis by identifying and counting reads that are fully contained inside the repeat sequence. Although there is a linear relationship between these counts and the repeat sizes, EHdn does not estimate repeat sizes because it does not know if the expansion occurred on one or both alleles and because it does not know the exact (single-nucleotide resolution) reference coordinates of the repeat.

To summarize, EHdn just determines approximate coordinates of long repeats and produces a rough measure of their length. This information can be used to detect the presence of an expansion but not the exact repeat genotype. In the highlighted example, EHdn found ~30 reads consisting of the AGC motif on StrA.

We are starting to work on the next version of EHdn that will generate much more intuitive results (including the exact repeat sizes).

Relatedly, we recently released a genome-wide repeat catalog compatible with the latest version of (regular) ExpansionHunter. We believe that this catalog contains a good set of targets for studies aimed at discovering new pathogenic / functionally important repeats (though we are still working on a document explaining why we think so).

Please let me know if anything about my answer is unclear or if you have any other questions. If you'd like to discuss the specific dataset you are working with, please feel free to reach out directly by email.

Best wishes, Egor