Illumina / ExpansionHunter

A tool for estimating repeat sizes

Exome Sequencing - Accuracy issues? #144

Open wsproviero opened 3 years ago

wsproviero commented 3 years ago

Hello Igor,

I am currently extracting STRs from the 1000 Genomes WGS data and from a WES dataset. When comparing the distributions of the catalog STRs between the two datasets, I can see some discrepancies. I understand the obvious limitation of the shorter WES reads (76 bp) compared with WGS (100 bp). Should I tweak any parameters when extracting STRs from WES?

Thank you in advance.

With Kind Regards, William

egor-dolzhenko commented 3 years ago

Hello William,

Great question. It is quite a bit harder to genotype STRs in WES than in PCR-free WGS because of the less even read coverage and the amplification biases inherent to WES. To maintain good accuracy, ExpansionHunter (EH) requires that the repeat region plus 1 kb flanks on both sides are sequenced to a relatively even coverage. In WES, some repeat regions might be only partially covered by reads, or the interior of the repeat may be amplified less well due to GC bias. This is why EH officially supports only PCR-free WGS.

This said, I know of multiple projects that obtained useful results from WES data. This usually required extra benchmarking to delineate which repeats can be accurately called from WES.
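As a rough illustration of the kind of per-locus check such benchmarking might start from, the sketch below summarizes how even the coverage is across a repeat plus its 1 kb flanks. It is not part of ExpansionHunter: the function `coverage_evenness`, its parameters, and the toy depth profiles are all hypothetical, and the per-base depths are assumed to come from some external tool.

```python
from statistics import mean, pstdev


def coverage_evenness(depths, repeat_start, repeat_end, flank=1000):
    """Summarize coverage over a repeat plus `flank` bases on each side.

    `depths` is a list of per-base read depths indexed by 0-based position,
    and [repeat_start, repeat_end) is the repeat interval (hypothetical inputs).
    Returns (mean depth, coefficient of variation, fraction of uncovered bases).
    """
    lo = max(0, repeat_start - flank)
    hi = min(len(depths), repeat_end + flank)
    window = depths[lo:hi]
    if not window:
        raise ValueError("repeat window lies outside the depth array")
    mu = mean(window)
    # Coefficient of variation: 0 means perfectly even coverage.
    cv = pstdev(window) / mu if mu > 0 else float("inf")
    uncovered = sum(1 for d in window if d == 0) / len(window)
    return mu, cv, uncovered


# Toy profiles (made up for illustration): a 60 bp repeat at positions 1000-1060.
wgs = [30] * 2060                              # flat, WGS-like coverage
wes = [0] * 900 + [100] * 260 + [0] * 900      # only a capture target is covered

wgs_stats = coverage_evenness(wgs, 1000, 1060)
wes_stats = coverage_evenness(wes, 1000, 1060)
print("WGS-like:", wgs_stats)
print("WES-like:", wes_stats)
```

In the WES-like profile the repeat itself is deeply covered, yet most of the flanks have no reads at all, which is the situation described above. Any thresholds for deciding which loci to trust would have to come from benchmarking on real data.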

Did I answer your question? Please let me know if you have any follow up questions or comments.

Best wishes, Egor

jingydz commented 1 year ago

“It is quite a bit harder to genotype STRs in WES than in PCR-free WGS because of the less even read coverage / amplification biases inherent to WES.” Why is it more difficult to detect STRs in WES compared to WGS?

"Read coverage" refers to the number of times each nucleotide is read during sequencing. Therefore, a higher read coverage means that there are more sequencing reads covering each nucleotide. "Amplification bias" refers to the phenomenon where certain DNA fragments are amplified more than others during PCR amplification, resulting in relatively more reads of these fragments appearing in sequencing.

The ratio of read coverage to amplification biases is commonly referred to as "evenness".
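For concreteness, per-base read coverage as defined above can be tallied from alignment intervals. This is a minimal sketch that assumes each read is given as a hypothetical half-open `(start, end)` interval on the reference; real pipelines would read these from a BAM/CRAM file with a dedicated tool.

```python
def per_base_depth(reads, region_start, region_end):
    """Count how many reads cover each position in [region_start, region_end).

    `reads` is an iterable of (start, end) half-open alignment intervals
    (a simplifying assumption: no gaps, clipping, or quality filters).
    """
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        # Clip each read to the region before counting.
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


# Two overlapping reads: positions 3 and 4 are covered twice.
example = per_base_depth([(0, 5), (3, 8)], 0, 8)
print(example)
```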

"In general, WES has higher evenness than WGS because WES only sequences the exonic regions. Compared to the whole genome, exonic regions are more concentrated and less prone to technical biases such as repeat amplification. Therefore, a more balanced sequencing coverage can be obtained. On the other hand, WGS needs to cover the entire genome and therefore faces more technical challenges such as repetitive sequences, GC bias, low complexity regions etc., often requiring deeper sequencing coverage for higher evenness."

The above is the information I found. In summary, the evenness of WES should be higher than that of WGS. What I don't understand, therefore, is why it is more difficult to detect STRs in WES than in WGS.