Illumina / ExpansionHunter

A tool for estimating repeat sizes

Cannot process offtarget mates for locus_n because repeat unit is not set #119

Open mvcakir opened 3 years ago

mvcakir commented 3 years ago

Hello,

I have an issue with off-target regions. At some point the program stops and throws the error in the title. But when I check the catalog.json file, the entry looks fine:

    {
        "ReferenceRegion": "chr8:16007595-16007646",
        "VariantType": "Repeat",
        "LocusStructure": "(TATC)*",
        "LocusId": "locus_n",
        "OfftargetRegions": [
            "chr14:27990612-27990623",
            "chr7:54347739-54348473",
            "chrX:77460138-77460149",
            "chr7:67149722-67149781",
            "chr8:20573715-20573769"
        ]
    }

I got the error for another entry as well:

    {
        "ReferenceRegion": "chr8:16001967-16002012",
        "VariantType": "Repeat",
        "LocusStructure": "(AC)*",
        "LocusId": "locus_m",
        "OfftargetRegions": [
            "chr1:179020154-179020196",
            "chrX:12681696-12681747",
            "chr14:89416169-89416251",
            "chr21:37817465-37818005",
            "chr2:878549-878569"
        ]
    }

When I remove the off-target region "chr21:37817465-37818005", it works somehow. I'm not sure what the problem with that region is, though.

Thanks a lot

Best,

Volkan

egor-dolzhenko commented 3 years ago

Hi Volkan. Thanks for raising the issue. Could you please check if changing the variant type from "Repeat" to "RareRepeat" resolves the issue?

mvcakir commented 3 years ago

Dear Egor,

Yes, it solves the issue. What is the reason, and how can I avoid this error?

Best,

Volkan

egor-dolzhenko commented 3 years ago

Great question, Volkan. Off-target regions are only allowed for "RareRepeat" variants: EH uses in-repeat read pairs only for "RareRepeat" variants, and the off-target regions correspond to the common locations where these reads misalign. We need to improve input validation to warn the user about this.
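Until such validation exists, a catalog can be checked before running EH. The following is a minimal sketch, not part of ExpansionHunter itself: the function name `find_offtarget_conflicts` is hypothetical, and it assumes the catalog is a JSON array of locus objects like the entries posted above, where "VariantType" is either a single string or a list of strings. The second sample entry is locus_m with its type already changed to "RareRepeat", and both off-target lists are shortened.

```python
import json


def find_offtarget_conflicts(catalog):
    """Return LocusIds that list OfftargetRegions but have no RareRepeat variant."""
    conflicts = []
    for entry in catalog:
        variant_types = entry.get("VariantType", [])
        # Assumption: VariantType may be one string or a list of strings.
        if isinstance(variant_types, str):
            variant_types = [variant_types]
        has_offtargets = bool(entry.get("OfftargetRegions"))
        if has_offtargets and "RareRepeat" not in variant_types:
            conflicts.append(entry.get("LocusId", "<missing LocusId>"))
    return conflicts


# Sample data adapted from the entries in this thread (off-target lists shortened).
catalog = json.loads("""
[
  {"ReferenceRegion": "chr8:16007595-16007646", "VariantType": "Repeat",
   "LocusStructure": "(TATC)*", "LocusId": "locus_n",
   "OfftargetRegions": ["chr14:27990612-27990623", "chr7:54347739-54348473"]},
  {"ReferenceRegion": "chr8:16001967-16002012", "VariantType": "RareRepeat",
   "LocusStructure": "(AC)*", "LocusId": "locus_m",
   "OfftargetRegions": ["chr1:179020154-179020196", "chrX:12681696-12681747"]}
]
""")

for locus_id in find_offtarget_conflicts(catalog):
    print(f"{locus_id}: OfftargetRegions set but no RareRepeat variant")
```

Here only locus_n is reported, because it combines "Repeat" with a non-empty "OfftargetRegions" list.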

Can I ask what procedure you are using to define off target regions?

mvcakir commented 3 years ago

Ah OK, now it is clear. Apparently I was a bit confused about how to define my own regions. I'm building a dictionary of repeats across the whole genome and defining off-target regions wherever similar stretches are identified elsewhere. But I think I must take more care with the length of the stretch, the number of repeats, etc. I'm just trying to automate a pipeline of several tools.

Thanks a lot for your help.

Best,

Volkan

egor-dolzhenko commented 3 years ago

This sounds good! I think it might be better not to define off-target regions for most/all repeats in your catalog (and so set the variant types of all repeats to "Repeat" instead of "RareRepeat"). EH would extract misaligned reads automatically as long as one mate is aligned close to the repeat.
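For an automated pipeline, that suggestion could be applied to a whole catalog in one pass. This is only a sketch under assumptions: `drop_offtargets` is a hypothetical helper, not an ExpansionHunter function, and the catalog is assumed to be a list of locus dictionaries (as loaded from catalog.json) whose "VariantType" is a string or a list of strings.

```python
import copy


def drop_offtargets(catalog):
    """Return a copy of the catalog with no OfftargetRegions and RareRepeat set to Repeat."""
    cleaned = copy.deepcopy(catalog)  # leave the caller's catalog untouched
    for entry in cleaned:
        entry.pop("OfftargetRegions", None)
        variant_type = entry.get("VariantType")
        if isinstance(variant_type, str):
            if variant_type == "RareRepeat":
                entry["VariantType"] = "Repeat"
        elif isinstance(variant_type, list):
            # Assumption: multi-variant loci store one type per variant.
            entry["VariantType"] = [
                "Repeat" if t == "RareRepeat" else t for t in variant_type
            ]
    return cleaned


# Example entry based on locus_m from this thread (off-target list shortened).
entries = [
    {
        "ReferenceRegion": "chr8:16001967-16002012",
        "VariantType": "RareRepeat",
        "LocusStructure": "(AC)*",
        "LocusId": "locus_m",
        "OfftargetRegions": ["chr1:179020154-179020196", "chr2:878549-878569"],
    }
]

cleaned = drop_offtargets(entries)
print(cleaned[0]["VariantType"], "OfftargetRegions" in cleaned[0])
```

Loci that genuinely need off-target regions would have to be excluded from this pass and kept as "RareRepeat".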