MariaNattestad / Assemblytics

Assemblytics is a bioinformatics tool to detect and analyze structural variants from a genome assembly by comparing it to a reference genome.
http://assemblytics.com
MIT License

Maximum variant size #27

Closed: MehmetGoktay closed this issue 4 years ago

MehmetGoktay commented 4 years ago

Hi Maria,

Is there a way to set the maximum variant size to more than 100k?

Apparently this is the limit on the Assemblytics web server.

Best, Mehmet

MariaNattestad commented 4 years ago

No, this is not possible on the web version, but you are welcome to run it locally and modify the code for your purposes.

The limits were set as described in the paper: "Assemblytics identifies all insertion and deletion variants as small as 1 bp up to a maximum of 10 kbp in size, with this maximum adjusted to match the size of the unique sequence anchor. This prevents translocations and complex variants from being interpreted as indels." You can read the rest for context at https://academic.oup.com/bioinformatics/article/32/19/3021/2196631. See the supplement in particular, as it explains what the unique sequence anchor is and why it is important.

Best of luck with your research, Maria

bmansfeld commented 4 years ago

Hi Maria, Thanks for maintaining Assemblytics. It's a great piece of intuitive software! I do have one question about the maximum size that isn't clear to me after reading the supplemental note. If I increase both the "Unique sequence length required" and the "Maximum variant size" to greater than 10 kb, will the results still be accurate? Or is 10 kb a safety net to minimize incorrect calls caused by repeats? Basically, I anticipate seeing SVs much larger than 10 kb in my comparison. Is Assemblytics still appropriate to use after tweaking both parameters? Are there any different nucmer parameters that you would recommend for this? Thanks, Ben

MariaNattestad commented 4 years ago

Hi @bmansfeld

Above 10 kb it will still be doing the same analysis of the MUMmer alignments, but yes, I put those limits in place to avoid calling variants above a size where I am no longer sure the interpretation holds. For instance, very large variants are increasingly likely to be caused by misassemblies. Assemblytics does not call variants other than indels and repeat expansions/contractions, so translocations or inversions might be called something else once you go beyond the safety net. It is a safety net you are welcome to go around, but I recommend you take a closer look at the alignments that produced those variants to see if the interpretation makes sense. I made this visualization tool https://dot.sandbox.bio/ that might be helpful for that analysis.
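
For anyone raising the limit on a local run, one way to act on this advice is to separate the calls above the default 10 kb limit so they can be checked against the alignments by hand. The following is only a minimal sketch, not part of Assemblytics: the function name `split_by_size` is hypothetical, and the column positions assume a tab-separated, BED-like variant table with the size in the fifth column and the variant type in the seventh. Check those positions against your own output file before using it.

```python
import csv
import io

# Assumed column layout (0-based) of a BED-like variant table.
# This is an assumption for illustration; verify it against your own file.
SIZE_COL = 4
TYPE_COL = 6


def split_by_size(lines, max_size=10_000):
    """Partition variant rows into (within_limit, needs_review) by size."""
    within_limit, needs_review = [], []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        size = int(row[SIZE_COL])
        target = within_limit if size <= max_size else needs_review
        target.append(row)
    return within_limit, needs_review


# Made-up example rows, not real results.
example = io.StringIO(
    "#reference\tref_start\tref_stop\tID\tsize\tstrand\ttype\n"
    "chr1\t1000\t1500\tvar_1\t500\t+\tDeletion\n"
    "chr1\t90000\t140000\tvar_2\t50000\t+\tInsertion\n"
)
ok, review = split_by_size(example)
print(len(ok), "within limit;", len(review), "to inspect by hand")
for row in review:
    # Coordinates of each large call, to look up in an alignment viewer.
    print(row[0], row[1], row[2], row[TYPE_COL], row[SIZE_COL])
```

The rows in `review` are the ones whose coordinates are worth looking up in a dot plot before trusting the call.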

bmansfeld commented 4 years ago

Thanks for clearing that up, Maria. I'll take a look at Dot. Stay safe! -Ben