MariaNattestad / Assemblytics

Assemblytics is a bioinformatics tool to detect and analyze structural variants from a genome assembly by comparing it to a reference genome.
http://assemblytics.com
MIT License

Variants with the same coordinates #41

Closed kyleLesack closed 3 years ago

kyleLesack commented 3 years ago

Hi,

First of all, thank you for this great tool! I have been using it to compare some assembled C. elegans strains against the C. elegans reference.

I have noticed that some of my bed files contain entries with identical coordinates. For example:

grep "Assemblytics_b_2020" mydata.Assemblytics_structural_variants.bed
II 14564319 14564430 Assemblytics_b_2020 111 + Deletion 111 0 tig00000041:264546-264546:+ between_alignments

grep "Assemblytics_b_2057" mydata.Assemblytics_structural_variants.bed
II 14564319 14564430 Assemblytics_b_2057 111 + Deletion 111 0 tig00000044:11348-11348:+ between_alignments

I assume that this comes from my assemblies and the alignments used to call the variants, since the query coordinates differ between the two entries. I could see how a large duplicated segment in a genome assembly could contain the same deletion. However, I wouldn't want it to be counted twice, as that could bias variant counts.

Is this something that you are aware of? Am I correct that this is an artifact that could be collapsed into a single deletion?

MariaNattestad commented 3 years ago

Yes, each contig in the assembly is treated separately, and Assemblytics doesn’t do any deduplication, so that is something to be aware of if your contigs might contain duplicates.

kyleLesack commented 3 years ago

Thanks for the quick response. That makes sense.

BTW - these calls can be flagged pretty easily using the bedtools cluster command in case anyone has a similar concern.
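
For anyone who wants to check this without bedtools, here is a minimal Python sketch that flags calls sharing the same reference coordinates and variant type. It is not how Assemblytics or bedtools works internally, and it only catches exact coordinate matches, whereas bedtools cluster also groups overlapping features. The column positions (type in the 7th column, ID in the 4th) are an assumption inferred from the two example lines earlier in this thread, and the function names are made up for illustration.

```python
from collections import defaultdict

# The two example records from this thread (whitespace-separated columns).
BED_TEXT = """\
II 14564319 14564430 Assemblytics_b_2020 111 + Deletion 111 0 tig00000041:264546-264546:+ between_alignments
II 14564319 14564430 Assemblytics_b_2057 111 + Deletion 111 0 tig00000044:11348-11348:+ between_alignments
"""


def group_identical_calls(lines):
    """Group variant IDs by (chrom, start, end, variant type).

    Assumed column layout (inferred from the example, not from any spec):
    0 = chrom, 1 = start, 2 = end, 3 = variant ID, 6 = variant type.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, header lines, and anything without numeric coordinates.
        if len(fields) < 7 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        key = (fields[0], int(fields[1]), int(fields[2]), fields[6])
        groups[key].append(fields[3])
    return groups


def duplicated_calls(groups):
    """Keep only the loci that were reported more than once."""
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


groups = group_identical_calls(BED_TEXT.splitlines())
duplicates = duplicated_calls(groups)

for key, ids in duplicates.items():
    print(key, ids)

total_calls = sum(len(ids) for ids in groups.values())
print("total calls:", total_calls, "unique loci:", len(groups))
```

To run it on a real file, pass an open file handle instead of `BED_TEXT.splitlines()`. Counting `len(groups)` rather than the number of lines gives a variant count in which identical calls from different contigs are collapsed into one.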