Closed kyleLesack closed 3 years ago
Yes, each contig in the assembly is treated separately, and Assemblytics doesn’t do any deduplication, so that is something to be aware of if your contigs might contain duplicated sequence.
On Fri, Jul 16, 2021 at 7:42 PM kyleLesack @.***> wrote:
Hi,
First of all, thank you for this great tool! I have been using it to compare some assembled C. elegans strains against the C. elegans reference.
I have noticed that some of my bed files contain entries with identical coordinates. For example:
grep "Assemblytics_b_2020" mydata.Assemblytics_structural_variants.bed
II 14564319 14564430 Assemblytics_b_2020 111 + Deletion 111 0 tig00000041:264546-264546:+ between_alignments
grep "Assemblytics_b_2057" mydata.Assemblytics_structural_variants.bed
II 14564319 14564430 Assemblytics_b_2057 111 + Deletion 111 0 tig00000044:11348-11348:+ between_alignments
I assume that this has something to do with my assemblies and the alignment data used to call the variants for them as the query_coordinates differ. I could see how a large duplicated segment in a genome assembly could contain the same deletion. However, I wouldn't want it to be counted twice, as that could bias variant counts.
Is this something that you are aware of? Am I correct that this is an artifact that could be collapsed into a single deletion?
Thanks for the quick response. That makes sense.
BTW - these calls can be flagged pretty easily using the bedtools cluster command in case anyone has a similar concern.
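For anyone who wants to check this without bedtools, here is a minimal Python sketch of the same idea. It is stricter than `bedtools cluster`, which groups overlapping intervals: it only flags calls whose reference coordinates and variant type match exactly. The function name `find_duplicate_calls` is made up for illustration, and the column layout (chrom, start, end, ID, size, strand, type, ...) is assumed from the two example lines in this thread.

```python
from collections import defaultdict


def find_duplicate_calls(bed_lines):
    """Group variant calls that share reference coordinates and type.

    Returns a dict mapping (chrom, start, end, variant_type) to the list of
    variant IDs, keeping only the keys that occur more than once.
    """
    groups = defaultdict(list)
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, comment lines, and anything too short to hold the type column.
        if len(fields) < 7 or fields[0].startswith("#"):
            continue
        # Skip a header row whose coordinate columns are not numeric.
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        # Assumed layout: chrom, start, end, ID, size, strand, type, ...
        key = (fields[0], int(fields[1]), int(fields[2]), fields[6])
        groups[key].append(fields[3])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# The two calls quoted in this thread.
example_lines = [
    "II 14564319 14564430 Assemblytics_b_2020 111 + Deletion 111 0 tig00000041:264546-264546:+ between_alignments",
    "II 14564319 14564430 Assemblytics_b_2057 111 + Deletion 111 0 tig00000044:11348-11348:+ between_alignments",
]

for key, ids in find_duplicate_calls(example_lines).items():
    print(key, ids)
# prints: ('II', 14564319, 14564430, 'Deletion') ['Assemblytics_b_2020', 'Assemblytics_b_2057']
```

To run it on a real file, pass the open file handle, e.g. `find_duplicate_calls(open("mydata.Assemblytics_structural_variants.bed"))`. Whether a flagged pair should be collapsed into one call is still a judgment call about the assembly, as discussed above.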