MariaNattestad / Assemblytics

Assemblytics is a bioinformatics tool to detect and analyze structural variants from a genome assembly by comparing it to a reference genome.
http://assemblytics.com
MIT License

Difference between the 'within_alignment' and the 'between_alignment' methods #57

Open LeoVincenzi opened 8 months ago

LeoVincenzi commented 8 months ago

Dear authors, I'm writing to better understand the difference between the 'within_alignment' and 'between_alignment' methods applied in Assemblytics, especially in the case of deletions. From what I have seen looking at an alignment between two assembled genomes, a 'within_alignment' deletion falls in a region where one sequence is well aligned over the other. The 'between_alignment' deletions, on the other hand, always fall in a region where the aligned sequence shows a deletion, a soft-clipped region, or a gap. So my question is: why should we consider the first method correct for deletion detection?

Thank you, Leo

MariaNattestad commented 8 months ago

The within-alignment deletions are equivalent to what you would see in the CIGAR string of a BAM file, so they are the easiest to detect since they happen within a single alignment. For example, a small deletion like this:

ACATGCTGATCG
ACATG--GATCG

(This is a toy example where the deletion and sequences are both much smaller than you would detect with Assemblytics. In this example, two bases are deleted, which was detected within an alignment.)

If you want to visualize a within-alignment deletion in your own dataset, MUMmer's show-aligns command will let you do that.
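
To make the CIGAR analogy concrete, here is a minimal Python sketch that turns the toy alignment above into CIGAR-style operations and lists the deletions it contains. This is only an illustration and not Assemblytics code: the function names are hypothetical, and matches and mismatches are both collapsed into `M`, as in a basic CIGAR string.

```python
from itertools import groupby


def column_op(ref_base, qry_base):
    """Classify one alignment column from the reference's point of view."""
    if qry_base == "-":
        return "D"  # base present in the reference, missing in the query
    if ref_base == "-":
        return "I"  # base present in the query, missing in the reference
    return "M"      # aligned column (match or mismatch)


def alignment_to_cigar(ref_row, qry_row):
    """Collapse a gapped pairwise alignment into (operation, length) runs."""
    if len(ref_row) != len(qry_row):
        raise ValueError("alignment rows must have the same length")
    ops = [column_op(r, q) for r, q in zip(ref_row, qry_row)]
    return [(op, len(list(run))) for op, run in groupby(ops)]


def deletions(cigar):
    """Return (0-based reference offset, length) for every D run."""
    ref_pos = 0
    found = []
    for op, length in cigar:
        if op == "D":
            found.append((ref_pos, length))
        if op in ("M", "D"):  # only these operations consume reference bases
            ref_pos += length
    return found


cigar = alignment_to_cigar("ACATGCTGATCG", "ACATG--GATCG")
cigar_string = "".join(f"{length}{op}" for op, length in cigar)
print(cigar_string)      # 5M2D5M
print(deletions(cigar))  # [(5, 2)]
```

The two deleted bases show up as a single `2D` run inside one alignment, which is what "within-alignment" means here.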

LeoVincenzi commented 8 months ago

Okay, I will also try with MUMmer. As an alternative, I tried to visualize the alignments (BAM file) with the Integrative Genomics Viewer (IGV). What I have seen is that sometimes, regardless of the 'method' assigned, the Assemblytics BED file reports deletions in regions where none seems to be present. Below I show first an example of a 'within_alignment' deletion and then a 'between_alignment' deletion.

[image: 'within_alignment' deletion example]

[image: 'between_alignment' deletion example]

I also checked the CIGAR string of these regions but they don't show any D. How could this be explained?

MariaNattestad commented 8 months ago

Does the BAM file also come from MUMmer? Assemblytics simply looks in the delta file for these within-alignment deletions, so if the alignments you show in IGV differ from those in the delta file, that is very likely the source of the discrepancy. That's why I suggest using MUMmer's show-aligns to view this part of the delta file, so you can see the deletion there directly. The between-alignment deletion in the second example looks like it falls between the alignments, so if those alignments are of the same contig, then that would make sense and be exactly what we expect to see.
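
The between-alignment case can be sketched in Python as well. This is a deliberately simplified model and not the actual Assemblytics implementation: the `Alignment` class, the function name, and the coordinates are all hypothetical. It assumes two consecutive, same-orientation alignments, and ignores overlaps and any finer classification the real tool applies. The idea is only that when two alignments of the same contig are further apart on the reference than on the contig, the extra reference sequence looks like a deletion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One alignment block (hypothetical, simplified record)."""
    contig: str
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int


def between_alignment_event(first, second):
    """Compare the reference gap and the contig gap between two alignments.

    Returns ("deletion", size), ("insertion", size), or None when the
    alignments are on different contigs or the two gaps are equal.
    """
    if first.contig != second.contig:
        return None  # only alignments of the same contig are comparable
    ref_gap = second.ref_start - first.ref_end
    qry_gap = second.qry_start - first.qry_end
    size = ref_gap - qry_gap
    if size > 0:
        return ("deletion", size)    # reference has sequence the contig lacks
    if size < 0:
        return ("insertion", -size)  # contig has sequence the reference lacks
    return None


# Made-up coordinates: 500 reference bases sit between the two alignments,
# while the alignments are directly adjacent on the contig.
a = Alignment("contig_1", 1000, 5000, 1, 4001)
b = Alignment("contig_1", 5500, 9000, 4001, 7501)
c = Alignment("contig_2", 5500, 9000, 4001, 7501)

print(between_alignment_event(a, b))  # ('deletion', 500)
print(between_alignment_event(a, c))  # None
```

In a viewer, such an event appears as a gap or clipped region between two aligned blocks rather than as a `D` in any CIGAR string, which is consistent with the second screenshot.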

As a general note though, Assemblytics was made 10 years ago when there were no other variant callers for assemblies. Today you have more options for assembly-based variant callers, so I recommend you read some papers to find the newer variant callers and see if one of those works better for your purposes. Genome assembly technologies have also changed in 10 years, so I just wouldn't expect Assemblytics to be the best method anymore.