Explanation for 'other-vector complex'

dlbowie0 commented 2 days ago

Is it possible to have a clearer definition of the 'other-vector complex' classification? Below is a schematic I created of what I think a complex read is. Red bar -> forward strand with a subtype of full; orange bar -> reverse strand with a subtype of right-partial.

The Python script also mentions whether the primary and supplementary alignments diverge for the complex classification: https://github.com/formbio/laava/blob/79eaf5475414d823e9655bc6d870733b1410c4de/src/summarize_alignment.py#L601-L603

Could you explain more in depth this concept?

etal commented 2 days ago

Yes, your figure is correct. The "other-vector complex" subtype is a collection of asymmetric, multi-part (usually double-stranded) alignments, meaning the primary and supplementary alignments of a read would have been assigned different subtypes individually, and therefore don't cleanly fall into any of the other categories when they appear in the same read.

Here's the design doc with full definitions: https://github.com/formbio/laava/wiki/Design-and-definitions-(v3.x-releases)#aav-subtypes

The individual per-alignment classifications relative to the annotated vector target region are still available in the "*.alignments.tsv.gz" file under the "map_target_overlap" column: https://github.com/formbio/laava/wiki/Design-and-definitions-(v3.x-releases)#alignment-metrics-sampleidalignmentstsvgz
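As a rough illustration of pulling those per-alignment values out of the TSV, here is a minimal sketch that groups "map_target_overlap" by read. Only the "map_target_overlap" column name comes from the design doc; the read identifier column name ("read_id"), the helper name, and the demo rows are assumptions made up for this example, so check the header of your own file first.

```python
import csv
import io
from collections import defaultdict


def overlaps_by_read(tsv_lines, read_id_column="read_id"):
    """Group 'map_target_overlap' values by read.

    `read_id_column` is a hypothetical column name; pass whatever
    the header of your alignments TSV actually uses.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(tsv_lines, delimiter="\t"):
        grouped[row[read_id_column]].append(row["map_target_overlap"])
    return dict(grouped)


# Made-up rows standing in for a real alignments table.
demo = io.StringIO(
    "read_id\tmap_target_overlap\n"
    "read1\tfull\n"
    "read1\tright-partial\n"
    "read2\tfull\n"
)
per_read = overlaps_by_read(demo)
```

For a real gzipped file, the same function should accept a handle opened with `gzip.open(path, "rt")` from the standard library.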

In the previous 2.x releases of LAAVA, these reads would have been reported with a subtype consisting of the two "map_target_overlap" values joined by a pipe character, e.g. full|left-partial (or right-partial|partial, etc.). Users did not find these granular classifications helpful in the HTML/PDF report and would usually ignore those rows and/or recalculate what they wanted from the BAM file. Now in LAAVA 3.x, the granular information is relegated to the TSV outputs, and the report rolls up the asymmetrical subtype zoo into a single complex subtype.
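The change in reporting can be sketched roughly as follows. This is not LAAVA's code: both function names are hypothetical, and the branch for matching values is only a placeholder, since the real rules for symmetric reads are in the design doc.

```python
def legacy_subtype(primary_overlap, supp_overlap):
    """2.x-style label: the two map_target_overlap values joined by a pipe."""
    return f"{primary_overlap}|{supp_overlap}"


def rolled_up_subtype(primary_overlap, supp_overlap):
    """3.x-style idea: any asymmetric pair collapses into 'complex'."""
    if primary_overlap != supp_overlap:
        return "complex"
    # Placeholder only: symmetric pairs follow other rules in LAAVA.
    return primary_overlap


old_label = legacy_subtype("full", "left-partial")
new_label = rolled_up_subtype("full", "left-partial")
```

Under these assumptions, every distinct 2.x pair such as right-partial|partial ends up under the same complex label in the report, while the individual values stay recoverable from the TSV outputs.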

dlbowie0 commented 1 day ago

What about the cases where the read was assigned only one subtype and does not have a supplementary alignment? Here is a screenshot from an analysis that I ran. The table information is from the SampleID.per_read.tsv.gz file:

etal commented 22 hours ago

That classification is probably coming from the code block you highlighted above. In the code, for the sake of minimizing differences in reported numbers versus previous versions, there is a difference between supp and supps:

The supplementary alignments of a read are in the list supps, just like they appear in the BAM file and alignments.tsv.gz.
Those supplementary alignments are filtered to look for one that is a close reverse-complement to the primary alignment. If one is found with >=80% overlap in the reverse direction relative to the primary alignment, then that supplementary alignment is stored in supp.

Therefore, there may be short and/or asymmetrical supplementary alignments that did not meet the criteria for storing anything in supp. You'll see those alignments in alignments.tsv.gz, but the has_supp column of per_read.tsv.gz will still be "N" for that read.
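The distinction between supps and supp described above can be sketched like this. It is a simplified illustration, not the code in summarize_alignment.py: the Aln class, the helper names, and the choice to measure overlap as a fraction of the primary alignment's length are all assumptions. Only the 80% threshold and the opposite-strand requirement come from the explanation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Aln:
    """Hypothetical minimal alignment record (half-open coordinates)."""
    start: int
    end: int
    strand: str  # "+" or "-"


def overlap_fraction(primary: Aln, other: Aln) -> float:
    """Shared reference span as a fraction of the primary's length (assumed metric)."""
    shared = min(primary.end, other.end) - max(primary.start, other.start)
    length = primary.end - primary.start
    return max(shared, 0) / length if length > 0 else 0.0


def pick_supp(primary: Aln, supps: List[Aln], min_overlap: float = 0.8) -> Optional[Aln]:
    """Return the first opposite-strand supplementary alignment that overlaps enough."""
    for aln in supps:
        if aln.strand != primary.strand and overlap_fraction(primary, aln) >= min_overlap:
            return aln
    return None


primary = Aln(0, 1000, "+")
# One short reverse-strand piece and one same-strand piece: neither qualifies.
supps = [Aln(0, 300, "-"), Aln(700, 1000, "+")]
supp = pick_supp(primary, supps)
has_supp = "Y" if supp is not None else "N"
```

In this example supps still holds two alignments, as alignments.tsv.gz would show, but supp is empty, so the read would be reported with has_supp set to "N".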

I'm open to the possibility that this special handling of supplementary alignments is a misfeature and should be changed in a future version of LAAVA.

formbio / laava

Explanation for 'other-vector complex' #66