Open dlbowie0 opened 2 days ago
Yes, your figure is correct. The "other-vector complex" subtype is a collection of asymmetric, multi-part (usually double-stranded) alignments, meaning the primary and supplementary alignments of a read would have been assigned different subtypes individually, and therefore don't fall cleanly into any of the other categories when they appear in the same read.
Here's the design doc with full definitions: https://github.com/formbio/laava/wiki/Design-and-definitions-(v3.x-releases)#aav-subtypes
The individual per-alignment classifications relative to the annotated vector target region are still available in the "*.alignments.tsv.gz" file under the "map_target_overlap" column: https://github.com/formbio/laava/wiki/Design-and-definitions-(v3.x-releases)#alignment-metrics-sampleidalignmentstsvgz
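As a rough illustration of pulling those per-alignment classifications back out, here is a minimal sketch that tallies the "map_target_overlap" values in an alignments file. It assumes the file is a gzipped, tab-delimited table with a header row; the function name is made up for this example and is not part of LAAVA.

```python
import csv
import gzip
from collections import Counter


def count_target_overlaps(path, column="map_target_overlap"):
    """Tally the per-alignment classifications in a gzipped TSV.

    Assumes a tab-delimited file with a header row that contains `column`.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[column] for row in reader)
```

Calling `count_target_overlaps("SampleID.alignments.tsv.gz")` would return a `Counter` keyed by classification, which can be compared against the rolled-up numbers in the report.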
In the previous 2.x releases of LAAVA, these reads would have been reported with a subtype consisting of the two "map_target_overlap" values joined by a pipe character, e.g. "full|left-partial" (or "right-partial|partial", etc.). Users did not find these granular classifications helpful in the HTML/PDF report and would usually ignore those rows and/or recalculate what they wanted from the BAM file. Now, in LAAVA 3.x, the granular information is relegated to the TSV outputs and the report rolls up the asymmetrical subtype zoo into a single "complex" subtype.
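To make the difference between the two reporting styles concrete, here is a deliberately simplified sketch. It is not LAAVA's actual classification code: both function names are hypothetical, and returning the shared value when the two alignments agree is an assumption made only for this illustration.

```python
def legacy_subtype(primary_overlap, supp_overlap):
    """2.x-style label: the two map_target_overlap values joined by a pipe."""
    return f"{primary_overlap}|{supp_overlap}"


def rolled_up_subtype(primary_overlap, supp_overlap):
    """Simplified 3.x-style roll-up (illustration only, not LAAVA's code).

    Asymmetric pairs collapse into "complex"; symmetric pairs keep their
    shared value, which is an assumption of this sketch.
    """
    if primary_overlap != supp_overlap:
        return "complex"
    return primary_overlap
```

For example, a read whose two alignments are "full" and "left-partial" would get the label "full|left-partial" from the first function and "complex" from the second.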
What about the cases where the read was assigned only one subtype and does not have a supplementary alignment? Here is a screenshot from an analysis that I ran. The table is from the SampleID.per_read.tsv.gz file:
That classification is probably coming from the code block you highlighted above. In the code, for the sake of minimizing differences in reported numbers versus previous versions, there is a difference between "supp" and "supps": "supps" holds the supplementary alignments just like they appear in the BAM file and alignments.tsv.gz, while "supp" is only filled in when a supplementary alignment meets certain criteria. Therefore, there may be short and/or asymmetrical supplementary alignments that did not meet the criteria for storing anything in "supp". You'll see those alignments in alignments.tsv.gz, but the "has_supp" column of per_read.tsv.gz will still be "N" for that read, in that case.
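One way to spot such reads is to cross-check the two TSV outputs. The sketch below lists reads whose "has_supp" value is "N" even though they have more than one row in the alignments file. It rests on several assumptions: both files are gzipped, tab-delimited tables with header rows, the alignments file has one row per alignment, and both share a read identifier column, here given the hypothetical name "read_id". The helper names are also made up for this example.

```python
import csv
import gzip
from collections import Counter


def read_tsv_gz(path):
    """Yield each row of a gzipped, tab-delimited file as a dict."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def reads_with_unreported_alignments(per_read_path, alignments_path,
                                     read_id_col="read_id"):
    """Return read IDs flagged has_supp == "N" that have multiple alignment rows.

    `read_id_col` is a hypothetical column name; adjust it to the real header.
    """
    rows_per_read = Counter(
        row[read_id_col] for row in read_tsv_gz(alignments_path)
    )
    return sorted(
        row[read_id_col]
        for row in read_tsv_gz(per_read_path)
        if row["has_supp"] == "N" and rows_per_read[row[read_id_col]] > 1
    )
```

Any read ID this returns is a candidate for the special handling described above, and its individual alignments can then be inspected in alignments.tsv.gz.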
I'm open to the possibility that this special handling of supplementary alignments is a misfeature and should be changed in a future version of LAAVA.
Is it possible to have a clearer definition of the 'other-vector complex' classification? Below is a schematic that I created of what I think a complex read is. Red bar: forward strand with a subtype of full. Orange bar: reverse strand with a subtype of right-partial.
The Python script also mentions whether the primary and supplementary alignments diverge for the complex classification: https://github.com/formbio/laava/blob/79eaf5475414d823e9655bc6d870733b1410c4de/src/summarize_alignment.py#L601-L603
Could you explain this concept in more depth?