Closed a-ludi closed 6 years ago
This issue requires issue #34 to be completed.
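An alignment is called proper iff it starts and ends at a read boundary. In other words, given alignment begin $b_A$, end $e_A$ and length $l_A$ on read A, and $b_B$, $e_B$ and $l_B$ on read B, respectively, an alignment is called proper iff $(b_A = 0 \lor b_B = 0) \land (e_A = l_A \lor e_B = l_B)$, assuming 0-based coordinates with exclusive ends.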
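As a minimal sketch, the boundary condition can be written as a predicate. It reads "starts and ends at a read boundary" as: the alignment begins at the start of either read and ends at the end of either read. That reading, the `Alignment` record, its field names, and the 0-based, end-exclusive coordinates are all assumptions made for illustration, not taken from the code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record of a local alignment between reads A and B."""
    begin_a: int   # b_A
    end_a: int     # e_A (exclusive, assumed)
    length_a: int  # l_A
    begin_b: int   # b_B
    end_b: int     # e_B (exclusive, assumed)
    length_b: int  # l_B


def is_proper(aln: Alignment) -> bool:
    """Return True iff the alignment starts and ends at a read boundary."""
    starts_at_boundary = aln.begin_a == 0 or aln.begin_b == 0
    ends_at_boundary = (aln.end_a == aln.length_a
                        or aln.end_b == aln.length_b)
    return starts_at_boundary and ends_at_boundary


# A overlaps the start of B: begins inside A, runs to the end of A.
overlap = Alignment(begin_a=7000, end_a=10000, length_a=10000,
                    begin_b=0, end_b=3000, length_b=12000)
# A short hit in the middle of both reads, e.g. induced by a repeat.
local_hit = Alignment(begin_a=4000, end_a=4800, length_a=10000,
                      begin_b=6000, end_b=6800, length_b=12000)
```

Under these assumptions `is_proper(overlap)` holds while `is_proper(local_hit)` does not. A real implementation would probably allow a small tolerance at the boundaries, which this sketch leaves out.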
assessRepeatStructure(selfAlignment) // issue #25
assessRepeatStructure(readsToReferenceAlignment) // see above
filterWeaklyAnchoredReads() // issue #25
filterAmbiguouslyMappingReads() // see above
filterRedundantReads() // issue #3, #34
Above it says:
[…] By identifying those regions of the read that map to more than one locus we can derive some repetitive regions on the reference. […] Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well. […]
Implementing repeat detection with this technique is technically complex, and a simpler approach might achieve the same result: since all the involved alignments, or sub-regions thereof, map to the reference, these regions of the reference should map to themselves. Thus, by adjusting the error tolerance (`-e` switch for Dazzler tools) we should be able to observe the same regions.
Reads will not be considered if they map to different locations on the reference with similar quality. This should take into account that sometimes only short regions of a read map to a different location (but with the same quality). This might be caused by short transposable elements (TEs). Theoretically, masking repetitive regions should prevent this. However, there are two reasons why this can fail:
Countermeasures
Proper Alignments
The real alignment location should have a proper alignment. Proper alignments should be excluded in these two scenarios:
Short TE-induced local alignments
Short TEs will induce local alignments which are most certainly not proper. These alignments should (1) be used to further extend the repeat mask and (2) be discarded in further analysis if there is a "large enough"[^1] portion of the read that does not align to the reference – one might call this an "anti-anchor".
Thus, the first step in processing these alignments is to derive repetitive regions. Firstly, the region on the reference covered by the improper local alignment should be completely marked as repetitive. Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well.
Now, the second step is just the usual algorithm[^2] used to remove repeat-induced reads from the set of candidates.
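The first step and the "anti-anchor" test can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: intervals are half-open `(begin, end)` pairs on a single contig, the function names are hypothetical, and "large enough" is taken to mean at least `min_anchor_length` bases, following the footnote's pointer to `--min-anchor-length`.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [begin, end), single contig assumed


def extend_repeat_mask(mask: Iterable[Interval],
                       improper_hits: Iterable[Interval]) -> List[Interval]:
    """Add reference regions covered by improper local alignments to the
    repeat mask and merge overlapping or adjacent intervals."""
    merged: List[Interval] = []
    for begin, end in sorted(list(mask) + list(improper_hits)):
        if merged and begin <= merged[-1][1]:
            # Overlaps or touches the previous interval: grow it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def has_anti_anchor(read_length: int,
                    aligned_on_read: Iterable[Interval],
                    min_anchor_length: int) -> bool:
    """Return True if some stretch of the read that does not align to the
    reference is at least min_anchor_length bases long."""
    position = 0  # end of the aligned region seen so far
    for begin, end in sorted(aligned_on_read):
        if begin - position >= min_anchor_length:
            return True
        position = max(position, end)
    # Check the unaligned tail of the read.
    return read_length - position >= min_anchor_length


mask = extend_repeat_mask([(100, 200)], [(150, 300), (500, 600)])
discard = has_anti_anchor(10_000, [(0, 800)], min_anchor_length=5_000)
```

Here `mask` becomes `[(100, 300), (500, 600)]`, and `discard` is true because 9,200 bases of the read do not align. Marking the other loci that the same read portion maps to would only require passing those intervals in `improper_hits` as well.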
Algorithmic structure
The filtering stage must be restructured as follows:
[^1]: see `--min-anchor-length` and issue #33.
[^2]: see issue #25.