Do not exclude reads with short repeats included

a-ludi commented 6 years ago

Reads will not be considered if they map to different locations on the reference with similar quality. This should take into account that sometimes only short regions of a read map to a different location (but with the same quality). This might be caused by short transposable elements (TEs). Theoretically, this should be prevented by masking repetitive regions. But there are two reasons for this to fail:

1. The TE has only one copy on the input reference sequence and one or more other copies in gap regions. Thus, reads from these regions will align to both the correct location at the gap boundary and the repeated element on the reference.
2. While the repetitive regions are assessed before eliminating ambiguously aligning reads, we use that information only after these reads have been eliminated.

Countermeasures

Proper Alignments

The real alignment location should have a proper alignment. Proper alignments should be excluded in these two scenarios:

1. If the same read has further proper alignments on the same contig, or on more than one other contig, it should be excluded because it maps ambiguously. However, the involved loci are not necessarily repetitive regions, as different regions of the read may be involved. By identifying those regions of the read that map to more than one locus we can derive some repetitive regions on the reference.
2. The read has a proper alignment that is fully contained in a contig. These alignments will only deteriorate performance of further processing. Other alignments of the same read, in contrast, will likely be wrong, so we should exclude all of them.
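
A minimal Python sketch of the two exclusion scenarios above. It is an illustration, not djunctor's actual code: `ProperAlignment`, `is_fully_contained` and `should_exclude_read` are hypothetical names, and the rule "at most one proper alignment per contig, on at most two contigs" is my reading of the first scenario.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProperAlignment:
    """A proper alignment of one read against one contig (hypothetical model)."""
    contig_id: int
    read_begin: int   # first aligned position on the read
    read_end: int     # one past the last aligned position on the read
    read_length: int


def is_fully_contained(aln):
    # The whole read aligns, so the read lies entirely inside the contig.
    return aln.read_begin == 0 and aln.read_end == aln.read_length


def should_exclude_read(proper_alignments):
    """Return True if the read falls into one of the two scenarios."""
    per_contig = Counter(a.contig_id for a in proper_alignments)
    # Scenario 1 (assumed reading): several proper alignments on the same
    # contig, or proper alignments on more than two contigs in total.
    maps_ambiguously = (
        any(count > 1 for count in per_contig.values()) or len(per_contig) > 2
    )
    # Scenario 2: some proper alignment is fully contained in a contig.
    contained = any(is_fully_contained(a) for a in proper_alignments)
    return maps_ambiguously or contained
```

Under this reading, a read that spans a gap with one proper alignment on each flanking contig is kept, while a read with two proper alignments on the same contig is dropped.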

Short TE-induced local alignments

Short TEs will induce local alignments which are most certainly not proper. These alignments should (1) be used to further extend the repeat mask and (2) be discarded in further analysis if there is a "large enough"[^1] portion of the read that does not align to the reference – one might call this an "anti-anchor".
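
The "anti-anchor" test could look roughly like the Python sketch below. The function names are hypothetical, the aligned regions are assumed to be half-open `(begin, end)` intervals in read coordinates, and comparing against the minimum anchor length with `>=` is an assumption on my part.

```python
def longest_unaligned_stretch(read_length, aligned_regions):
    """Length of the longest part of the read not covered by any alignment."""
    longest = 0
    covered_until = 0
    for begin, end in sorted(aligned_regions):
        if begin > covered_until:
            # Gap between the previous coverage and this alignment.
            longest = max(longest, begin - covered_until)
        covered_until = max(covered_until, end)
    # Unaligned tail of the read after the last alignment.
    return max(longest, read_length - covered_until)


def has_anti_anchor(read_length, aligned_regions, min_anchor_length):
    """True if a "large enough" portion of the read does not align."""
    return longest_unaligned_stretch(read_length, aligned_regions) >= min_anchor_length
```

For example, a 10 kb read with a single 500 bp local alignment leaves 9.5 kb unaligned and would count as having an anti-anchor for any threshold up to that length.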

Thus, the first step in processing these alignments is to derive repetitive regions. Firstly, the region on the reference covered by the improper local alignment should be completely marked as repetitive. Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well.
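
Extending the repeat mask amounts to adding reference intervals and merging overlaps. A minimal Python sketch, assuming the mask is a list of half-open `(contig_id, begin, end)` tuples; this representation and the name `add_to_mask` are assumptions made for illustration:

```python
def add_to_mask(mask, new_regions):
    """Return a sorted mask with new_regions added and overlaps merged."""
    merged = []
    for contig, begin, end in sorted(list(mask) + list(new_regions)):
        if merged and merged[-1][0] == contig and begin <= merged[-1][2]:
            # Overlaps or touches the previous interval on the same contig.
            last = merged[-1]
            merged[-1] = (contig, last[1], max(last[2], end))
        else:
            merged.append((contig, begin, end))
    return merged
```

Both the region covered by the improper local alignment and any other locations hit by the same portion of the read would be passed in as `new_regions`.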

Now, the second step is just the usual algorithm[^2] used to remove repeat-induced reads from the set of candidates.

Algorithmic structure

The filtering stage must be restructured as follows:

procedure filterReads:
begin
    assessRepeatStructure(selfAlignment)  // issue #25
    assessRepeatStructure(readsToReferenceAlignment)  // see above

    filterWeaklyAnchoredReads()  // issue #25
    filterAmbiguouslyMappingReads()  // see above
    filterRedundantReads()  // issue #3, #34
end

[^1]: see --min-anchor-length and issue #33.
[^2]: see issue #25.

a-ludi commented 6 years ago

This issue requires issue #34 to be completed.

a-ludi commented 6 years ago

An alignment is called proper iff it starts and ends at a read boundary. In other words, given alignment begin b_A, end e_A and length l_A on read A, and b_B, e_B and l_B on read B, an alignment is called proper iff

(b_A = 0 or b_B = 0) and (e_A = l_A or e_B = l_B)
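
The condition translates directly into a predicate. A small Python sketch, where the `Alignment` record and its field names are hypothetical and coordinates are taken to be 0-based with an exclusive end:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Local alignment between reads A and B (hypothetical record)."""
    begin_a: int
    end_a: int
    length_a: int
    begin_b: int
    end_b: int
    length_b: int


def is_proper(aln):
    # The alignment starts at the beginning of A or of B ...
    starts_at_boundary = aln.begin_a == 0 or aln.begin_b == 0
    # ... and ends at the end of A or of B.
    ends_at_boundary = aln.end_a == aln.length_a or aln.end_b == aln.length_b
    return starts_at_boundary and ends_at_boundary
```

An alignment that covers only an interior stretch of both sequences, as a short TE copy typically would, fails both clauses and is therefore not proper.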

a-ludi commented 6 years ago

Tasks

[x] assessRepeatStructure(selfAlignment) // issue #25
[x] assessRepeatStructure(readsToReferenceAlignment) // see above
[x] reorder filters
[x] filterWeaklyAnchoredReads() // issue #25
[x] filterAmbiguouslyMappingReads() // see above
[x] filterRedundantReads() // issue #3, #34

a-ludi commented 6 years ago

Above it says:

[…] By identifying those regions of the read that map to more than one locus we can derive some repetitive regions on the reference. […] Secondly, should the involved portion of the read map to any other location on the reference then regions covered by that portion should be marked as repetitive as well. […]

Implementing repeat detection this way is technically complex, and the same result might be achieved by a simpler approach: since all the involved alignments, or sub-regions thereof, map to the reference, these regions of the reference should map to themselves. Thus, by adjusting the error tolerance (-e switch for Dazzler tools) we should be able to observe the same regions in the self-alignment.

a-ludi / djunctor