Open mdshw5 opened 8 years ago
Hi @mdshw5,
Yes, we have been thinking about this issue. I can't say we have the perfect solution, but I do think we have some ideas. In general, one of the distinctions between the ultrafast mapping approaches and traditional alignment is that there is typically no "validation" step in the mapping approaches. Relatedly, we're typically looking for the single best match between the query and the reference, and the best match may not be a good match. For example, if a single k-mer from the query matches to a set of transcripts, and no other k-mer from the query matches anywhere, then this mapping is the best match (when, in reality, it may be better to leave this read completely unmapped).
One of the recent things we've been working on in RapMap and salmon 0.7.3 is a set of options for dealing with this. In particular, something like this: an option which allows one to set some minimum coverage threshold for a read to be considered mappable. Relatedly, there are some internal variables in RapMap that could be exposed to prevent a match from being predicated entirely upon highly repeated sequence. For example, given the suffix array interval for a match (either a k-mer or a longer maximum matchable prefix), we know how many occurrences of this string exist in the reference. One could consider counting these matches toward coverage, but disallowing them from being "anchors" for a mapping (i.e. if the only matches for anchoring an alignment occur more than x times, then don't count them as anchors). Like I said, variables for these quantities already exist inside RapMap, but the challenge would be in terms of (1) exposing them and (2) deciding on reasonable and general heuristics to apply them.
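To make the two filters described above a bit more concrete, here is a rough Python sketch. It is not RapMap's code: the `Hit` record, the function names, and the default thresholds (`min_coverage`, and `max_anchor_occ` playing the role of "x") are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One exact match between the read and the reference (hypothetical record)."""
    query_pos: int    # start of the match within the read
    length: int       # length of the exact match
    occurrences: int  # size of the suffix array interval for the matched string


def covered_fraction(hits: List[Hit], read_len: int) -> float:
    """Fraction of read bases covered by at least one hit."""
    covered = [False] * read_len
    for h in hits:
        for i in range(max(h.query_pos, 0), min(h.query_pos + h.length, read_len)):
            covered[i] = True
    return sum(covered) / read_len


def is_mappable(hits: List[Hit], read_len: int,
                min_coverage: float = 0.5, max_anchor_occ: int = 100) -> bool:
    """Apply both filters: a minimum coverage and a non-repetitive anchor.

    Every hit counts toward coverage, but only hits occurring at most
    max_anchor_occ times in the reference may serve as anchors.
    """
    if read_len <= 0 or not hits:
        return False
    has_anchor = any(h.occurrences <= max_anchor_occ for h in hits)
    return has_anchor and covered_fraction(hits, read_len) >= min_coverage
```

With these made-up defaults, a 100 bp read supported by a single 31-mer would be left unmapped (31% coverage), and so would a well-covered read whose matches all fall in highly repeated sequence. What the defaults should actually be is exactly the open question in (2).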
To this end, I'd actually be very interested to hear feedback. Do these options sound useful? Which ones might be best? Are there other (reasonably efficient) filters that would also help? For example, we can trivially enforce co-linearity of the matches between the query and the reference, but non-colinear chains don't seem to actually show up too much for the best mappings. The coverage may be a different story though.
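For reference, the co-linearity check mentioned above amounts to something like the following Python sketch. The `(query_pos, ref_pos)` tuple representation and the function name are assumptions for illustration, not how RapMap stores its matches.

```python
from typing import List, Tuple


def is_colinear(matches: List[Tuple[int, int]]) -> bool:
    """Return True if matches appear in the same order on the read and the reference.

    Each match is a (query_pos, ref_pos) pair. After sorting by query position,
    the reference positions must be strictly increasing.
    """
    ordered = sorted(matches)
    return all(a[1] < b[1] for a, b in zip(ordered, ordered[1:]))
```

For example, `[(0, 100), (20, 120), (50, 150)]` is co-linear, while `[(0, 100), (20, 90)]` is not, because the second match lies upstream of the first on the reference.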
Hi Rob. I'm using the `--writeMappings` option in Salmon 0.7.2, but reporting this issue here since I think the code base is (will be) the same. I have a few genes that we've noticed (based on biological intuition) are called as expressed when they shouldn't be. The mapping data has really helped to diagnose these transcripts! Here's one example:
This transcript has some repetitive sequence near the 3' end, which is masked by RepeatMasker:
The IGV screenshot has two tracks: the top track uses the transcript sequence as is, and the bottom track has all of the RepeatMasker soft-masked bases converted to hard-masked ("N") bases. You can see that the mapping is greatly improved. I'm not sure other transcripts will benefit as much from masking repetitive elements (some might even be harmed), but have you considered any way to deal with this issue? The reads mapping to this transcript are: