COMBINE-lab / RapMap

Rapid sensitive and accurate read mapping via quasi-mapping
GNU General Public License v3.0
89 stars 23 forks source link

Mapping to repetitive regions #32

Open mdshw5 opened 8 years ago

mdshw5 commented 8 years ago

Hi Rob. I'm using the --writeMappings option in Salmon 0.7.2, but reporting this issue here since I think the code base is (will be) the same. I have a few genes that we've noticed (based on biological intuition) are called expression when they shouldn't be. The mapping data has really helped to diagnose these transcripts! Here's one example:

IGV mapped data

This transcript has some repetitive sequence near the 3' end, which is masked by RepeatMasker:

>ENSMUST00000034602.7
GAATACCCATGAGTCTGTTGGCGGAGGATAGTGCTAACCCTGGCTGATCATCCTGTGGCTTGCCTCTATTTGTTGCATGATTCCCTCTGGAAGATGGAACACAGCGGGATTCTGGCTAGTCTGATACTGATTGCTGTTCTCCCCCAAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTAGATGGAACGGTGGAAGGATGGTTTGCAAAGAATAAAACACTCAACTTGGGCAAAGGCGTTCTGGACCCACGAGGGATATATCTGTGTAATGGGACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCGGGCACCATGGCTGGTGTCATCTTCATTGACCTCATCGCAACTCTGCTCCTGGCTTTGGGCGTCTACTGCTTTGCAGGACATGAGACCGGAAGGCCTTCTGGGGCTGCTGAGGTTCAAGCACTGCTGAAGAATGAGCAGCTGTATCAGCCTCTTCGAGATCGTGAAGATACCCAGTACAGCCGTCTTGGAGGGAACTGGCCCCGGAACAAGAAATCTTAAGATGGAGGTTTCTCAAAGTGCCATAGCAACGGCGCCTTCCCCTGTGATCAACCAATAAAGACGTTTCCTTCCGTCGGCAGGCGCCTGTGTTTTCCAAAGCATGGATTGAGAGTTGTCTTAGCTTGGCTGAGTTCTGCCCACCCGTGGCCCCTCACTTCTGGCTTCCTCTCCTCAGACCCAGACAGGCAGTACGAGGGACAGCTTCCCATCCTGGCCCCTTGGCATTTCTAACCCTGTGGGTTTCCTGTGGGCAGGGTCAGCCCAGCCACTCTGCTTCATTGCCTGTTGGTTCCAGTCCCTATCATAACAACTTCCGGGATTTTGCTGCCTTACCACAGCCCAAAGCAGGCTGTGGTGTTTAGAAGCTGAGCCACAAAGAAGTTTCCATGACATCATGAATGGGGGTGGCAGAGAAGAATATTGGGGCTCAGAGGgtgtgtgtgtgtgtgtgtgtgtgtgtgtgagtgagagagagagagagagagagagagagagagagagagagagacagacagagagacagagagacagaaagacagaaagacagaaagacagagacagagctgtcAAGaattttcttctcacttcgtgggttctggggattgtactaaggtcatgaggcttgtgtggcaagcacccttacccactgagccatcttgcctgcccTCATCCTCaaattaattaaaaataaagaacatgatgaattcc

The IGV screenshot has two tracks: the top track is using the transcript sequence as is, and the bottom is converting all of the RepeatMasker soft-masked bases to hard-masking ("N"). You can see that the mapping is greatly improved. I'm not sure other transcripts will benefit so much (or might even be harmed) by masking repetitive elements, but have you considered any way to deal with this issue? The reads mapping to this transcript are:

1       163 ENSMUST00000034602.7    146 1   80M =   319 253 AAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTA    *   NH:i:1
2       83  ENSMUST00000034602.7    319 1   80M =   146 -253    ACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCCGG    *   NH:i:1
3       121 ENSMUST00000034602.7    976 1   125M    =   976 0   CCAGTTGGTTGTCTAGCACCAAATAGCTCAGAAAATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA   *   NH:i:1
4       377 ENSMUST00000034602.7    988 1   125M    =   988 0   CCCTGTATTATAATGACACTGGGATGTGAAGCCAGCTCTTCTCACTCCAGCCTGCTCTTTGTCTGCTTCGGCACACCTCTTATCACCCCTCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG   *   NH:i:15
5       121 ENSMUST00000034602.7    1007    1   80M =   1007    0   GAACCGGGATTGGTCTATTTGTCCTTTTAGAGGCCAGACAACTGCATATGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAG    *   NH:i:1
6       163 ENSMUST00000034602.7    1010    1   80M =   1018    88  ATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGA    *   NH:i:1
7       121 ENSMUST00000034602.7    1017    1   80M =   1017    0   CAAATACTTTTAAAGCAGTCCAAGTTTATTTTCATTCGTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAG    *   NH:i:1
8       83  ENSMUST00000034602.7    1018    1   80M =   1010    -88 AAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGC    *   NH:i:1
9       121 ENSMUST00000034602.7    1021    1   80M =   1021    0   TAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA    *   NH:i:1
10      163 ENSMUST00000034602.7    1022    1   125M    =   1056    159 AATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAA   *   NH:i:1
11      105 ENSMUST00000034602.7    1024    1   80M =   1024    0   CCCGGCATAGTCTGATTTATTTATTTTTCATTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:1
12      153 ENSMUST00000034602.7    1030    1   80M =   1030    0   GTTTCTTAATCTCTCTTTCTCTTTCTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGTCTTTTCTATACCAGTTTATA    *   NH:i:1
13      361 ENSMUST00000034602.7    1033    1   80M =   1033    0   GTCAGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:2
14      393 ENSMUST00000034602.7    1035    1   80M =   1035    0   TAATAATTCTCATAATTTTTTTCTCCTTACCATGTGCATGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:2
15      377 ENSMUST00000034602.7    1035    1   80M =   1035    0   GCAGAAAGTTAAATAAAAGGTAAGTACGTGTGTGTGGCCAAATCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:15
16      361 ENSMUST00000034602.7    1036    1   80M =   1036    0   AGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:2
17      361 ENSMUST00000034602.7    1044    1   80M =   1044    0   CGGAAATGTGAAAGTTCTGAGACAGACATATAGACAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:7
18      153 ENSMUST00000034602.7    1050    1   80M =   1050    0   GTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACAT    *   NH:i:1
19      99  ENSMUST00000034602.7    1054    1   125M    =   1065    136 GTGTGAGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCT   *   NH:i:1
20      83  ENSMUST00000034602.7    1056    1   125M    =   1022    -159    AAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAAAAAGAGGACATGAATTTAAAAGAGCAAGGAGGGG   *   NH:i:1
21      105 ENSMUST00000034602.7    1056    1   80M =   1056    0   GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGCGCGTGAGAGAGAGCGGGCGAGCACACTCTGTGCCTCTCAGGAAG    *   NH:i:1
22      121 ENSMUST00000034602.7    1056    1   80M =   1056    0   GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCAAGAGCGAGCGAGAGCGAAAGCATGTTCAGGT    *   NH:i:1
23      153 ENSMUST00000034602.7    1058    1   80M =   1058    0   GAGAAAATGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCGAGCGCAGGCTGGTGCCTTTTTCAGTAACTTCCC    *   NH:i:1
24      147 ENSMUST00000034602.7    1065    1   125M    =   1054    -136    TGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCTCAGTCAGAAAC   *   NH:i:1
25      361 ENSMUST00000034602.7    1086    1   80M =   1086    0   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGTGTGCAAATGATGCAAAGCCTACTGCCTAAGTTGTATTT    *   NH:i:22
26      409 ENSMUST00000034602.7    1094    1   80M =   1094    0   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGAAACGTGACTTCCTCTTTTTCTATTTGTAGAACATTTATTTTCTTCC    *   NH:i:27
27      83  ENSMUST00000034602.7    1134    1   125M    =   1137    128 GACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG   *   NH:i:1
28      163 ENSMUST00000034602.7    1137    1   125M    =   1134    -128    AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAA   *   NH:i:1
29      99  ENSMUST00000034602.7    1137    1   80M =   1139    82  AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGAC    *   NH:i:1
30      147 ENSMUST00000034602.7    1139    1   80M =   1137    -82 AAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG    *   NH:i:1
31      105 ENSMUST00000034602.7    1141    1   125M    =   1141    0   AGACAGAAAGACAGAAAGACAGAAAGACAGAAACACATAGAGGTGTACTGTACAACTGGTAAGAAAAGGCATATTTTTTAAATATAGAAGTGCCTACATATAGAGTGTACTTGGGAAAGAAAAGA   *   NH:i:1
32      169 ENSMUST00000034602.7    1225    1   80M =   1225    0   TAACATCATCAGGCTTGGTGGCAAGCACCCTTACCCACTGAGCCATCTAATAGCTCTTCTCTTGAATTTTTTTAAACACC    *   NH:i:1
rob-p commented 8 years ago

Hi @mdshw5,

Yes, we have been thinking about this issue. I can't say we have the perfect solution, but I do think we have some ideas. In general, one of the distinctions between the ultrafast mapping approaches and traditional alignment is that there is typically no "validation" step in the mapping approaches. Relatedly, we're typically looking for the single best match between the query and the reference, and the best match may not be a good match. For example, if a single k-mer from the query matches to a set of transcripts, and no other k-mer from the query matches anywhere, then this mapping is the best match (when, in reality, it may be better to leave this read completely unmapped).

One of the recent things we've been working on in RapMap and salmon 0.7.3 are some options for dealing with this. Particularly, something like this — an option which allows one to set some minimum coverage threshold for a read to be considered mappable. Relatedly, there are some internal variables in RapMap that it would be possible to expose that would allow preventing a match to be predicated entirely upon highly repeated sequence. For example, given the suffix array interval for a match (either a k-mer or a longer maximum matchable prefix), we know how many occurrences of this string exist in the reference. One could consider counting these matches toward coverage, but disallowing them from being "anchors" for a mapping (i.e. if the only matches for anchoring an alignment occur more than x times, then don't count them as anchors). Like I said, variables for these quantities already exist inside RapMap, but the challenge would be in terms of (1) exposing them and (2) deciding on reasonable and general heuristics to apply them.

To this end, I'd actually be very interested to hear feedback. Do these options sound useful? Which ones might be best? Are there other (reasonably efficient) filters that would also help? For example, we can trivially enforce co-linearity of the matches between the query and the reference, but non-colinear chains don't seem to actually show up to much for the best mappings. The coverage may be a different story though.