COMBINE-lab / RapMap

Rapid sensitive and accurate read mapping via quasi-mapping
GNU General Public License v3.0
89 stars 23 forks source link

mateIsFwd never gets a value set for unpaired hits #46

Open jenni-westoby opened 5 years ago

jenni-westoby commented 5 years ago

In RapMap v.0.5.0, the value of mateIsFwd is never set for unpaired hits in the function mergeLeftRightHits, in the file include/RapMapUtils.hpp. The function in question is below


        inline void mergeLeftRightHits(
                std::vector<QuasiAlignment>& leftHits,
                std::vector<QuasiAlignment>& rightHits,
                std::vector<QuasiAlignment>& jointHits,
                uint32_t readLen,
                uint32_t maxNumHits,
                bool& tooManyHits,
                HitCounters& hctr) {
            if (leftHits.size() > 0) {
                constexpr const int32_t signedZero{0};
                auto leftIt = leftHits.begin();
                auto leftEnd = leftHits.end();
                auto leftLen = std::distance(leftIt, leftEnd);
                if (rightHits.size() > 0) {
                    auto rightIt = rightHits.begin();
                    auto rightEnd = rightHits.end();
                    auto rightLen = std::distance(rightIt, rightEnd);
                    size_t numHits{0};
                    jointHits.reserve(std::min(leftLen, rightLen));
                    uint32_t leftTxp, rightTxp;
                    while (leftIt != leftEnd && rightIt != rightEnd) {
                        leftTxp = leftIt->tid;
                        rightTxp = rightIt->tid;
                        if (leftTxp < rightTxp) {
                            ++leftIt;
                        } else {
                            if (!(rightTxp < leftTxp)) {
                                int32_t startRead1 = std::max(leftIt->pos, signedZero);
                                int32_t startRead2 = std::max(rightIt->pos, signedZero);
                                bool read1First{(startRead1 < startRead2)};
                                int32_t fragStartPos = read1First ? startRead1 : startRead2;
                                int32_t fragEndPos = read1First ?
                                    (startRead2 + rightIt->readLen) : (startRead1 + leftIt->readLen);
                                uint32_t fragLen = fragEndPos - fragStartPos;
                                jointHits.emplace_back(leftTxp,
                                        startRead1,
                                        leftIt->fwd,
                                        leftIt->readLen,
                                        fragLen, true);
                                // Fill in the mate info
                                auto& qaln = jointHits.back();
                                qaln.mateLen = rightIt->readLen;
                                qaln.matePos = startRead2;
                                qaln.mateIsFwd = rightIt->fwd;
                                jointHits.back().mateStatus = MateStatus::PAIRED_END_PAIRED;

                                ++numHits;
                                if (numHits > maxNumHits) { tooManyHits = true; break; }
                                ++leftIt;
                            }
                            ++rightIt;
                        }
                    }
                }
                if (tooManyHits) { jointHits.clear(); ++hctr.tooManyHits; }
            }

            // If we had proper paired hits
            if (jointHits.size() > 0) {
                hctr.peHits += jointHits.size();
                //orphanStatus = 0;
            } else if (leftHits.size() + rightHits.size() > 0 and !tooManyHits) {
                // If there weren't proper paired hits, then either
                // there were too many hits, and we forcibly discarded the read
                // or we take the single end hits.
                auto numHits = leftHits.size() + rightHits.size();
                hctr.seHits += numHits;
                //orphanStatus = 0;
                //orphanStatus |= (leftHits.size() > 0) ? 0x1 : 0;
                //orphanStatus |= (rightHits.size() > 0) ? 0x2 : 0;
                jointHits.insert(jointHits.end(),
                        std::make_move_iterator(leftHits.begin()),
                        std::make_move_iterator(leftHits.end()));
                jointHits.insert(jointHits.end(),
                        std::make_move_iterator(rightHits.begin()),
                        std::make_move_iterator(rightHits.end()));
            }
        }

The problem occurs because in the final else if (leftHits.size() + rightHits.size() > 0 and !tooManyHits) statement, hits are appended to the end of joint hits without updating the unset mateIsFwd field. Instead mateIsFwd is 'randomly' filled with a garbage value which appears to have some dependency on the previous read which was aligned. The value which is assigned to mateIsFwd then has an impact on the SAM flag because mateIsFwd is used to determine the SAM flag in getSamFlags. Consequently, read SRR5237781.4 in problem4_1.fastq and problem4_2.fastq usually produces the following alignment on my machine:

SRR5237781.4    89  ENSMUST00000020488  36263   1   100M    =   36263   0   ATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAACATTGGACTGAAC    *   NH:i:1
SRR5237781.4    165 ENSMUST00000020488  36263   0   *   =   36263   0   GTAAGGCCAGCAATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAAC    *   NH:i:1

Whereas the same read in problem4_3read_1.fastq and problem4_3read_2.fastq usually produces this alignment (virtually identical except different SAM flags):

SRR5237781.4    121 ENSMUST00000020488  36263   1   100M    =   36263   0   ATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAACATTGGACTGAAC    *   NH:i:1
SRR5237781.4    181 ENSMUST00000020488  36263   0   *   =   36263   0   GTAAGGCCAGCAATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAAC    *   NH:i:1

This problem does not occur for proper paired hits, because mateIsFwd is correctly set in the if (!(rightTxp < leftTxp)) code block. On the develop-salmon branch, read SRR5237781.4 produces proper paired hits, so this test example does not work (or rather, it does work, because mateIsFwd is correctly set in if (!(rightTxp < leftTxp)) ). Unfortunately I have not yet been able to find as nice a test example for the develop-salmon branch yet, but from reading the code I believe that this might also be a problem on the develop-salmon branch. The fix would be to set the value of mateIsFwd in the final else if statement of mergeLeftRightHits().

problem4_1.fastq.gz problem4_2.fastq.gz problem4_3read_1.fastq.gz problem4_3read_2.fastq.gz

rob-p commented 5 years ago

Hi @jenni-westoby,

Thanks (again) for finding this and bringing it to our attention, and for the brilliantly-detailed bug report. Perhaps there are a couple of reasons this evaded us. The big one, I think, is that when the mate is unmapped, we internally assume that none of its flags are really meaningful. It seems that, while this is valid for many flags (SAM spec, section 1.4, bottom of page 6), it's actually not true for the 0x10 flag --- this should still take on a meaningful value. We'd simply been assuming that if a read is not mapped, it is written down in the orientation in which it appears in the input file (i.e. always forward). I've pushed a commit to the develop-salmon branch that should take care of this. Basically, at the places where the QuasiAlignment objects for the ends of the fragment are filled in, we now explicitly set mateIsFwd to true, so that if this object gets propagated to jointHits later, the 0x10 flag will come out properly. Given that 0.5.0 is a tagged release, I can't really change it easily there (also there has been substantial code re-origanization, so the precise nature of the fix would be different). Hopefully, though, the changes from develop-salmon can be merged to master very soon, at which point I'll tag a new release with this bug fix.

The downside of fixing it on this branch, means that I can't immediately check the fix against your example (because the read is properly paired). So, I'll leave this issue open for the time being until we can ensure that my fix actually works. Thanks again!

jenni-westoby commented 5 years ago

Thanks for the speedy response! And apologies for not having a test example that worked for develop-salmon.