RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
65 stars 9 forks source link

Potentially incorrect configuration values in documentation #36

Closed kennyworkman closed 2 months ago

kennyworkman commented 2 months ago

Hi @RWilton, thank you for your work.

On page 24 of the manual linked in the README, you have these attributes for the gapped alignment.

<gapped seed="hsi20_0_30" maxJ="200" Wmxgs="2,6,5,3" Vt="100" />

From my understanding this constant score threshold will always exceed the maximum possible score computed with a seed size of 20. The ValidateScoreFunction seems to compute 40, if the BWm is 2.

    // threshold score for a Q sequence of length N (windowed gapped alignments only)
    double Vtw = (ASP.tpGw.sfT == ScoreFunctionType::G) ? ASP.tpGw.sfA*log(static_cast<double>(Nmax)) + ASP.tpGw.sfB :
                 (ASP.tpGw.sfT == ScoreFunctionType::S) ? ASP.tpGw.sfA*sqrt(static_cast<double>(Nmax)) + ASP.tpGw.sfB :
                 (ASP.tpGw.sfT == ScoreFunctionType::L) ? ASP.tpGw.sfA*Nmax + ASP.tpGw.sfB :
                                                          ASP.tpGw.sfB;
    if( (Vtw > Vp) || (Vtw <= 0) )
        throw new ApplicationException( __FILE__, __LINE__, "The specified Vtw score function (%s) is invalid for a query sequence of length %d", AriocAlignmentScorer::ScoreFunctionToString(ASP.tpGw), Nmax );

Indeed this is the error I get when I run AriocP with the above configuration.

191202064 [00000de9] AriocP v1.52.3143.24176 (release) +2024-06-24T11:14:50
191202067 [00000de9] Copyright (c) 2015-2024 Johns Hopkins University.  All rights reserved.
191202067 [00000de9]  compiled             : 2024-09-06T06:44:19 (GNU g++ v11.4.0)
191202067 [00000de9]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
191202067 [00000de9]  computer name        : <redacted>
191202067 [00000de9]  operating system     : Ubuntu 22.04.4 LTS
191202067 [00000de9]  executable file      : /root/bin/AriocP
191202067 [00000de9]  configuration file   : /root/arioc-config/align.cfg
191202757 [00000de9] ApplicationException ([0x00003561] AriocCommon/AriocAlignmentScorer.cpp 177): The specified Vt score function (100) is invalid for a query sequence of length 25

Switching to a linear scoring function, with a negative offset, as I've seen you do in these issues allows the code to run. Am I missing something?

RWilton commented 2 months ago

No, you are not missing anything! This is expected behavior from the aligner's point of view. After all, it makes no sense to specify an alignment score threshold Vt of 100 when you already know that the input read sequences can be as short as 25 bases and that those perfectly mapped 25-base reads will have an alignment score of 50.

The solution is to choose appropriate alignment scoring parameters for your reads, and that depends on your own good judgment.

If every read in the input has the same length, it is simplest to specify Vt as a constant. (In the case of 25-base reads with a base-match score of 2, perhaps Vt = 40 would be reasonable if you are using the default alignment score values.) If the input reads have different lengths, then it probably makes more sense to specify Vt as a function of the read length. (Don't forget that AriocE's output log contains information regarding the range of read lengths in your data.)

Either way, the more reads you have that do not map with perfect alignment scores, the more it pays to use your alignment results to go back and tune Vt. If Vt is too high, the aligner will exclude too many acceptable mappings from your results. If Vt is too low, the aligner will report too many questionable mappings. You can start by making assumptions about how many mismatches and how many gaps you would accept in any given mapping, and then choosing the scoring parameters (including Vt) accordingly.

Paired-end reads complicate things a bit because the most interesting mappings may be those where one mate maps with a high alignment score and the opposite mate maps with a borderline alignment score in a region of interest (a region of high variability, for example). Arioc's Vtw parameter provides a way to cope with this specific situation.

For what it's worth, the "simple" example in the Arioc User Guide illustrates my own personal bias -- that is, that your reads are generally at least 100 bases in length, and that a 100-base mapping would have to contain more than a dozen mismatches before it would be excluded from the alignment results. Given the accuracy and read lengths of aligners these days, that's not a bad assumption, but as you have observed, it certainly isn't a one-size-fits-all solution!

kennyworkman commented 2 months ago

I guess what I am confused by is that the best case score is computed from the seed size (at least in the source code I linked). This does not depend on ingested reads, but is configured in this file. It seems this configuration will always cause the program to fail, you included it in the manual and that is why I am confused. Sorry if I am missing something.

RWilton commented 2 months ago

In that case, I need to look at the code, because the best-case score is supposed to be computed from the maximum read length.

I will let you know...

RWilton commented 2 months ago

Ah, yah, I see a bug: ValidateScoreFunction's caller is specifying the parameters in the wrong order.

I can see that things still work most of the time (i.e., for longer read lengths), but I need to fix the code and then create some test reads of length 25 to verify the fix. I will upload a new build and let you know when it's available.

Thanks for letting me know about this one!

RWilton commented 2 months ago

Updated to v1.52.3144 to fix the above bug. Please let me know what you see at your end.

kennyworkman commented 2 months ago

Thanks @RWilton will build + test shortly and confirm resolution so you can close this off

kennyworkman commented 2 months ago

Builds and runs without issue. Thanks for the quick patch.

Just to clarify though, this bug + the fix have nothing to do with the input read size. I am unsure what the test for reads of length 25 would do because the ValidateScoreFunction was just using the seedWidth, which would cause the program to fail with any input file and eg. a suitably large constant value for Vt like 100. Your patch that corrects the parameter order:

    // validate the score function specified for Vt and Vtw
    this->aas.ValidateScoreFunction( this->a21hs.seedWidth, this->pifgQ->GetEstimatedNmax() );

to

    // validate the score function specified for Vt and Vtw
    this->aas.ValidateScoreFunction( this->pifgQ->GetEstimatedNmax(), this->a21hs.seedWidth );
RWilton commented 2 months ago

Good point. I dug back into the repo and saw that the bug was introduced in build 3106 and released in v1.51 a year ago last May, so there have been plenty of opportunities for the bug to appear. The question is why nobody ran into it before now.

I think the most likely explanation is simply that the bug was elicited when the Vt constraint was a constant. Again, that's something you would probably want to do only if you have constant-length reads, and you would choose that constant value to make sense wrt the expected read length and alignment scoring values.

I suspect that most people specify Vt as a function of read length. In that case, the buggy code computed Vt using a very low read length of 20 (or whatever the seed width was). The computed Vt was obviously bogus, but it passed validation anyway because it was non-negative and lower than the computed alignment score for a perfect mapping (even though that score was also based erroneously on the seed width instead of the read length).

Again, thank you for flagging this problem and giving us a chance to find and fix the bug.

kennyworkman commented 2 months ago

Thanks for the quick turn around. Our team is exploring useful ways to deploy this tool for industry use cases, so will likely continue to stress test and help you improve it. Great work and learning a lot reading your code and papers.