RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
59 stars 8 forks source link

Potentially incorrect configuration values in documentation #36

Open kennyworkman opened 6 hours ago

kennyworkman commented 6 hours ago

Hi @RWilton, thank you for your work.

On page 24 of the manual linked in the README, you have these attributes for the gapped alignment.

<gapped seed="hsi20_0_30" maxJ="200" Wmxgs="2,6,5,3" Vt="100" />

From my understanding this constant score threshold will always exceed the maximum possible score computed with a seed size of 20. The ValidateScoreFunction seems to compute 40, if the BWm is 2.

    // threshold score for a Q sequence of length N (windowed gapped alignments only)
    double Vtw = (ASP.tpGw.sfT == ScoreFunctionType::G) ? ASP.tpGw.sfA*log(static_cast<double>(Nmax)) + ASP.tpGw.sfB :
                 (ASP.tpGw.sfT == ScoreFunctionType::S) ? ASP.tpGw.sfA*sqrt(static_cast<double>(Nmax)) + ASP.tpGw.sfB :
                 (ASP.tpGw.sfT == ScoreFunctionType::L) ? ASP.tpGw.sfA*Nmax + ASP.tpGw.sfB :
                                                          ASP.tpGw.sfB;
    if( (Vtw > Vp) || (Vtw <= 0) )
        throw new ApplicationException( __FILE__, __LINE__, "The specified Vtw score function (%s) is invalid for a query sequence of length %d", AriocAlignmentScorer::ScoreFunctionToString(ASP.tpGw), Nmax );

Indeed this is the error I get when I run AriocP with the above configuration.

191202064 [00000de9] AriocP v1.52.3143.24176 (release) +2024-06-24T11:14:50
191202067 [00000de9] Copyright (c) 2015-2024 Johns Hopkins University.  All rights reserved.
191202067 [00000de9]  compiled             : 2024-09-06T06:44:19 (GNU g++ v11.4.0)
191202067 [00000de9]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
191202067 [00000de9]  computer name        : <redacted>
191202067 [00000de9]  operating system     : Ubuntu 22.04.4 LTS
191202067 [00000de9]  executable file      : /root/bin/AriocP
191202067 [00000de9]  configuration file   : /root/arioc-config/align.cfg
191202757 [00000de9] ApplicationException ([0x00003561] AriocCommon/AriocAlignmentScorer.cpp 177): The specified Vt score function (100) is invalid for a query sequence of length 25

Switching to a linear scoring function, with a negative offset, as I've seen you do in these issues allows the code to run. Am I missing something?

RWilton commented 5 hours ago

No, you are not missing anything! This is expected behavior from the aligner's point of view. After all, it makes no sense to specify an alignment score threshold Vt of 100 when you already know that the input read sequences can be as short as 25 bases and that those perfectly mapped 25-base reads will have an alignment score of 50.

The solution is to choose appropriate alignment scoring parameters for your reads, and that depends on your own good judgment.

If every read in the input has the same length, it is simplest to specify Vt as a constant. (In the case of 25-base reads with a base-match score of 2, perhaps Vt = 40 would be reasonable if you are using the default alignment score values.) If the input reads have different lengths, then it probably makes more sense to specify Vt as a function of the read length. (Don't forget that AriocE's output log contains information regarding the range of read lengths in your data.)

Either way, the more reads you have that do not map with perfect alignment scores, the more it pays to use your alignment results to go back and tune Vt. If Vt is too high, the aligner will exclude too many acceptable mappings from your results. If Vt is too low, the aligner will report too many questionable mappings. You can start by making assumptions about how many mismatches and how many gaps you would accept in any given mapping, and then choosing the scoring parameters (including Vt) accordingly.

Paired-end reads complicate things a bit because the most interesting mappings may be those where one mate maps with a high alignment score and the opposite mate maps with a borderline alignment score in a region of interest (a region of high variability, for example). Arioc's Vtw parameter provides a way to cope with this specific situation.

For what it's worth, the "simple" example in the Arioc User Guide illustrates my own personal bias -- that is, that your reads are generally at least 100 bases in length, and that a 100-base mapping would have to contain more than a dozen mismatches before it would be excluded from the alignment results. Given the accuracy and read lengths of aligners these days, that's not a bad assumption, but as you have observed, it certainly isn't a one-size-fits-all solution!

kennyworkman commented 4 hours ago

I guess what I am confused by is that the best case score is computed from the seed size (at least in the source code I linked). This does not depend on ingested reads, but is configured in this file. It seems this configuration will always cause the program to fail, you included it in the manual and that is why I am confused. Sorry if I am missing something.

RWilton commented 2 hours ago

In that case, I need to look at the code, because the best-case score is supposed to be computed from the maximum read length.

I will let you know...

RWilton commented 2 hours ago

Ah, yah, I see a bug: ValidateScoreFunction's caller is specifying the parameters in the wrong order.

I can see that things still work most of the time (i.e., for longer read lengths), but I need to fix the code and then create some test reads of length 25 to verify the fix. I will upload a new build and let you know when it's available.

Thanks for letting me know about this one!

RWilton commented 23 minutes ago

Updated to v1.52.3144 to fix the above bug. Please let me know what you see at your end.