COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Salmon v0.10.0: [Error in function boost::math::digamma<double>(double): numeric overflow] #229

Closed knokknok closed 6 years ago

knokknok commented 6 years ago

Sorry, me (#228) again... While most of my samples seem to work on linux, I got an exception for one of them:

Version Info: This is the most recent version of Salmon.

### salmon (mapping-based) v0.10.0
### [ program ] => salmon
### [ command ] => quant
### [ index ] => { salmon010.index.all_combined }
### [ libType ] => { A }
### [ mates1 ] => { R1.fastq.gz }
### [ mates2 ] => { R2.fastq.gz }
### [ posBias ] => { }
### [ gcBias ] => { }
### [ seqBias ] => { }
### [ useVBOpt ] => { }
### [ validateMappings ] => { }
### [ output ] => { processed_salmon0100_k31_allcombined/R }

Logs will be written to processed_salmon0100_k31_allcombined/R/logs
[2018-05-31 16:54:42.310] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2018-05-31 16:54:42.310] [jointLog] [info] Usage of --validateMappings implies use of range factorization. rangeFactorizationBins is being set to 4
[2018-05-31 16:54:42.310] [jointLog] [info] parsing read library format
[2018-05-31 16:54:42.310] [jointLog] [info] There is 1 library.
[2018-05-31 16:54:42.480] [jointLog] [info] Loading Quasi index
[2018-05-31 16:54:42.501] [jointLog] [info] Loading 32-bit quasi index
[2018-05-31 16:54:42.501] [stderrLog] [info] Loading Suffix Array
[2018-05-31 16:55:01.293] [stderrLog] [info] Loading Transcript Info
[2018-05-31 16:55:06.428] [stderrLog] [info] Loading Rank-Select Bit Array
[2018-05-31 16:55:07.107] [stderrLog] [info] There were 310732 set bits in the bit array
[2018-05-31 16:55:07.158] [stderrLog] [info] Computing transcript lengths
[2018-05-31 16:55:07.159] [stderrLog] [info] Waiting to finish loading hash
[2018-05-31 16:55:25.973] [jointLog] [info] done
[2018-05-31 16:55:25.973] [jointLog] [info] Index contained 310732 targets
[2018-05-31 16:55:25.973] [stderrLog] [info] Done loading index

processed 67500000 fragments
[jointLog] [info] Automatically detected most likely library type as IU
hits: 224580543, hits per frag: 3.35031
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 1.04% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.96% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.84% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.96% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.88% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 1.00% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.92% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.84% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.84% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.86% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.94% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.96% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.98% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.92% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.94% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.98% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.92% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 1.04% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 1.02% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.92% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.92% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.94% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.88% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.90% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.94% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.86% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.86% zero probability fragments
[2018-05-31 17:08:09.486] [jointLog] [info] Thread saw mini-batch with a maximum of 0.86% zero probability fragments
[2018-05-31 17:08:09.498] [jointLog] [info] Thread saw mini-batch with a maximum of 0.90% zero probability fragments
[2018-05-31 17:08:09.534] [jointLog] [info] Thread saw mini-batch with a maximum of 0.90% zero probability fragments
[2018-05-31 17:08:09.572] [jointLog] [info] Thread saw mini-batch with a maximum of 1.04% zero probability fragments
[2018-05-31 17:08:09.634] [jointLog] [info] Thread saw mini-batch with a maximum of 0.90% zero probability fragments
[2018-05-31 17:08:09.641] [jointLog] [info] Thread saw mini-batch with a maximum of 1.00% zero probability fragments
[2018-05-31 17:08:09.649] [jointLog] [info] Thread saw mini-batch with a maximum of 0.88% zero probability fragments
[2018-05-31 17:08:09.690] [jointLog] [info] Thread saw mini-batch with a maximum of 0.96% zero probability fragments
[2018-05-31 17:08:09.702] [jointLog] [info] Thread saw mini-batch with a maximum of 0.92% zero probability fragments
[2018-05-31 17:08:09.721] [jointLog] [info] Thread saw mini-batch with a maximum of 0.88% zero probability fragments
[2018-05-31 17:08:09.728] [jointLog] [info] Thread saw mini-batch with a maximum of 0.90% zero probability fragments
[2018-05-31 17:08:09.782] [jointLog] [info] Thread saw mini-batch with a maximum of 0.86% zero probability fragments
[2018-05-31 17:08:09.786] [jointLog] [info] Thread saw mini-batch with a maximum of 0.90% zero probability fragments

[2018-05-31 17:08:10.483] [jointLog] [info] Computed 582392 rich equivalence classes for further processing
[2018-05-31 17:08:10.483] [jointLog] [info] Counted 59985194 total reads in the equivalence classes
[2018-05-31 17:08:10.551] [jointLog] [info] Mapping rate = 88.2723%

[2018-05-31 17:08:10.551] [jointLog] [info] finished quantifyLibrary()
[2018-05-31 17:08:10.551] [jointLog] [info] Starting optimizer
[2018-05-31 17:08:11.488] [jointLog] [info] Marked 1 weighted equivalence classes as degenerate
[2018-05-31 17:08:11.523] [jointLog] [info] iteration = 0 | max rel diff. = 127.399
Exception : [Error in function boost::math::digamma(double): numeric overflow]
salmon quant was invoked improperly. For usage information, try salmon quant --help
Exiting.

rob-p commented 6 years ago

This seems to be a separate issue from the other (and a more informative exception). Once I've resolved the other issue, I'll probably bug you for a sample that triggers this one, though I have a reasonable idea of how to fix it. It would be nice to have the fixes for both issues in the same hotfix.

To be more specific: this is, as the exception says, a numeric overflow when evaluating the digamma function (its value blows up for arguments near zero). The solution is just to bump up the minimum value an argument must have before we evaluate this function. This should be straightforward, but I suspect the issue is also related to this log message:

[2018-05-31 17:08:11.488] [jointLog] [info] Marked 1 weighted equivalence classes as degenerate

knokknok commented 6 years ago

but I suspect the issue is also related to this log message

I wondered about that too but other samples gave me 2 and 3 degenerate classes and still passed...

rob-p commented 6 years ago

Right --- such classes should be removed (specifically because they can cause these overflow issues). I suspect I just need to make the evaluation bound more conservative. I chose the smallest value that worked on my testing machine, but the point at which overflow occurs may differ slightly between machines. I should just find / choose a stricter bound.

knokknok commented 6 years ago

Do you think you could tell me which line to edit? Since I can now work on a Linux box, this issue is more pressing for me than the other one...

rob-p commented 6 years ago

Yes. There are two places to look. The first is possibly changing this cutoff. The other is putting a guard around this line that only evaluates the digamma if alphaSum is greater than the digamma min cutoff. If alphaSum fails that check (it really should not), then you can likely just skip the rest of that function.

rob-p commented 6 years ago

I was able to fix the other bug you encountered (on OSX) --- it should now be fixed in the develop branch with a clean build. That makes this bug my current top priority. I'd like to fix it so I can merge both changes into master and cut a v0.10.1 release. Let me know the best way I can help address this one.

knokknok commented 6 years ago

I tried adding the guard around the digamma call on alphaSum, but it didn't work. I also set digammaMin to 1e-9, but no change. What I find weird is that the following code works...

#include <cstdio>
#include <boost/math/special_functions/digamma.hpp>

int main() {
  double logNorm = boost::math::digamma(1e-50);
  printf("%f\n", logNorm);
  return 0;
}

rob-p commented 6 years ago

I have a thought --- can you try compiling the above with -O3?

P.S. nevermind --- doesn't seem to make a difference on my end.

knokknok commented 6 years ago

It also works with -O3. Instead of the if (ap > ::digammaMin) lines, would it make sense to catch the exception and set expTheta[i] = 0.0; if it is thrown?

rob-p commented 6 years ago

I believe that would be the appropriate behavior, yes. However, I'm quite curious now what argument to digamma is causing this as I'm able to go to tiny values of the argument without seeing an exception.

knokknok commented 6 years ago

If you implement the try/catch and print the value in the catch code I might be able to tell you ;-)

rob-p commented 6 years ago

I'll be happy to go ahead and implement this, but the numeric_error exception that is thrown doesn't contain the value that caused the error, just the string that is printed out (Error in function boost::math::digamma(double): numeric overflow). The try/catch should be easy to do, but I'm quite interested in how this can even be triggered given that the range check is already being done.

rob-p commented 6 years ago

Ok, I pushed a commit to develop that wraps this in try/catch (and also tries to print out the relevant value of the argument from the context). Please let me know if (1) this averts the uncaught exception and (2) what argument to digamma is triggering this strange behavior :)!

knokknok commented 6 years ago

[2018-05-31 20:01:46.042] [jointLog] [info] Marked 1 weighted equivalence classes as degenerate
[2018-05-31 20:01:46.073] [jointLog] [info] iteration = 0 | max rel diff. = 127.422
Exception : [Error in function boost::math::digamma(double): numeric overflow]
salmon quant was invoked improperly. For usage information, try salmon quant --help
Exiting.

:-O

knokknok commented 6 years ago

Could it be coming from some other part of the code or some dependency?

rob-p commented 6 years ago

Well, in the other branch, expTheta is directly set to zero so that digamma is not called. Are you compiling from the develop branch? I'll try and see if it could be coming from anywhere else in the code.

rob-p commented 6 years ago

A quick grep shows these as the only uses of digamma in the code:

../src/CollapsedEMOptimizer.cpp:117:  double logNorm = boost::math::digamma(alphaSum);
../src/CollapsedEMOptimizer.cpp:126:          std::exp(boost::math::digamma(ap) - logNorm);
../src/CollapsedEMOptimizer.cpp:257:  double logNorm = boost::math::digamma(alphaSum);
../src/CollapsedEMOptimizer.cpp:270:                              std::exp(boost::math::digamma(ap) - logNorm);
knokknok commented 6 years ago

### salmon (mapping-based) v0.10.1

rob-p commented 6 years ago

The fact that this message (Exception : [Error in function boost::math::digamma(double): numeric overflow]) is printed means it is being caught by the "catch-all" exception handling here. Could you also make sure that it isn't triggered if you don't pass --useVBOpt (where no digamma should be called)?

rob-p commented 6 years ago

Also, I assumed (according to the documentation) that these underflow and overflow errors inherit from std::runtime_error, but I've updated the try/catch with a more generic exception class just in case they are not.

knokknok commented 6 years ago

I have added printfs before every digamma call and I get:

alphaSum: inf

So the try/catch is not in the right place ;-)

rob-p commented 6 years ago

Brilliant work ... and incredibly strange. Somehow, either one of the inputs in this sum is infinite, or adding them up leads to an infinity!

knokknok commented 6 years ago

EQVecT i:82396 M:310732 alphaIn:11098.990511 priorAlphas:0.025489
EQVecT i:83967 M:310732 alphaIn:inf priorAlphas:0.001380
EQVecT i:84187 M:310732 alphaIn:11102.387087 priorAlphas:0.036369

rob-p commented 6 years ago

Interesting: so here EQVecT i is the transcript id, right? This means transcript 83967 already has an infinite alpha value as input? Do we know what exact round of the algorithm this is? We know it's between 0 and 100.

knokknok commented 6 years ago

EQVecT means we are in the function with template <typename EQVecT>; i is the index of the loop that generates alphaSum. If I printf "Round" at line 257, I get:

[2018-05-31 22:27:34.996] [jointLog] [info] Starting optimizer
Round
[2018-05-31 22:27:35.226] [jointLog] [info] Marked 1 weighted equivalence classes as degenerate
Round
[2018-05-31 22:27:35.257] [jointLog] [info] iteration = 0 | max rel diff. = 127.379
Exception : [Error in function boost::math::digamma(double): numeric overflow]

rob-p commented 6 years ago

Yes, so in the loop that computes alphaSum, i is the transcript id. Regarding the round, can you print the value of itNum between lines 892 and 893? This will say what round of the VBEM we are executing. Basically, I want to figure out if we have an infinite value on the first iteration, or if some previous round of the VBEM resulted in an infinite output value being placed into some transcript --- and if so, what round this happens in.

knokknok commented 6 years ago

BTW, I don't know if it could be related, but make test gives me:

Running tests...
Test project salmon/build
    Start 1: unit_tests
1/3 Test #1: unit_tests .......................***Failed    0.02 sec
    Start 2: salmon_read_test_fmd
2/3 Test #2: salmon_read_test_fmd .............   Passed    1.74 sec
    Start 3: salmon_read_test_quasi
3/3 Test #3: salmon_read_test_quasi ...........   Passed    1.62 sec

67% tests passed, 1 tests failed out of 3

Total Test time (real) = 3.37 sec

The following tests FAILED:
    1 - unit_tests (Failed)
Errors while running CTest
Makefile:151: recipe for target 'test' failed
make: *** [test] Error 8

knokknok commented 6 years ago

itNum: 0
itNum: 1
[2018-05-31 22:48:54.566] [jointLog] [info] Computed 583973 rich equivalence classes for further processing
[2018-05-31 22:48:54.566] [jointLog] [info] Counted 59985214 total reads in the equivalence classes
[2018-05-31 22:48:54.609] [jointLog] [info] Mapping rate = 88.2723%

[2018-05-31 22:48:54.609] [jointLog] [info] finished quantifyLibrary()
[2018-05-31 22:48:54.610] [jointLog] [info] Starting optimizer
[2018-05-31 22:48:54.829] [jointLog] [info] Marked 1 weighted equivalence classes as degenerate
[2018-05-31 22:48:54.859] [jointLog] [info] iteration = 0 | max rel diff. = 127.38
Exception : [Error in function boost::math::digamma(double): numeric overflow]
salmon quant was invoked improperly.

rob-p commented 6 years ago

Probably unrelated, but also unexpected. make test on my Linux and OSX boxes looks like:

$ make test
Running tests...
Test project /Users/rob/salmon/build
    Start 1: unit_tests
1/3 Test #1: unit_tests .......................   Passed    0.13 sec
    Start 2: salmon_read_test_fmd
2/3 Test #2: salmon_read_test_fmd .............   Passed    0.87 sec
    Start 3: salmon_read_test_quasi
3/3 Test #3: salmon_read_test_quasi ...........   Passed    0.39 sec

100% tests passed, 0 tests failed out of 3

Total Test time (real) =   1.41 sec

It looks the same on the continuous integration server:

Running tests...
/usr/local/cmake-3.9.2/bin/ctest --force-new-ctest-process 
Test project /home/travis/build/COMBINE-lab/salmon/build
    Start 1: unit_tests
1/3 Test #1: unit_tests .......................   Passed    0.13 sec
    Start 2: salmon_read_test_fmd
2/3 Test #2: salmon_read_test_fmd .............   Passed    2.55 sec
    Start 3: salmon_read_test_quasi
3/3 Test #3: salmon_read_test_quasi ...........   Passed    1.72 sec
100% tests passed, 0 tests failed out of 3
Total Test time (real) =   4.41 sec

Also, you can look in the build directory, in the subdirectory Testing/Temporary/LastTestsFailed.log, which will give details of which specific test failed.

rob-p commented 6 years ago

Ok, so it looks like we complete iteration 0, so the initial alphas (before we start the VBEM) are OK. But after one call to VBEMUpdate_, we have a transcript with an infinite count. So it happens almost immediately.

knokknok commented 6 years ago

$ cat Testing/Temporary/LastTestsFailed.log
1:unit_tests

rob-p commented 6 years ago

ha ... that log isn't particularly interesting --- what happens if you run ./src/unitTests from the build directory?

knokknok commented 6 years ago

./src/unitTests

All tests passed (108 assertions in 4 test cases)

rob-p commented 6 years ago

hrmmmmm .... so then the failure is stochastic?

knokknok commented 6 years ago

Well... ./src/unitTests seems to always pass and make test to always fail...

rob-p commented 6 years ago

can you try make install before make test?

knokknok commented 6 years ago

$ sudo make install
[  7%] Built target libcereal
[ 14%] Built target libdivsufsort
[ 21%] Built target libstadenio
[ 28%] Built target libbwa
[ 36%] Built target libgff
[ 42%] Built target libspdlog
[ 47%] Built target ksw2pp_basic
[ 49%] Built target ksw2pp_sse4
[ 52%] Built target ksw2pp_sse2
[ 53%] Built target ksw2pp
[ 55%] Built target alevin_core
[ 69%] Built target salmon_core
[ 74%] Built target unitTests
[100%] Built target salmon
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /usr/local/lib
-- Up-to-date: /usr/local/lib/libtbbmalloc.so
-- Up-to-date: /usr/local/lib/pkgconfig
-- Up-to-date: /usr/local/lib/libtbb.so
-- Up-to-date: /usr/local/lib/libtbb.so.2
-- Up-to-date: /usr/local/lib/libtbbmalloc_proxy.so.2
-- Up-to-date: /usr/local/lib/libtbbmalloc_proxy.so
-- Up-to-date: /usr/local/lib/libtbbmalloc.so.2
-- Up-to-date: /usr/local/bin/salmon
-- Up-to-date: /usr/local/lib/libsalmon_core.a

Installation complete. Please ensure the following paths are set properly.

Please add /usr/local/bin to your PATH
Please add /usr/local/lib to your LD_LIBRARY_PATH

$ make test
Running tests...
Test project salmon/build
    Start 1: unit_tests
1/3 Test #1: unit_tests .......................***Failed    0.02 sec
    Start 2: salmon_read_test_fmd
2/3 Test #2: salmon_read_test_fmd .............   Passed    1.67 sec
    Start 3: salmon_read_test_quasi
3/3 Test #3: salmon_read_test_quasi ...........   Passed    1.62 sec

67% tests passed, 1 tests failed out of 3

Total Test time (real) = 3.32 sec

The following tests FAILED:
    1 - unit_tests (Failed)
Errors while running CTest
Makefile:151: recipe for target 'test' failed
make: *** [test] Error 8

knokknok commented 6 years ago

Ok, GMT+2 here, have to go ;-)

rob-p commented 6 years ago

It must be a different unitTest executable that is running, since make test is just using CMake to run the same executable in ./src/unitTests. I'm happy to help debug this further, but I'm currently most interested in how a transcript gets an infinite count after round 0.

rob-p commented 6 years ago

Ok, GMT+2 here, have to go ;-)

Of course; thank you so much for your help today! Glad we were able to squash the other bug. I'll think more about this vbOpt bug. If I can find a dataset to run that triggers it, I can probably debug it faster (like the other bug).

knokknok commented 6 years ago

thank you so much for your help today!

Sure, after all it is quite selfish ;-)

I have been trying to generate a smaller transcript db and fastq files, but to no avail (I can't make it fail again).

However, it looks like your test if (denom <= ::minEQClassWeight) might not be stringent enough:

invDenom:inf count:1 denom:2.77171e-321 minEQClassWeight:4.94066e-324
invDenom:inf count:1 denom:4.69042e-316 minEQClassWeight:4.94066e-324

These two lines result from:

groupSize: 2
i:0 tid:83966 aux:0.756044 expTheta[tid]:0
i:1 tid:83967 aux:0.243956 expTheta[tid]:7.23806e-321

groupSize: 2
i:0 tid:190925 aux:0.542131 expTheta[tid]:0
i:1 tid:272773 aux:0.457869 expTheta[tid]:6.2423e-316

knokknok commented 6 years ago

I have replaced if (denom <= ::minEQClassWeight) with if (denom <= 1e-200) and my samples seem to pass... Although I have no idea whether such low denom values are "normal" and/or whether this fix could negatively impact the resulting quantifications...

rob-p commented 6 years ago

Just to get more info about exactly the equivalence class causing this problem, can you try to change this definition of verbose{false} to verbose{true} ? https://github.com/COMBINE-lab/salmon/blob/develop/src/CollapsedEMOptimizer.cpp#L376

rob-p commented 6 years ago

Also, out of curiosity. Does it work if you make 1e-200 to be std::numeric_limits<double>::min()?

knokknok commented 6 years ago

Also, out of curiosity. Does it work if you make 1e-200 to be std::numeric_limits<double>::min()?

I'll tell you later but it might:

#include <cstdio>
#include <limits>

int main() {
  printf("%g\n", 1/1e-310);
  printf("%g\n", std::numeric_limits<double>::min());
  printf("%g\n", 1/std::numeric_limits<double>::min());
}

inf
2.22507e-308
4.49423e+307

knokknok commented 6 years ago

It did pass on one of the problematic samples. Also:

Dropping weighted eq class
============================
denom = 0, count = 1
class = { 140152 140153 140154 }
alphas = { 0 0 0 }
weights = { 0.954631 0.029385 0.0159838 }
============================

rob-p commented 6 years ago

Awesome! Thanks for checking it out. I think having denorm_min become min makes sense. I'm gonna push both this and #228 to master and cut a 0.10.1 release (and I'll be sure to thank you in the release notes). I'll close these issues once that release is done.

rob-p commented 6 years ago

These fixes are now live in v0.10.1 --- thanks again!

knokknok commented 6 years ago

Great! Thanks! (Actually it is knokknok, the correct spelling was already taken ;-) )