MatteoLacki / IsoSpec

Libraries for fine isotopic structure calculator.
Other
35 stars 10 forks source link

[BUG] fixes bug in subisotope generator #14

Closed hroest closed 5 years ago

hroest commented 5 years ago

I think I found a bug in the subisotope generator that would lead to the calculation of the wrong number of subisotope configuration. Specifically it lead to the calculation of fewer isotopic configuration than necessary as illustrate in the test file.

Rationale and Analysis

At the moment, a ptr to the current configuration was stored in visited which meant the same ptr was stored N times instead of storing N different configurations. This meant that visited.count(currentConf) would not work reliably. Since this is the abort condition in the loop of PrecalculatedMarginal::PrecalculatedMarginal, the loop aborted too early.

Actually, it is surprising that it worked at all since the same data was stored in all entries of visited -- but with a different hash each time since at the time of insertion the correct data was present behind the ptr, however later when currentConf changted, the data also changed.

So this was really broken but amazingly kind of worked due to the unordered_set probably only using the hashes for its count function and never actually comparing the data (which was actually always the same!). However, depending on the implementation or the hash function, at some point some comparison was made and then noticed that all the data was actually the same. So it is just a matter of time until this breaks and leads to an early abort in the PrecalculatedMarginal::PrecalculatedMarginal function, which means not enough subisotopologues were calculated.

Impact

Since the bug relies on implementation-specific behavior of std::unordered_map it is hard to say how strong the impact is since it will depend on platform and implementation. However, it is clear that wrong numbers will be calculated. From my tests below, I see a anything from 20% error to a 2.3-fold error in the correct number of subisotopologues. One question remains is whether this could affect the conclusions from the paper (I hope not).

Fix

Simply store a copy of the current configuration in visited. I chose to make a single copy and store it both visited and configurations since those data structures are read only and there is no need to use up extra memory.

ATTENTION: I did not check whether the same bug is present elsewhere in the code and I only tested the fix for IsoThresholdGenerator.

Test

Testing is somewhat hard since I did not calculate what the correct number of subisotopologues really is, the only one I can say this for sure is that for C100 at very high threshold we should have 101 subisotopologues. MSVS and clang on OSX did not produce that result and we therefore found the bug. However, we also found that on other formulae a smaller number of subisotopologues was calculated (e.g. for insulin this happened above 1e-100).

current output:

-- MSVC
C:\openms\source\IsoSpec\IsoSpec++\build_debug>nr_conf.exe
6
97
89751
111619623
1063898204

-- Linux clang++
6
101
90419
135147222
2480098209
49.54user 2.03system 0:51.61elapsed 99%CPU (0avgtext+0avgdata 7606244maxresident)k
0inputs+0outputs (0major+1907998minor)pagefaults 0swaps

output after bugfix

-- Linux clang++
$ /usr/bin/time ./nr_conf 
6
101
90419
135147222
2480155266
48.21user 1.89system 0:50.11elapsed 99%CPU (0avgtext+0avgdata 7606264maxresident)k
0inputs+0outputs (0major+1907999minor)pagefaults 0swaps

-- MSVC
C:\openms\source\IsoSpec\IsoSpec++\build_debug>nr_conf.exe
6
101
90419
135147222
2480155266

so the results for MSVC before (1063898204) and after (2480155266) show a 2.3x difference!

Task: we should have an independent calculation to verify that these numbers are now correct and there isnt some other bug in the code.

michalsta commented 5 years ago

Hi,

The bugfix looks right. Thankfully, the bug itself was introduced only after we have written the paper, as we have pretty much rewritten most of the codebase since then.

hroest commented 5 years ago

The bugfix looks right. Thankfully, the bug itself was introduced only after we have written the paper, as we have pretty much rewritten most of the codebase since then.

ok, that is good to hear!

Do you have any way to numerically verify the results of IsoSpec, e.g. through exact calculations of the binomial distributions? this would be doable for a single, small molecule just to compare the results of the approximation. I noticed during testing that now the numbers are different but we have no idea which number is correct, so that was worrisome to me.

ATTENTION: I did not check whether the same bug is present elsewhere in the code and I only tested the fix for IsoThresholdGenerator.

can you check whether the same issue is also present elsewhere in the code? E.g. the other generators?

michalsta commented 5 years ago

@ hroest

Do you have any way to numerically verify the results of IsoSpec, e.g. through exact calculations of the binomial distributions? this would be doable for a single, small molecule just to compare the results of the approximation. I noticed during testing that now the numbers are different but we have no idea which number is correct, so that was worrisome to me.

Yes, the plan is basically for me and Mateusz to get our act together and release a stable 1.9.0 over the next few days (as the repo is now at a 1.9.0alpha2 state, and the "alpha" part kind of shows, as we can see...). That will include some proper documentation, some proper git repo cleanup (like properly separating release, master and development branches), ripping out experimental code out of the 1.9.0 branch, and most of all, testing.

Regarding tests, IsoSpec already does exact (well, computer-floating-point-exact) calculations on binomial (and multinomial) distributions, no approximations involved, just occasional plain and simple bugs... ;) To hunt down these we are planning to do exactly what we did for 1.0.X release, that is, write some tests that compare the results of IsoSpec with the results of other isotopic libraries, like envipat or ecipex.

can you check whether the same issue is also present elsewhere in the code? E.g. the other generators?

Yeah, the other generators are fine. At least wrt. this bug. The Layered one is still experimental as heck, and will blow up in people's faces given the chance, we'll rip it out for 1.9.0, and port the one from 1.0.7 to the new interface. The new experimental one has some dramatic speed improvements, but it will take a lot more work to get this one right, and I just don't want to wait with 1.9.0 until that happens. It'll be the main feature of the 2.0 release, along, (possibly) with multithreading. The Ordered and Threshold generators are in a beta state - not yet seriously tested, but as far as I know they should work.

Regarding OpenMS integration: I think that a sensible path forward is for you to see if you need any more "architectural"-kind of changes, that will make keeping the codebases in sync easier, like introducing a separate namespace, or doing the "#pragma once" instead of old-style header guards. You can leave serious testing, general code cleanup and bughunting to us, you have done enough of our work for us already ;) just give us a few days.

hroest commented 5 years ago

Yes, the plan is basically for me and Mateusz to get our act together and release a stable 1.9.0 over the next few days (as the repo is now at a 1.9.0alpha2 state, and the "alpha" part kind of shows, as we can see...). That will include some proper documentation, some proper git repo cleanup (like properly separating release, master and development branches), ripping out experimental code out of the 1.9.0 branch, and most of all, testing.

:+1: great!

Regarding tests, IsoSpec already does exact (well, computer-floating-point-exact) calculations on binomial (and multinomial) distributions, no approximations involved, just occasional plain and simple bugs... ;) To hunt down these we are planning to do exactly what we did for 1.0.X release, that is, write some tests that compare the results of IsoSpec with the results of other isotopic libraries, like envipat or ecipex.

sounds good to me

Regarding OpenMS integration: I think that a sensible path forward is for you to see if you need any more "architectural"-kind of changes, that will make keeping the codebases in sync easier, like introducing a separate namespace, or doing the "#pragma once" instead of old-style header guards. You can leave serious testing, general code cleanup and bughunting to us, you have done enough of our work for us already ;) just give us a few days.

sounds great, I am happy to do that and let you work on the release. I think we mostly have what we need in terms of architectural changes and we are looking forward to the new release 1.9.0 - hopefully we can just use it as a drop-in replacement on the OpenMS side. Feel free to also continue to our discussion over at OpenMS on what kind of interface to IsoSpec people might want to have. I think I have revised my opinion a bit and I think having the Layered generator (the one providing a total probability cutoff) would also be important to have in OpenMS. I think your argument is correct that this allows one to control the error rate of binned / aggregated spectra while a absolute threshold may not have a clear representation in how much error the probabilities have once you aggregate / bin.