Question about log probabilities

MatteoLacki / IsoSpec

Libraries for fine isotopic structure calculator.

Other

35 stars 10 forks source link

Question about log probabilities #9

Closed hroest closed 5 years ago

hroest commented 5 years ago

I am trying to understand this piece of code:

https://github.com/MatteoLacki/IsoSpec/blob/61bcca5e79acb35854742d0b10c6a59a80726cb5/IsoSpec%2B%2B/marginalTrek%2B%2B.cpp#L142-L148

It seems that you are computing the log of a number but then you actually go and look up the log in a lookup table and overwrite the initial value with the one from the lookup table. I am trying to understand why this is done and why it is necessary.

The reason I am asking is because we want to use IsoSpec in OpenMS, but we would like to have the element tables and frequencies configurable by the user in an XML document. So we would like to provide our own elemental frequencies and I have found two places where IsoSpec uses the tables with frequencies, here and in parse_formula. It is clea why the latter is needed but I cannot understand why IsoSpec needs to access the tables again a second time here.

michalsta commented 5 years ago

The reason it is done is for a bit more of numeric correctness: the function basically calculates the log of an element's probability. However, when the element's probability is is equal to one of the pre-defined probabilities of one of the elements (which happens whenever the user does not define his own probabilities) then the computed log is replaced by one from element_tables.h. These are logically the same thing, only calculated with more precision than the stdlib's log function has. If we cannot use that (because the user has defined non-standard probabilities), then, oh well, we use stdlib's log and hope for the best. This is probably a bit too much of numeric paranoia on our side and can probably be safely omitted in OpenMS.

By the way: we have already started a project of OpenMS integration, but it is in a (very) unfinished state and we had to put it on temporary hiatus due to other projects piling up... However, if it helps you, feel free to have a look at: https://github.com/michalsta/OpenMS and grab from there whatever you might find useful :)

Also, if you want to collaborate on this, we'll be happy to help you any way we can.

hroest commented 5 years ago

Ok, that explains it. It could be too much paranoia, especially as the isotopic abundances and masses are also not that accurate, they all come with an experimental error -- also did you know that of course in the literature they report average isotopic abundances, but of course these change depending on geography.

Do you mean this branch: https://github.com/michalsta/OpenMS/tree/features/fgid%2Bisospec ? I have seen it and we do something quite similar. Maybe you can have a look at the PR and provide some comments https://github.com/OpenMS/OpenMS/pull/3766/files#diff-e5d91bee7c44ad19971ed8bfd698ad84 ? That would be quite helpful.

So far I have only implemented the "threshold" method, I am not sure whether the other methods should also be exposed through OpenMS. What is your intuition on that?

MatteoLacki commented 5 years ago

Hi,

there's a multitude of reasons for the frequencies to change: not only geography, but also geology, and biology are in the play. You're right about that. I would really like to add to IsoSpec the possibility to aggregate the spectra to a given precision (or according to a given binning strategy, which might be better for some Orbitrap data) on the flight because right now I have to do it in Python in another project.

Mateusz Krzysztof Łącki

German tel. +49 159 01681376 Polish tel. +48 579 647 311 Skype: mathewin GitHub: MatteoLacki https://github.com/MatteoLacki

On Thu, Oct 4, 2018 at 8:21 PM Hannes Roest notifications@github.com wrote:

Ok, that explains it. It could be too much paranoia, especially as the isotopic abundances and masses are also not that accurate, they all come with an experimental error -- also did you know that of course in the literature they report average isotopic abundances, but of course these change depending on geography.

Do you mean this branch: https://github.com/michalsta/OpenMS/tree/features/fgid%2Bisospec ? I have seen it and we do something quite similar. Maybe you can have a look at the PR and provide some comments https://github.com/OpenMS/OpenMS/pull/3766/files#diff-e5d91bee7c44ad19971ed8bfd698ad84 ? That would be quite helpful.

So far I have only implemented the "threshold" method, I am not sure whether the other methods should also be exposed through OpenMS. What is your intuition on that?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MatteoLacki/IsoSpec/issues/9#issuecomment-427120602, or mute the thread https://github.com/notifications/unsubscribe-auth/ADx9AOnYN7eB4hLQLbAzz-iX8S7O5fa1ks5uhlHFgaJpZM4XIjX7 .

michalsta commented 5 years ago

Yeah, the experimental error (and geographic variation) is of course orders of magnitude above the error introduced by the stdlib's log function - however the algorithm will go awry if these probabilities do actually sum up to less than 1.0 due to numeric errors (and that was the case with probabilities we got from IUPAC, we had to rescale them manually), hence the precalculated values. And the slight paranoia regarding numerics ;)

Basically, when the user asks the algorithm to cover like 0.999999 of configuration space of some gargantuan molecule, and it turns out that the probabilities actually sum up to 0.9999973 due to numerics, then the algorithm will try to produce ALL possible configurations - and, this being a gargantuan molecule and the configuration space having 10^(large number) of elements it will crash the machine while trying to store them.

A quick and easy (if a bit dirty) way to ensure this does not happen, while still dropping the precalculated probability tables, would be to rescale any input probabilities in such a way that they sum to sth like 1.0000001 - noone probably needs this kind of precision, IUPAC certainly does not provide it, and it will ensure that the algorithm will work.

Note though that this only pertains to algorithm versions other than the threshold one, those that try to fill a percentage of the configuration space - threshold should work fine in any conditions.

I'll have a look at the PR over the weekend, it's sorta large to go through in one sitting ;)

Regarding whether to expose other methods through OpenMS: that would be nice, though perhaps it could be a longer-term goal, especially as the multithreaded versions, and the layered versions are still a Work-In-Progress for us. The threshold version on the other hand is pretty much finished, it's hard to imagine what we could still optimize there. So is the ordered version, so maybe that too.

hroest commented 5 years ago

I'll have a look at the PR over the weekend, it's sorta large to go through in one sitting ;)

I think the only file that is important to look at is this one https://github.com/OpenMS/OpenMS/pull/3766/files#diff-e5d91bee7c44ad19971ed8bfd698ad84 src/openms/source/CHEMISTRY/ISOTOPEDISTRIBUTION/IsoSpec.cpp all the other files are the library itself, tests and wrapper.

Note though that this only pertains to algorithm versions other than the threshold one, those that try to fill a percentage of the configuration space - threshold should work fine in any conditions.

ok, I think that makes a lot of sense as we are mostly looking for a robust solution. Also practically speaking, most people are probably interested in the topN peaks or the peaks above a threshold since that is what the mass spec can measure, the "coverage of probability space" is more of theoretical interest in my view.

Regarding whether to expose other methods through OpenMS: that would be nice, though perhaps it could be a longer-term goal, especially as the multithreaded versions, and the layered versions are still a Work-In-Progress for us. The threshold version on the other hand is pretty much finished, it's hard to imagine what we could still optimize there. So is the ordered version, so maybe that too.

one issue I noticed is that you could use C++11 for the threads instead of pthreads - which for one would make it easier to port to MS Windows - which OpenMS needs to do. Also, for us a key question is whether we can calculate multiple molecules at once because I think parallelization at this level is more interesting for clients. However, for this all of IsoSpec needs to be thread safe and there cannot be any global variables.

MatteoLacki commented 5 years ago

Hi,

Also practically speaking, most people are probably interested in the topN peaks or the peaks above a threshold since that is what the mass spec can measure, the "coverage of probability space" is more of theoretical interest in my view.

It is not really the case that the experimental data can provide us with a clear idea of the height of the lowest peak needed for the calculations. Usually the instrument aggregates the isotopologues because of its finite resolution. As a consequence, one would be underestimating the number of peaks necessary to represent the distribution at a nicely defined level of say 99.9% of the distribution. This might have some influence on the quality of the deisotopisation. Also, although this is more of a theoretical argument, without a test in real mass spectra, the intensity of the lowest peaks should be among the most volatile in the spectrum, exposing the method to risks of tying the results to statistical noise. I don't like that :)

I noticed is that you could use C++11 for the threads instead of pthreads - which for one would make it easier to port to MS Windows - which OpenMS needs to do.

OK, we will use C++ threads.

Also, for us a key question is whether we can calculate multiple molecules at once because I think parallelization at this level is more interesting for clients. However, for this all of IsoSpec needs to be thread safe and there cannot be any global variables.

We are 99% sure that there is currently no dependence on global variables and the code should work well in parallel.