Closed bretttully closed 6 years ago
Thanks for your countribution !
What part of scikit-learn
we are using is randomized ? AFAIK we only use LDA which is deterministic and should always provide the same result for fixed input data.
Hi @uweschmitt -- that is likely my mistake then. I was running OpenSWATH+PyProphet multiple times on the same input file and continually getting slightly different answers even when setting the random seed. I assumed there must be some stochastic element to the PyProphet implementation, and had assumed (probably incorrectly) that this was an interaction between np.random
and random
through sklearn.
One of the other sources of randomness I found was the ordering of the results from OpenSWATH. If you run this in threaded mode, the output has the same content from run to run, but is in a different order (I assume there are race conditions between the threads). So you have to sort the OpenSWATH file before you pass it in to PyProphet as well. Perhaps this last step is sufficient and this pull request isn't needed?
@bretttully But your fix removed the randomness ? If yes we have to investigate this further.
@hroest @grosenberger can you comment on the ordering issue please ?
The combination of the two fixes did remove the inconsistent results; I didn't test the fixes independently -- I can look in more detail next week.
Yes, if OpenSWATH is run with multiple threads, the results are ordered differently. @bretttully did you have a chance to check whether the patch is needed when PyProphet is applied to the same OpenSWATH report twice in debug mode?
@bretttully Did you have a chance to look again into this issue? Could you resolve the problem without the fix or is still required?
From what I can tell, so long as the results are sorted before running pyprophet, its output is consistent without this change.
Thanks for the information. I will then close the PR for the moment.
@grosenberger with the new sqlite format, will that be inserted in a consistently sorted order? Or will it also suffer from the race condition of different threads?
The data is still inserted in a random order by OpenSWATH, but I will implement sorted querying by PyProphet.
In order to ensure that setting the random seed filters through to sklearn, we need to make sure that np.random is used everywhere. This now means that if you run pyprophet on the same input file, and set the random seed, you will get the same output. Before this change, that was not the case.