PyProphet / pyprophet

PyProphet: Semi-supervised learning and scoring of OpenSWATH results.
http://www.openswath.org
BSD 3-Clause "New" or "Revised" License

Question about calculating scores based on the LDA `scalings_` #71

Closed mantouRobot closed 5 years ago

mantouRobot commented 5 years ago

Dear @grosenberger , @uweschmitt , @hroest ,

After the LDA has been fitted on the training data, we need to score the test data using the fitted model's parameters. As far as I know, there are two ways to compute the scores:

  1. "LinearDiscriminantAnalysis().transform()". This function transforms the features to the new small subspace. In fact, it scores like this: np.dot(X - lda.xbar, lda.scalings)

  2. `LinearDiscriminantAnalysis().predict()`. This function determines the classification based on: `np.dot(X, lda.coef_.T) + lda.intercept_`

A reference can be found here.
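
For concreteness, here is a minimal sketch of both computations (assuming scikit-learn's default `svd` solver, which is what defines `xbar_` and `scalings_`; the random data and variable names are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data standing in for target/decoy feature vectors.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (rng.rand(100) > 0.5).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Method 1: transform() projects onto the discriminant subspace.
assert np.allclose(lda.transform(X), np.dot(X - lda.xbar_, lda.scalings_))

# Method 2: predict() thresholds the decision function.
assert np.allclose(lda.decision_function(X),
                   (np.dot(X, lda.coef_.T) + lda.intercept_).ravel())
```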

But in the file `classifiers.py`, the `score()` function just computes `clf_scores = np.dot(X, lda.scalings_)`. Confusingly, in `start_semi_supervised_learning` the scores are centered with `clf_scores -= np.mean(clf_scores)` (the mean of `clf_scores` is not always zero?), but in `iter_semi_supervised_learning` the mean is not subtracted from `clf_scores`.

In conclusion, I have doubts about the correctness of the score formula used by pyprophet. Should it be `np.dot(X - lda.xbar_, lda.scalings_)` instead of `np.dot(X, lda.scalings_)`? Or maybe the two formulas make no difference to the final result in the end.

Thanks.

uweschmitt commented 5 years ago

You can rewrite `np.dot(X - lda.xbar_, lda.scalings_)` as `np.dot(X, lda.scalings_) - np.dot(lda.xbar_, lda.scalings_)`, and the second term is a constant. Thus we only shift the results here, which is compensated later when we compute the q-values.
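
A quick numerical check of this identity (a sketch on random data; the fitted `lda` and all names are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(1)
X = rng.randn(80, 4)
y = (rng.rand(80) > 0.5).astype(int)
lda = LinearDiscriminantAnalysis().fit(X, y)

shift = np.dot(lda.xbar_, lda.scalings_)     # a constant, independent of the sample
scores_pyprophet = np.dot(X, lda.scalings_)  # what classifiers.py computes
scores_transform = lda.transform(X)          # np.dot(X - lda.xbar_, lda.scalings_)

# The two scorings differ only by that constant shift.
assert np.allclose(scores_pyprophet - shift, scores_transform)
```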

mantouRobot commented 5 years ago

Dear @uweschmitt ,

Thanks for your reply. But I do not understand how a shift that is only compensated later is meaningful. Why not directly use the raw `transform()` output as the score?

Thanks.

uweschmitt commented 5 years ago

We did not use the transformation for historical reasons: the predecessor mProphet did not subtract the mean, and we wanted reproducible results when we rewrote mProphet in Python. Moreover, the q-value is invariant with respect to this difference. The compensation happens in the `pnorm` function.

mantouRobot commented 5 years ago

@uweschmitt In the `pnorm` function, what pyprophet does is subtract the mean of the decoy scores. But `lda.xbar_` is the mean over all samples, so the compensation doesn't seem to make sense mathematically.

uweschmitt commented 5 years ago

Exactly, you subtract a mean. As I posted before, your modification introduces a constant shift for all scores. So what happens if you shift a set of scores by a constant value and later subtract their mean? You get the same values as without the shift.

You can also try this out: implement your modification and compare the new and old results.
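
A sketch of that experiment; the `standardize_against_decoys` helper below only mimics what the `pnorm`-based step does (centering on the decoy distribution) and is not pyprophet's actual code:

```python
import numpy as np

def standardize_against_decoys(scores, is_decoy):
    """Center/scale scores by the decoy distribution, roughly as the
    pnorm step does before computing p-values (illustrative helper)."""
    decoys = scores[is_decoy]
    return (scores - decoys.mean()) / decoys.std()

rng = np.random.RandomState(2)
scores = rng.randn(1000)
is_decoy = rng.rand(1000) > 0.5

# Any constant shift, e.g. np.dot(lda.xbar_, lda.scalings_).
shifted = scores - 42.0

# Standardizing against the decoy mean cancels the shift exactly,
# so downstream p-values and q-values are unchanged.
assert np.allclose(standardize_against_decoys(scores, is_decoy),
                   standardize_against_decoys(shifted, is_decoy))
```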

mantouRobot commented 5 years ago

Dear @uweschmitt , you are right. I tried the modification and the results are the same. Now I understand that the score's absolute value is not the key point, because we convert the scores to p-values based on the distribution of the scores. Hence a shift applied to all scores makes no difference to the final result.

Thanks again.