RaviSoji / plda

Probabilistic Linear Discriminant Analysis & classification, written in Python.
https://ravisoji.com
Apache License 2.0
128 stars, 31 forks

How to get scores from PLDA #48

Closed alexgomezalanis closed 5 years ago

alexgomezalanis commented 5 years ago

@RaviSoji How could I get the score which the PLDA assigns to a test input vector?

Right now I am using calc_logp_pp_categories for getting scores, but they are log probabilities and not raw scores.

RaviSoji commented 5 years ago

Thanks for writing!

By scores, do you mean the latent features? If so, check out cell 28 in the MNIST demo. I think the following line is what you are looking for:

U_model = classifier.model.transform(training_data, from_space='D', to_space='U_model')

Let me know whether this is or isn't what you are looking for, and then we can go from there.

alexgomezalanis commented 5 years ago

Yes, I am obtaining U_model in that way, and then I am calculating the log probabilities using U_model as input. Does it make sense to you?

alexgomezalanis commented 5 years ago

I am also very curious about the normalisation that you apply in calc_logp_pp_categories. Is the normalisation computed using only the training data?

alexgomezalanis commented 5 years ago

@RaviSoji

Just to put you in context, I am using this classifier for detecting spoofing attacks. I need to get the scores of genuine utterances and spoofing attacks utterances.

This is how I am using it:

z = np.concatenate((z_genuine, z_spoof))
y = np.concatenate((Y_genuine, Y_spoof))
U_model = clf.model.transform(z, from_space='D', to_space='U_model')
scores, K = clf.calc_logp_pp_categories(U_model, False)
print(getEER(scores, y))
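(For anyone reading along: getEER above is the poster's own helper, not part of plda. A minimal sketch of an equal error rate computation from scores and binary labels might look like the following, assuming higher scores indicate genuine utterances and labels use 1 = genuine, 0 = spoof.)

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: the operating point where the false-accept
    rate (spoof accepted as genuine) equals the false-reject rate
    (genuine rejected)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    eer, best_gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # false-accept rate
        frr = np.mean(scores[labels == 1] < t)   # false-reject rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy example: three genuine and three spoof scores.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(compute_eer(scores, labels))  # 0.333... (FAR = FRR = 1/3 at t = 0.7)
```

The brute-force threshold sweep is fine for small score sets; for large ones you would typically interpolate the ROC curve instead.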

RaviSoji commented 5 years ago

It looks like by scores, you mean the unnormalized log densities of test data being generated by each category in the training set. In that case, the following line that you wrote looks correct: scores, K = clf.calc_logp_pp_categories(U_model, False).

If you normalize the log densities, then for each test datum you can think of the normalization as: (1) exponentiate the log density under each category, (2) sum those densities, and (3) divide each category's density for that datum by this total. Of course, for numerical stability, this is done in log space.
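The three steps can be sketched generically with NumPy and SciPy's logsumexp; this is an illustration of the math, not the library's internal code:

```python
import numpy as np
from scipy.special import logsumexp

# Unnormalized log densities of one test datum under three categories.
logps = np.array([-10.0, -12.0, -11.0])

# Steps (1)-(3) done naively in linear space:
densities = np.exp(logps)   # (1) exponentiate each log density
total = densities.sum()     # (2) sum the densities
naive = densities / total   # (3) divide each density by the total

# The numerically stable equivalent, staying in log space:
log_norm = logps - logsumexp(logps)
stable = np.exp(log_norm)

assert np.allclose(naive, stable)
assert np.isclose(stable.sum(), 1.0)  # normalized probabilities sum to 1
```

The log-space version matters when the log densities are very negative: np.exp would underflow to zero and the naive division would break.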

Let me know if that isn't clear!

RaviSoji commented 5 years ago

I just took another look at this and realized that I may still be interpreting the question incorrectly. If all you want to do is classify new data as "spoof" or "not spoof" and obtain the accompanying probabilities, use the predict() method on the classifier, and set normalize_logps to True:

predict(data, space='D', normalize_logps=True)

This will save you the effort of having to transform the data manually and it will return log probabilities.

The equations are actually written in the docstrings, so I am going to close this issue for now, but feel free to write back if you still need help!

alexgomezalanis commented 5 years ago

Yes, it all makes sense to me now. Thanks for your prompt responses and for this library!

RaviSoji commented 5 years ago

You're welcome, and good luck!