biocore / taxster

taxster: assigning taxonomy to organisms you've never even heard of
BSD 3-Clause "New" or "Revised" License

flat priors? #4

Closed audy closed 8 years ago

audy commented 10 years ago

Bayesian classifiers estimate class priors from class frequencies in the training data, then use these priors to calculate class probabilities when predicting. Predicted class probabilities therefore depend on the number of representative sequences per class in the database. This creates two problems. First, predicted class labels depend on how often each label appears in the training data, not only on how well the query matches each class. Second, cross-validation performance may be overstated because the training and testing data have the same class distribution.
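
To make the dependence explicit, here is the usual Bayes rule with priors estimated from counts. The notation is my own shorthand, not anything from taxster: $x$ is the query sequence, $c$ a class label, $n_c$ the number of training sequences labelled $c$, and $K$ the number of classes.

$$
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{\sum_{c'} P(x \mid c')\,P(c')}, \qquad \hat{P}(c) = \frac{n_c}{\sum_{c'} n_{c'}}
$$

With flat priors, $P(c) = 1/K$ for every class, so the prior cancels and the predicted label is simply the class with the largest likelihood $P(x \mid c)$.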

For example,

Let's say there exists a 16S rRNA database containing sequences for two closely related species of bacteria: Bacteriabacter wellknown and Bacteriabacter justdiscovered. B. wellknown has 1,000 representative 16S sequences and B. justdiscovered has only 1. If you try to classify an unknown sequence (actually B. justdiscovered), it can end up with a higher probability for B. wellknown than for B. justdiscovered, even though it is more similar to B. justdiscovered, because the prior for B. wellknown is about 1,000 times larger. In this case you would get a more accurate result if you fitted the Bayesian classifier using flat priors.
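
A minimal sketch of that example in plain Python. The likelihood values are made up for illustration (I assume the query is 50 times more likely under B. justdiscovered), and `posterior` is a hypothetical helper, not a taxster function.

```python
def posterior(likelihoods, priors):
    """Normalized P(class | x) from P(x | class) and P(class)."""
    unnormalized = {c: likelihoods[c] * priors[c] for c in likelihoods}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Number of representative sequences per species (from the example above).
counts = {"B. wellknown": 1000, "B. justdiscovered": 1}

# Assumed likelihoods P(x | class): the query fits B. justdiscovered far better.
likelihoods = {"B. wellknown": 0.001, "B. justdiscovered": 0.05}

# Priors estimated from class frequencies vs. flat (uniform) priors.
n_total = sum(counts.values())
empirical_priors = {c: n / n_total for c, n in counts.items()}
flat_priors = {c: 1 / len(counts) for c in counts}

empirical = posterior(likelihoods, empirical_priors)
flat = posterior(likelihoods, flat_priors)

best_empirical = max(empirical, key=empirical.get)
best_flat = max(flat, key=flat.get)

print("frequency-based priors ->", best_empirical)
print("flat priors            ->", best_flat)
```

With these assumed numbers the 1000:1 prior outweighs the 50:1 likelihood ratio, so the frequency-based priors pick B. wellknown while the flat priors pick B. justdiscovered.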

This problem also arises when evaluating Bayesian classifiers.

If you perform cross-validation with a Bayesian classifier and the testing dataset has the same class distribution as the training data, the accuracy is inflated: the priors learned from the training data already match the class frequencies in the test data, so favoring common classes is rewarded. To truly evaluate the Bayesian classifier, you would have to select classes for the testing dataset randomly or through some other distribution.
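
A rough illustration of the inflation, using an extreme stand-in for a prior-dominated classifier: one that always predicts the most common class. The test sets and the `always_majority` function are hypothetical, chosen only to mirror the 1000:1 example.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def always_majority(_sequence):
    """Stand-in classifier: ignores the input and predicts the common class."""
    return "B. wellknown"


# Test set with the same 1000:1 class distribution as the training data.
skewed_test = ["B. wellknown"] * 1000 + ["B. justdiscovered"] * 1
# Test set where both classes are equally represented.
balanced_test = ["B. wellknown"] * 50 + ["B. justdiscovered"] * 50

# The labels double as placeholder inputs, since the classifier ignores them.
skewed_accuracy = accuracy(skewed_test, [always_majority(s) for s in skewed_test])
balanced_accuracy = accuracy(balanced_test, [always_majority(s) for s in balanced_test])

print(f"same distribution as training: {skewed_accuracy:.3f}")
print(f"balanced classes:              {balanced_accuracy:.3f}")
```

The same classifier looks nearly perfect on the skewed test set and no better than a coin flip on the balanced one, which is the gap that matching class distributions can hide.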

Just an idea. I thought I'd write it down in an issue in case I forget to bring it up during the next call.