Tradeshift / blayze

A fast and flexible Naive Bayes implementation for the JVM
MIT License

Log probs of different features override each other #18

Closed liufuyang closed 5 years ago

liufuyang commented 5 years ago

Perhaps we shouldn't add the log probs of the individual features directly together before normalization.

For example, a very small sigma on a Gaussian feature produces a very large negative log prob for that feature (before it is added to the prior and then normalized).

See the last two test examples on this https://github.com/Tradeshift/blayze/pull/16

I think perhaps we need to calculate the log prob for each single feature, add it to the prior, normalize over all the outcomes for that feature alone, do the same for every other feature, and then sum up the per-feature probabilities for each outcome and normalize again, as sketched below.
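To make the idea concrete, here is a minimal sketch of what I mean, using a hypothetical helper that is not part of the current blayze API (it also skips the log-sum-exp trick for clarity):

    import kotlin.math.exp

    // Hypothetical helper, not the current blayze API: turn each feature's log probs
    // into a per-feature posterior first, then combine, so no single feature can
    // dominate purely through the magnitude of its log prob.
    fun combinePerFeature(
            priors: Map<String, Double>,
            perFeatureLogProbs: List<Map<String, Double>>
    ): Map<String, Double> {
        // 1) For each feature on its own: multiply by the prior and normalize over outcomes.
        val perFeaturePosteriors = perFeatureLogProbs.map { logProbs ->
            val unnorm = logProbs.mapValues { (outcome, lp) -> exp(lp) * priors.getValue(outcome) }
            val z = unnorm.values.sum()
            unnorm.mapValues { it.value / z }
        }
        // 2) Sum the per-feature posteriors for each outcome, then normalize once more.
        val summed = priors.keys.associateWith { outcome ->
            perFeaturePosteriors.map { it.getValue(outcome) }.sum()
        }
        val z = summed.values.sum()
        return summed.mapValues { it.value / z }
    }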

rasmusbergpalm commented 5 years ago

No, we should not weight the features differently. Bayes' theorem and the laws of probability tell us how to accumulate evidence or information in an unbiased way. If you do anything else, it will be wrong.

A very small variance indicates that this Gaussian feature is very certain, so it should carry a lot of weight. That is actually the beauty of naive Bayes: the features are automatically weighted by how certain they are.
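To see this numerically, here is a tiny standalone sketch (plain Kotlin, independent of blayze) showing how the log density of the same observation explodes in magnitude as sigma shrinks:

    import kotlin.math.PI
    import kotlin.math.ln
    import kotlin.math.pow

    // log N(x; mu, sigma^2)
    fun gaussianLogPdf(x: Double, mu: Double, sigma: Double): Double =
            -0.5 * ln(2 * PI * sigma * sigma) - (x - mu).pow(2) / (2 * sigma * sigma)

    fun main() {
        // The same observation, 40 units from the mean, under increasingly certain Gaussians.
        for (sigma in listOf(10.0, 1.0, 0.1)) {
            println("sigma=$sigma  logpdf=${gaussianLogPdf(140.0, 180.0, sigma)}")
        }
        // roughly -11, -801, -80000: the tighter the Gaussian, the more it contributes to the sum of log probs
    }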

Consider revisiting the document we started writing: https://v1.overleaf.com/15668815tndnymyjggty#/59556533/

liufuyang commented 5 years ago

But don't you think this is very counterintuitive between two features? Please take a look at my test examples: if you have a categorical feature predicting the result is A (with prob 0.99), and a Gaussian feature predicting it is B (with prob 0.99), then putting them together predicts B (with prob 0.99), while evaluating the features separately gives you one A and one B. Isn't that rather un-naive?

liufuyang commented 5 years ago

Just as the test I wrote describes: if you train a model with the following data/evidence

Gaussian: 
height: 110 => child
height: 90 => child
height: 179 => adult
height: 181 => adult

Category:
have_driver_license: no => child
have_driver_license: yes => adult

Then when you test with just height, the model gives child:

height: 140 => will output child with prob close to 1.0, because the child class has a high variance

This is okay.

Then when you test with just have_driver_license, the model gives adult:

have_driver_license: yes  => will output adult with prob also high

This is also okay.

But when you test with the two features together, the model gives child:

height: 140
have_driver_license: yes 

=> will output child, because the small sigma generates a very large negative Gaussian log prob for the adult class (a rough back-of-the-envelope calculation is sketched below)

Do you think this behavior looks good or normal? Does it mean the features height and have_driver_license become dependent on each other?
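For reference, a rough back-of-the-envelope calculation for this example (this is not blayze's exact internals, which may estimate the variance slightly differently, but it reproduces the order of magnitude):

    import kotlin.math.PI
    import kotlin.math.ln
    import kotlin.math.pow

    // log N(x; mu, sigma^2), parameterized by the variance
    fun gaussianLogPdf(x: Double, mu: Double, variance: Double): Double =
            -0.5 * ln(2 * PI * variance) - (x - mu).pow(2) / (2 * variance)

    fun main() {
        // child heights 90, 110  -> mean 100, sample variance 200
        // adult heights 179, 181 -> mean 180, sample variance 2
        val childLogLik = gaussianLogPdf(140.0, 100.0, 200.0)  // roughly -7.6
        val adultLogLik = gaussianLogPdf(140.0, 180.0, 2.0)    // roughly -401
        println("child=$childLogLik adult=$adultLogLik")
        // The ~400 nat gap dwarfs the few nats that have_driver_license can
        // contribute, so the combined prediction is child no matter what.
    }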

rasmusbergpalm commented 5 years ago

No I think it makes perfect sense.

In the last example, categorical vs Gaussian, you've added the "warm"/"cold" features twice. With a pseudo count of 1.0 (the default), that means that, all other things being equal, there should be a 2/3 chance of t-shirt in warm weather and a 2/3 chance of sweater in cold weather.

The Gaussian feature, on the other hand, has seen 19 and 21, and -21 and -19. They are tightly centered around ±20 degrees.

[plot from the Colab notebook: https://colab.research.google.com/drive/1Ynd6dCuMdisXOs67uLHU25b6HtXiCLBb#scrollTo=2mKn-5bzFZTq]

At 20 degrees, there's almost zero probability of it being sweater weather according to the Gaussian feature. So it is much more certain than the categorical feature, which has a 1/3 probability of t-shirt weather in "cold".

Now, you can argue about what the pseudo count should be. As you move it towards zero, the categorical feature will become more and more certain, eventually becoming more certain than the Gaussian one.
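To illustrate, here is a standalone sketch of textbook Laplace smoothing on the have_driver_license counts from the example above; this is my assumption of what the pseudo count does, and blayze's internal formula may differ:

    // Training counts for have_driver_license: child -> {no: 2}, adult -> {yes: 2}.
    // Textbook Laplace smoothing with a pseudo count (assumed to mirror blayze's behaviour).
    fun categoricalPosterior(pseudoCount: Double): Map<String, Double> {
        val numValues = 2  // "yes" and "no"
        val pYesGivenChild = (0 + pseudoCount) / (2 + pseudoCount * numValues)
        val pYesGivenAdult = (2 + pseudoCount) / (2 + pseudoCount * numValues)
        val z = pYesGivenChild + pYesGivenAdult  // equal class priors cancel
        return mapOf("child" to pYesGivenChild / z, "adult" to pYesGivenAdult / z)
    }

    fun main() {
        println(categoricalPosterior(1.0))   // {child=0.25, adult=0.75}
        println(categoricalPosterior(0.1))   // roughly {child=0.05, adult=0.95}
        println(categoricalPosterior(0.01))  // roughly {child=0.005, adult=0.995}
    }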

liufuyang commented 5 years ago

@rasmusbergpalm see my comment above, or change the temperature input used for prediction in the case above to 0.1 or something. My input of 20 degrees was set up wrongly; it was not my intention. Let me update the test code soon.

rasmusbergpalm commented 5 years ago

Please don't change the premise. Do you understand why it makes sense for the original data you gave?

Feel free to play around with the colab notebook: https://colab.research.google.com/drive/1Ynd6dCuMdisXOs67uLHU25b6HtXiCLBb

liufuyang commented 5 years ago

Yeah, I know that distribution plot. It's just that my sweater example happened by chance to fit the situation where the Gaussian feature is more useful. Then what about my driver license case above? I can put some code here as well to show you that our current setup always diminishes the categorical feature's weight.

liufuyang commented 5 years ago

How about this test: don't you think a potentially very useful categorical feature is not taken into account at all?

    @Test
    fun different_features_should_be_weighted_equally_by_default_test_gaussian2() {
        val model = Model().batchAdd(
                listOf(
                        Update(Inputs(
                                categorical = mapOf(Pair("have driverlicense", "no")),
                                gaussian = mapOf(Pair("height", 95.0))),
                                "child"),
                        Update(Inputs(
                                categorical = mapOf(Pair("have driverlicense", "no")),
                                gaussian = mapOf(Pair("height", 105.0))),
                                "child")

                )).batchAdd(
                listOf(
                        Update(Inputs(
                                categorical = mapOf(Pair("have driverlicense", "yes")),
                                gaussian = mapOf(Pair("height", 179.0))),
                                "adult"),
                        Update(Inputs(
                                categorical = mapOf(Pair("have driverlicense", "yes")),
                                gaussian = mapOf(Pair("height", 181.0))),
                                "adult")
                )
        )

        var predictions = model.predict(Inputs(
                categorical = mapOf(Pair("have driverlicense", "yes"))
        ))
        println(predictions)
        // {child=0.25, adult=0.7499999999999999}

        predictions = model.predict(Inputs(
                categorical = mapOf(Pair("have driverlicense", "yes")),
                gaussian = mapOf(Pair("height", 140.0))
        ))
        println(predictions)
        // {child=1.0, adult=2.552761305287827E-166}

        assertEquals(0.5, predictions["child"]!!, 2e-1) // fail here...
        assertEquals(0.5, predictions["adult"]!!, 2e-1)

    }
liufuyang commented 5 years ago

The interesting part is that when you predict with only one feature, you get different predictions, but when you put the features together, the prediction is dominated by the Gaussian feature.

I think this will make the categorical feature basically useless when combined with multinomial or Gaussian features.

Think of having a multinomial and a categorical feature at the same time: say the multinomial feature is a big text with lots of words; then the categorical feature will sadly be meaningless, since it is basically just treated as a count of 1 on a single word.

I feel that in some cases people might want to weight those features equally (or maybe even make the weights tunable later), especially when a categorical feature is a strong feature. If we could do something about this, they wouldn't have to call our model with one feature at a time and then normalize again on their side. One possible direction is sketched below.
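For example, one way to make this tunable (not something blayze supports as far as this issue goes, just a sketch) would be to scale each feature's log likelihood by a weight before summing; weight 1.0 everywhere recovers standard naive Bayes:

    import kotlin.math.exp

    // Sketch only: combine per-feature log likelihoods with tunable weights.
    fun weightedCombine(
            logPriors: Map<String, Double>,
            featureLogLiks: List<Map<String, Double>>,
            weights: List<Double>
    ): Map<String, Double> {
        val joint = logPriors.mapValues { (outcome, logPrior) ->
            logPrior + featureLogLiks.indices.map { i -> weights[i] * featureLogLiks[i].getValue(outcome) }.sum()
        }
        // normalize via the log-sum-exp trick for numerical stability
        val maxLog = joint.values.fold(Double.NEGATIVE_INFINITY) { a, b -> maxOf(a, b) }
        val unnorm = joint.mapValues { exp(it.value - maxLog) }
        val z = unnorm.values.sum()
        return unnorm.mapValues { it.value / z }
    }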

rasmusbergpalm commented 5 years ago

Did you read my comment on the pseudo count parameter?

liufuyang commented 5 years ago

Closing, as the new version helps a lot in situations like this. Will later try to evaluate the model again with a dataset like https://archive.ics.uci.edu/ml/datasets/Adult.
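When that evaluation happens, it could look roughly like the sketch below. Everything dataset-specific here is an assumption (local file name, column indices, comma separator); only the Model/Inputs/Update/predict calls are the ones used in the test above:

    import java.io.File

    fun main() {
        // Assumed local copy of the UCI Adult data as comma-separated rows, label in the last column.
        val rows = File("adult.csv").readLines().filter { it.isNotBlank() }.map { it.split(",").map(String::trim) }
        val (train, test) = rows.partition { it.hashCode() % 5 != 0 }  // crude 80/20 split

        // Placeholder feature columns: age(0), occupation(6), sex(9), hours-per-week(12).
        fun toInputs(r: List<String>) = Inputs(
                categorical = mapOf("occupation" to r[6], "sex" to r[9]),
                gaussian = mapOf("age" to r[0].toDouble(), "hours-per-week" to r[12].toDouble()))

        val model = Model().batchAdd(train.map { Update(toInputs(it), it.last()) })

        val accuracy = test.count { r ->
            val pred = model.predict(toInputs(r)).entries.sortedByDescending { it.value }.first().key
            pred == r.last()
        }.toDouble() / test.size
        println("accuracy = $accuracy")
    }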