chrischoy / fully-differentiable-deep-ndf-tf

Fully differentiable deep-neural decision forest in tensorflow
MIT License

Note that this modification breaks the decision tree formulation. #1

Closed yuvval closed 7 years ago

yuvval commented 7 years ago

There was a reason for the authors to use the alternating updates.

The modification suggested here breaks the tree formulation and renders it as just "some" deep-net architecture. With this modification, W_{leaf} does not have a probabilistic interpretation, which means the learned weights might not behave nicely as decision nodes.

chrischoy commented 7 years ago

The proposed method is probabilistic like the original paper and does not break the tree formulation. W_{leaf} is passed through a softmax, so it always lies on the simplex, and the decision nodes follow the original paper. Please refer to the slides for more details.

yuvval commented 7 years ago

In the paper, W_leaf has the interpretation of Pr(y|leaf, x), and then Pr(y|x) = E_{leaf|x}[Pr(y|leaf, x)]. The modification you suggested discards this notion and tries to approximate Pr(y|x) by softmax(W_leaf.' * leaf|x). However, using a softmax does not guarantee that a single layer of W_leaf will learn to produce the true Pr(y|x); instead, you might need a multi-layer net just to estimate the true Pr(y|x) given (leaf, x).

I can suggest a simple alternative if you don't want to use the update rule given by the authors.

You can remove the softmax, and replace the gradient update of W_leaf with:

    # Take a gradient step on W_leaf (compute_gradients returns the raw gradient)
    leaf_grad = optimizer(*optimizer_params).compute_gradients(loss, var_list=[W_leaf])[0][0]
    # learning_rate: the step size for the projected update
    updated = W_leaf - learning_rate * leaf_grad
    # Project back onto the simplex: clip to >= 0 ...
    updated = tf.maximum(updated, 0)
    # ... then normalize each row to sum to 1
    epsilon = 1e-7
    updated = updated / (epsilon + tf.reduce_sum(updated, axis=1, keep_dims=True))
    leaf_update_op = tf.assign(W_leaf, updated)

For your loss, you can then use keras.losses.categorical_crossentropy (instead of the softmax cross entropy).
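The clip-and-renormalize projection in the update above can be checked numerically. Here is a minimal NumPy sketch (the matrix values are made up for illustration) showing that clipping to nonnegative values and renormalizing rows puts W_leaf back on the simplex:

```python
import numpy as np

# Hypothetical post-gradient-step values for a 3-leaf, 4-class W_leaf;
# some entries have gone negative after the raw gradient step.
updated = np.array([[ 0.5, -0.1,  0.4,  0.3],
                    [-0.2,  0.6,  0.1,  0.2],
                    [ 0.3,  0.3,  0.2,  0.1]])

# Project to the simplex: clip to >= 0, then normalize each row to sum to 1.
epsilon = 1e-7
updated = np.maximum(updated, 0.0)
updated = updated / (epsilon + updated.sum(axis=1, keepdims=True))

# Every row is now a valid probability distribution over classes.
assert (updated >= 0).all()
assert np.allclose(updated.sum(axis=1), 1.0, atol=1e-5)
```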

chrischoy commented 7 years ago

The implementation I proposed in this repository does not discard any of the interpretation in the original paper.

First, I'll define the notation, since there is no such thing as W_l in the original paper.

In Eq. 1 of the paper, $\pi$ is the leaf node probability distribution (any distribution of the right dimension), and $W_l$ is a parameter I introduced in this repository, which parametrizes the leaf distribution $\pi$. $W_l$ can be any vector, and if you take a softmax of $W_l$, you always get a valid probability distribution. So instead of using a bare probability distribution, I used a parametric probability distribution with the same degrees of freedom as the bare one. That is the only change I made to the original formulation, and it does not break any interpretation.
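The point that a softmax always yields a valid distribution can be seen with a tiny NumPy check: whatever unconstrained values $W_l$ holds (the ones below are arbitrary), softmax($W_l$) is nonnegative and sums to one:

```python
import numpy as np

def softmax(w):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(w - w.max())
    return e / e.sum()

# An arbitrary unconstrained parameter vector for one leaf node.
W_l = np.array([2.0, -1.3, 0.0, 5.7])

pi = softmax(W_l)                 # the leaf distribution pi parametrized by W_l
assert (pi > 0).all()             # strictly positive
assert np.isclose(pi.sum(), 1.0)  # sums to one: a point on the simplex
```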

This parametrization of $\pi$, however, allows us to train $\pi$ jointly with everything else, and it replaces the alternating optimization step involving Eq. 11 with a gradient descent step.

This is justifiable because the original formulation does not involve latent variables. To elaborate, alternating optimization is necessary when you have a latent variable that depends on the other variables and admits a good closed-form update. Well-known examples of such alternating optimization with closed-form solutions are the EM steps in graphical model inference and ADMM. However, there is a class of problems that can be solved by joint optimization: neural networks. If you think of all the layer parameters in a neural network as latent variables, you need layer-wise alternating optimization, as many people did for RBMs. However, we can jointly optimize all parameters simply because the gradient computation is fast and cheap. The same idea applies to the Neural Decision Forest.

I think you misunderstood the implementation or my slides. If you can elaborate on why you think I'm trying to approximate Pr(y|x) by softmax(W_leaf.' * leaf|x), even though I'm not approximating anything, I'd be happy to clarify the misconception.

TL;DR: This is the exact formulation following Eq. 1, and the differences proposed in this repository are: 1. use a parametric function for $\pi$ ($\pi = softmax(W)$ where $W$ is a vector), and 2. update $\pi$ using stochastic gradient descent instead of Eq. 11, thus removing the alternating optimization.
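The two changes can be sketched end to end in NumPy (shapes and values below are made up for illustration; a real run would use the TensorFlow graph in this repo): parametrize each leaf distribution as a softmax of an unconstrained row of $W$, then take plain gradient steps on $W$:

```python
import numpy as np

def softmax_rows(W):
    # Row-wise softmax: each row of W parametrizes one leaf distribution.
    e = np.exp(W - W.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

L, Y = 4, 3                           # 4 leaves, 3 classes (made-up sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(L, Y))           # unconstrained parameters, trained by SGD
mu = np.array([0.1, 0.2, 0.3, 0.4])   # routing probabilities Pr(leaf | x)

Pi = softmax_rows(W)                  # change 1: pi = softmax(W), always valid
p = Pi.T @ mu                         # Eq. 1: Pr(Y | x) = sum_l pi_l * mu_l
assert np.isclose(p.sum(), 1.0)
# Change 2: W is updated by ordinary SGD on the cross-entropy loss,
# e.g. W -= lr * dLoss/dW, with no alternating step and no projection.
```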

yuvval commented 7 years ago

I think I disagree with you about the notation. The authors' notation in the paper is confusing. I will try to clarify the notation as I understand it:

  1. $\pi_{ly}$ means Pr(Y=y | leaf=l); this is a matrix whose parameters are learned according to the update rule. In my previous comment I also denoted it W_leaf (I will explain why below)

  2. $\mu_l(x|\theta)$ is actually Pr_{\theta}(leaf=l | x). This is the output of the product of decision sigmoids

As in Eq. (1): Pr(Y=y|x) = \sum_l Pr(Y=y | leaf=l) * Pr_{\theta}(leaf=l | x) = \sum_l \pi_{ly} \mu_l(x|\theta), i.e., the dot product of $\pi$ with Pr_{\theta}(leaf | x). This notation ensures that the mapping from leaves to Pr(y|x) models the exact "true" probability of mapping leaves to classes.
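In matrix form, with $\mu$ the vector of routing probabilities and $\pi$ stacked as an L-by-Y matrix, Eq. (1) is just a matrix-vector product. A NumPy sketch with made-up values:

```python
import numpy as np

# Routing probabilities Pr(leaf = l | x): one entry per leaf, sums to 1.
mu = np.array([0.1, 0.6, 0.3])

# Leaf distributions Pi[l, y] = Pr(Y = y | leaf = l): each row sums to 1.
Pi = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.2, 0.2, 0.6]])

# Eq. (1): Pr(Y = y | x) = sum_l Pi[l, y] * mu[l]  ==  Pi^T mu
p = Pi.T @ mu  # -> [0.19, 0.56, 0.25]
assert np.isclose(p.sum(), 1.0)  # a mixture of distributions is a distribution
```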

Now note the following: since $\mu$ = Pr_{\theta}(leaf | x) is the vector of outputs of the (products of) decision sigmoids, Pr(Y=y|x) is actually the output of a fully connected layer that maps Pr_{\theta}(leaf=l | x) through $\pi$ (this is why I denoted it W_leaf). Therefore, you can take gradient steps directly on $\pi$, without needing the softmax. The only thing you need to maintain is projecting $\pi$ onto the simplex (entries in [0,1]) and making sure its rows sum to 1.

Now I'll try to connect this to my previous comment. In the modification you suggested, Pr(Y=y|x) = softmax( dotproduct(W_leaf, Pr_{\theta}(leaf | x)) ). Although this produces a vector that sums to 1, it does not correctly represent the mapping from leaves to Pr(y|x), which is Pr(Y=y|x) = \sum_l Pr(Y=y | leaf=l) * Pr_{\theta}(leaf=l | x).
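To make the distinction concrete, here is a NumPy comparison (with made-up values) of the mixture in Eq. (1) against applying a softmax after the dot product; in general the two are different functions of the routing probabilities:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

mu = np.array([0.1, 0.6, 0.3])           # Pr(leaf | x)
Pi = np.array([[0.7, 0.2, 0.1],          # Pr(Y | leaf), one row per leaf
               [0.1, 0.8, 0.1],
               [0.2, 0.2, 0.6]])

mixture = Pi.T @ mu                      # Eq. (1): sum_l Pi[l] * mu[l]
squashed = softmax(Pi.T @ mu)            # softmax applied after the dot product

# Both sum to 1, but they are different distributions in general.
assert np.isclose(mixture.sum(), 1.0)
assert np.isclose(squashed.sum(), 1.0)
assert not np.allclose(mixture, squashed)
```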

I am not sure why the authors chose their special "alternating" update instead of taking a gradient step and projecting $\pi$ onto the simplex. I will soon test this on my real data and will update you if I reach any conclusion. Currently, both update approaches work for me on the small-scale dataset I use.

Cheers

chrischoy commented 7 years ago

Hmm. I'm not sure how you can disagree with the notation defined by the authors. To me, the notation is very clear, and I don't agree that "the authors' notation in the paper is confusing."

  1. The mathematical notation does not necessarily have to follow the convenience of the implementation. I implemented the terminal node probability as a matrix out of convenience, but that is not a necessary condition. You could implement it as a list of vectors, which would be exactly what the notation in the paper proposes.

  2. Yes, that is the routing probability.

Yes: regarding "Therefore, you can take gradient steps directly on \pi, without the need to use the softmax. The only thing you need to maintain is projecting \pi to the simplex ([0,1]) and making sure its rows sum to 1", that is the main idea of what I proposed in this repository.

Regarding "In the modification you suggested, Pr(Y=y|x) = softmax( dotproduct(W_leaf, Pr_{\theta}(leaf=l | x)) )": no, I'm not doing that. See the line https://github.com/chrischoy/fully-differentiable-deep-ndf-tf/blob/github/demo_fully_diff_ndf.py#L251.

I finally understand what you are trying to say, and the answer is no, I'm not breaking the original formulation. I am closing the issue as invalid.