EvgeniDubov / hellinger-distance-criterion

Random Forest model using Hellinger Distance as split criterion
BSD 3-Clause "New" or "Revised" License

Multiple Outputs case probably getting overlooked #4

Open harish1996 opened 5 years ago

harish1996 commented 5 years ago

In the children_impurity part of the code:

        for k in range(self.n_outputs):
            if(sum_left[0] + sum_right[0] > 0):
                count_k1 = sqrt(sum_left[0] / (sum_left[0] + sum_right[0]))
            if(sum_left[1] + sum_right[1] > 0):
                count_k2 = sqrt(sum_left[1] / (sum_left[1] + sum_right[1]))

            hellinger_left += pow((count_k1  - count_k2),2)

            if(sum_left[0] + sum_right[0] > 0):    
                count_k1 = sqrt(sum_right[0] / (sum_left[0] + sum_right[0]))
            if(sum_left[1] + sum_right[1] > 0):
                count_k2 = sqrt(sum_right[1] / (sum_left[1] + sum_right[1]))

            hellinger_right += pow((count_k1 - count_k2),2)

The for loop above iterates over the outputs, but its body never uses k and always reads indices 0 and 1 of sum_left and sum_right, so it is probably computing the same value on every iteration.
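To make the concern concrete, here is a minimal Python sketch that mirrors the left-child half of the quoted loop. It is not the repository's Cython code: the function name, the flat-list layout (two class counts per output, outputs stored back to back) and the 0.0 initial values for count_k1 and count_k2 are assumptions made for illustration.

```python
from math import sqrt


def hellinger_left_as_quoted(sum_left, sum_right, n_outputs):
    """Mirror the quoted loop: k is never used, indices 0 and 1 are read each pass."""
    hellinger_left = 0.0
    for k in range(n_outputs):
        # Assumed defaults; the quoted snippet does not show the initialisation.
        count_k1 = 0.0
        count_k2 = 0.0
        if sum_left[0] + sum_right[0] > 0:
            count_k1 = sqrt(sum_left[0] / (sum_left[0] + sum_right[0]))
        if sum_left[1] + sum_right[1] > 0:
            count_k2 = sqrt(sum_left[1] / (sum_left[1] + sum_right[1]))
        hellinger_left += (count_k1 - count_k2) ** 2
    return hellinger_left


# Hypothetical class counts for two outputs, two classes each.
left = [30.0, 10.0, 5.0, 20.0]
right = [10.0, 30.0, 25.0, 10.0]

one_output = hellinger_left_as_quoted(left, right, 1)
two_outputs = hellinger_left_as_quoted(left, right, 2)

# The second output's counts never enter the result: it is just the
# first output's term added twice.
print(two_outputs == 2 * one_output)
```

Changing the last two entries of `left` and `right` leaves `two_outputs` unchanged, which is the symptom described here.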

In the entropy criterion, I found this piece of code, which moves sum_right and sum_left on to the next set of outputs:

            sum_left += self.sum_stride
            sum_right += self.sum_stride

I believe the same step is necessary at the end of this for loop, to advance sum_left and sum_right to the next output column's sums.
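As a rough illustration of the suggested change, the sketch below replaces the pointer increment with an integer offset into flat lists. It is plain Python rather than the project's Cython, and it assumes a binary problem where each output owns `sum_stride` consecutive slots; the function name and the sample counts are hypothetical.

```python
from math import sqrt


def hellinger_left_per_output(sum_left, sum_right, n_outputs, sum_stride):
    """Accumulate the left-child term once per output by stepping through the buffers."""
    hellinger_left = 0.0
    offset = 0  # stands in for the sum_left / sum_right pointers
    for k in range(n_outputs):
        l0, l1 = sum_left[offset], sum_left[offset + 1]
        r0, r1 = sum_right[offset], sum_right[offset + 1]
        count_k1 = sqrt(l0 / (l0 + r0)) if l0 + r0 > 0 else 0.0
        count_k2 = sqrt(l1 / (l1 + r1)) if l1 + r1 > 0 else 0.0
        hellinger_left += (count_k1 - count_k2) ** 2
        # Analogue of `sum_left += self.sum_stride` and `sum_right += self.sum_stride`.
        offset += sum_stride
    return hellinger_left


# Hypothetical class counts for two outputs, two classes each.
left = [30.0, 10.0, 5.0, 20.0]
right = [10.0, 30.0, 25.0, 10.0]

total = hellinger_left_per_output(left, right, n_outputs=2, sum_stride=2)
print(total)
```

With the offset advanced, the second pass reads the counts at positions 2 and 3, so each output contributes its own term instead of repeating the first one.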