Importance is slightly biased towards last variables.

piotrszul commented 5 years ago

When several variables give an equal reduction in impurity, the procedure for selecting the split variable is slightly biased towards variables with larger indexes. In the previous, non-reproducible approach this was caused by an increased probability of selecting later variables. In the current one it is probably caused by insufficient randomness when XOR is used as the hashing function. The solution is to use a better hashing function to generate the surrogate order, and to vary that order not only per batch and partition but also for every split. Murmur3 hashing seems to be a good candidate. Here is the code snippet:

Murmur_Snippet.txt
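
For readers without access to the attachment, here is a minimal sketch of the idea in Python. It is not the contents of Murmur_Snippet.txt and not the VariantSpark implementation. It assumes that ties are broken by a per-split surrogate order, and it uses only the 32-bit finalizer step of Murmur3 (commonly called fmix32) rather than the full hash. The names `surrogate_key`, `pick_split_variable` and `split_seed` are hypothetical and chosen for illustration only.

```python
MASK32 = 0xFFFFFFFF


def fmix32(h: int) -> int:
    """Murmur3 32-bit finalizer: mixes the bits of a 32-bit integer."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def surrogate_key(var_index: int, split_seed: int) -> int:
    """Hypothetical surrogate order key for one variable at one split.

    The variable index is mixed first, combined with the seed, and mixed
    again, so the resulting order is not a simple bit-flip of the index
    order (which is what a bare `var_index ^ split_seed` would give).
    """
    return fmix32(fmix32(var_index) ^ (split_seed & MASK32))


def pick_split_variable(candidates, split_seed: int) -> int:
    """Pick the variable with the largest impurity reduction.

    `candidates` is a list of (var_index, impurity_reduction) pairs.
    Ties are broken by the smallest surrogate key, so the choice is
    reproducible for a given seed but changes from split to split.
    """
    best = max(gain for _, gain in candidates)
    tied = [idx for idx, gain in candidates if gain == best]
    return min(tied, key=lambda idx: (surrogate_key(idx, split_seed), idx))


if __name__ == "__main__":
    # Variables 3, 7 and 9 tie on impurity reduction; variable 1 does not.
    candidates = [(1, 0.10), (3, 0.25), (7, 0.25), (9, 0.25)]
    for seed in range(5):
        print(seed, pick_split_variable(candidates, seed))
```

Deriving `split_seed` from the batch, the partition and the split itself (an assumption here; the issue only says the order should vary at all three levels) keeps the selection reproducible while giving each split its own tie-breaking order.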

piotrszul commented 5 years ago

Here is some interesting information on the randomness of various hashing algorithms: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

rocreguant commented 8 months ago

This seems to have been implemented: https://github.com/aehrc/VariantSpark/commit/19509549fd18e581e3dddea56d52e7f420117157

aehrc / VariantSpark

Importance is slightly biased towards last variables. #107