Closed: piotrszul closed this 8 months ago
Here is some interesting information on the randomness of various hashing algorithms: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
It seems to be implemented: https://github.com/aehrc/VariantSpark/commit/19509549fd18e581e3dddea56d52e7f420117157
The procedure for selecting split variables when several give an equal reduction in impurity is slightly biased towards variables with larger indexes. In the previous non-reproducible approach this was caused by the increased probability of selecting later variables. In the current one it is probably caused by insufficient randomness when XOR is used as the hashing function. The solution is to use a better hashing function to generate a surrogate order, and to vary it not only per batch and partition but also for every split. Murmur3 hashing seems to be a good candidate. Here is the code snippet:
Murmur_Snippet.txt
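For readers without access to the attachment, here is a minimal sketch of the idea in Scala. It is not the attached snippet and not necessarily what the commit does. The names `SplitTieBreak`, `surrogateKey`, `pickAmongTies`, `seed` and `splitId` are hypothetical; only `scala.util.hashing.MurmurHash3` comes from the Scala standard library. The sketch assumes each split has an integer id and that the seed already encodes batch and partition.

```scala
import scala.util.hashing.MurmurHash3

object SplitTieBreak {

  // Hypothetical helper: derives a surrogate ordering key for a variable
  // by mixing the seed, the split id and the variable index with Murmur3.
  // Changing splitId changes the key, so the order varies for every split.
  def surrogateKey(seed: Int, splitId: Int, variableIndex: Int): Int = {
    val h1 = MurmurHash3.mix(seed, splitId)
    val h2 = MurmurHash3.mixLast(h1, variableIndex)
    MurmurHash3.finalizeHash(h2, 2) // 2 = number of values mixed in
  }

  // Among variables tied on impurity reduction, pick the one with the
  // smallest surrogate key. The variable index is only a final tie-break
  // for the case of equal keys, so the result stays reproducible.
  def pickAmongTies(seed: Int, splitId: Int, tied: Seq[Int]): Int = {
    require(tied.nonEmpty, "need at least one candidate variable")
    tied.minBy(v => (surrogateKey(seed, splitId, v), v))
  }
}
```

Because the key depends only on `(seed, splitId, variableIndex)`, the choice is reproducible and does not depend on the order in which tied variables are encountered, e.g. `SplitTieBreak.pickAmongTies(42, 7, Seq(3, 10, 25))` returns the same variable as the call with `Seq(25, 10, 3)`.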