aehrc / VariantSpark

machine learning for genomic variants
http://bioinformatics.csiro.au/variantspark

Importance is slightly biased towards last variables. #107

Closed: piotrszul closed this issue 8 months ago

piotrszul commented 5 years ago

The procedure for selecting split variables when several give an equal reduction in impurity is slightly biased towards variables with larger indexes. In the previous non-reproducible approach it was caused by the increased probability of selecting later variables. In the current one it is probably caused by insufficient randomness from using XOR as the hashing function. The solution is to use a better hashing function to generate a surrogate order and to vary it not only per batch and partition but also for every split. Murmur3 hashing seems to be a good candidate. Here is the code snippet:

Murmur_Snippet.txt
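
For readers without the attachment, the idea can be illustrated with a minimal Python sketch. It is not the content of Murmur_Snippet.txt and not VariantSpark code. All names (`fmix32`, `xor_key`, `murmur_key`, `pick_tied`, `win_counts`) are hypothetical, the rule "the tied variable with the largest key wins" is an assumption, and only the 32-bit finalizer of MurmurHash3 is used rather than the full hash. It shows one way an XOR-based surrogate order can favour the last variable, and how mixing the seed, split id and variable index through a stronger hash changes that.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def fmix32(h: int) -> int:
    """32-bit finalizer (avalanche step) of MurmurHash3."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def xor_key(var_index: int, seed: int, split_id: int = 0) -> int:
    """Surrogate order from plain XOR; split_id is ignored (assumed baseline)."""
    return (var_index ^ seed) & MASK32


def murmur_key(var_index: int, seed: int, split_id: int = 0) -> int:
    """Surrogate order that mixes seed, split id and variable index.

    The way the three values are combined here is an assumption made
    for illustration only.
    """
    return fmix32(var_index ^ fmix32(seed ^ fmix32(split_id)))


def pick_tied(tied, key_fn, seed: int, split_id: int = 0) -> int:
    """Among variables with equal impurity reduction, pick the largest key."""
    return max(tied, key=lambda v: key_fn(v, seed, split_id))


def win_counts(key_fn, n_vars: int = 5, n_seeds: int = 1024):
    """Count how often each of n_vars tied variables wins over n_seeds seeds."""
    tied = range(n_vars)
    wins = Counter(pick_tied(tied, key_fn, seed) for seed in range(n_seeds))
    return [wins[v] for v in tied]


if __name__ == "__main__":
    print("xor   :", win_counts(xor_key))
    print("murmur:", win_counts(murmur_key))
```

With five tied variables, `win_counts(xor_key)` gives `[128, 128, 128, 128, 512]`: variable 4 is the only one with bit 2 set, so it wins whenever that bit of the seed is 0, which is half of all seeds. The hashed key has no such dependence on the bit pattern of the index, and passing a different `split_id` changes the order for every split.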

piotrszul commented 5 years ago

Here is some interesting information on the randomness of various hashing algorithms: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

rocreguant commented 8 months ago

It seems to be implemented: https://github.com/aehrc/VariantSpark/commit/19509549fd18e581e3dddea56d52e7f420117157