piotrszul opened this issue 5 years ago
This can be observed, for example, on the sparse synthetic datasets such as src/test/data/synth/synth_2000_500_fact_10_0.995-wide.csv.
The reason seems to be that the very sparse data results in very deep and unbalanced trees (for example 104 levels rather than 8 for dense data). Because of the sparsity there is almost always a split that separates the zeros from a few non-zero values. This split is usually very uneven (that is, the zero side has significantly more elements). At the next level the zero side is likely to be split again in the same manner on a different variable. The result is a progression of one-sided splits that cut off a small portion of the non-zero samples at each level, going very deep.
For example (numbers are sample-set sizes at each level, with the final splits shown as separate brackets): [1000] [995,5] [990,5][5] [985,5][5][5] ....
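A rough sketch of the effect (not actual VariantSpark code; the object and parameter names are made up for illustration) — if each split only peels off a handful of non-zero samples, the depth grows roughly linearly with the number of non-zero samples rather than logarithmically with the total sample count:

```scala
// Simulate the one-sided split progression described above: at every level the
// large "zero" side keeps almost everything and only a few non-zero samples
// are cut off, so the tree keeps growing deeper.
object SparseSplitDepth {
  def depth(totalSamples: Int, nonZeroPerSplit: Int, minNodeSize: Int): Int = {
    var remaining = totalSamples
    var levels = 0
    while (remaining > minNodeSize && remaining > nonZeroPerSplit) {
      remaining -= nonZeroPerSplit // zero side retains nearly all samples
      levels += 1
    }
    levels
  }

  def main(args: Array[String]): Unit = {
    // 1000 samples, ~5 non-zero values split off per level, stop at nodes of 10
    println(depth(1000, 5, 10)) // ~198 levels: very deep, unbalanced tree
    // dense data with roughly balanced splits would need only ~log2(1000) ≈ 10 levels
  }
}
```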
I think this may well be the case for genomic variant data (as it is very likely to be sparse, especially if not filtered for MAF).
I am not sure what the impact on the importance is, but it definitely impacts runtime performance adversely. It may be beneficial to consider limiting the depth of the tree, possibly together with a minimum Gini gain required to accept a split.
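Something along these lines is what I have in mind (a hypothetical sketch only; maxDepth and minGiniGain are assumed names, not existing options):

```scala
object SplitCriteria {
  final case class SplitCandidate(giniGain: Double)

  // Reject a split once the node is too deep or the impurity decrease is
  // below the threshold; both criteria cut off the long tail of near-useless
  // one-sided splits on sparse data.
  def shouldSplit(candidate: SplitCandidate, depth: Int,
                  maxDepth: Int = 50, minGiniGain: Double = 1e-3): Boolean =
    depth < maxDepth && candidate.giniGain >= minGiniGain
}
```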
This is noticeable when comparing runtimes on sparse vs. dense synthetic regression datasets: the sparse ones run much slower, although intuitively they should run faster.