david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

some-versus-many distance computation #1

Closed zkurtz closed 4 years ago

zkurtz commented 4 years ago

predict_distance currently appears to support only returning all pairwise distances, which is O(n^2) and does not scale. Could you add an option to pass in two data frames (X: n x k, Y: m x k) instead of one, so that the returned distances have dimensions n x m? Even a one-versus-all option would be very useful.
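To put rough numbers on the scaling concern, here is a minimal sketch that compares how many distances each approach would have to produce. The helper names are hypothetical and not part of isotree, and it assumes the all-pairs output stores each unordered pair once.

```python
def pairwise_count(n_rows: int) -> int:
    """Number of unordered pairs among n_rows observations (assumed all-pairs output size)."""
    return n_rows * (n_rows - 1) // 2


def cross_count(n: int, m: int) -> int:
    """Number of distances in an n x m some-versus-many result."""
    return n * m


# Workaround today: stack X (n rows) and Y (m rows), compute all pairs, keep the X-Y block.
n, m = 100_000, 1  # one-versus-all case
wasted = pairwise_count(n + m) - cross_count(n, m)  # pairs computed but then discarded
```

In the one-versus-all case almost every computed distance is discarded, which is why an n x m option matters.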

david-cortes commented 4 years ago

I definitely plan to add it in a future version, but it won’t happen soon (at least not this month).

If you really need it, the way to add it would be to modify the C++ increase_comb_counter functions called from traverse_tree_sim and traverse_hplane_sim. Basically, you would need to add a new increase_comb_counter that iterates only over the desired combinations instead of over all pairs. The groups would have to be identified by whether the row indices in ix_arr are above or below some number (e.g. the X rows having the earlier numbers and the Y rows having the later ones). You would also need a new ‘if’ condition that returns from the *_sim functions when only observations from one group are left.
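As a rough illustration of that idea (written in Python rather than C++, and not the actual isotree code), the sketch below updates a counter only for X-Y pairs at a tree node. The function name, its signature, and the flat row-major counter layout are all assumptions made for this example.

```python
from typing import List, Sequence


def increase_cross_counter(
    ix_arr: Sequence[int],   # row indices of the observations that reached this node
    n_x: int,                # rows with index < n_x belong to X, the rest to Y (assumed layout)
    n_y: int,                # number of rows in Y
    counter: List[float],    # flat n_x * n_y array, row-major (hypothetical layout)
    increment: float = 1.0,
) -> bool:
    """Add `increment` for every X-Y pair present at the node.

    Returns False when only one group is present, so the caller can
    stop descending this branch (the extra 'if' condition).
    """
    x_rows = [i for i in ix_arr if i < n_x]
    y_rows = [i - n_x for i in ix_arr if i >= n_x]
    if not x_rows or not y_rows:
        return False
    # Only cross-group combinations are visited, never X-X or Y-Y pairs.
    for i in x_rows:
        for j in y_rows:
            counter[i * n_y + j] += increment
    return True
```

A traversal routine would call this at each node and return early whenever it yields False, since no X-Y pair can appear further down that branch.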

(It’s quite a lot of work though)

david-cortes commented 4 years ago

This is now implemented in the master branch.