david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Sample weight explanation #14

Closed tararae7 closed 3 years ago

tararae7 commented 4 years ago

Hi David, (typo in the title: I meant column weights, not sample weights)

I don't believe there is a full explanation of how the column_weights parameter gets applied in the isotree model. I understand that if I have 5 features I can pass a list to this parameter such as (5,2,3,4,7); in this case my fifth feature has the highest weight, but what does that actually do in the model? Also, the help for this parameter says "Ignored when picking columns by deterministic criterion". How do you pick columns by a deterministic criterion? Is that the extended model? Thank you!

david-cortes commented 4 years ago

It works as follows: when it chooses a column at random, it picks each column with a probability equal to that column's weight divided by the sum of the weights of the remaining splittable columns (i.e. those having at least 2 different values and, if using ndim > 1, not already having been split on).
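To make the rule above concrete, here is a minimal sketch in plain Python. It is not isotree's actual implementation: the function names `pick_probabilities` and `pick_column` and the boolean `splittable` mask are hypothetical and only illustrate "weight divided by the sum of the weights of the remaining splittable columns".

```python
import random

def pick_probabilities(weights, splittable):
    """Probability of picking each column: its weight divided by the
    total weight of the columns that are still splittable.
    Non-splittable columns get probability 0."""
    total = sum(w for w, ok in zip(weights, splittable) if ok)
    if total <= 0:
        raise ValueError("no splittable column with positive weight")
    return [w / total if ok else 0.0 for w, ok in zip(weights, splittable)]

def pick_column(weights, splittable, rng=random):
    """Draw one column index at random according to the rule above."""
    candidates = [i for i, ok in enumerate(splittable) if ok]
    if not candidates:
        raise ValueError("no splittable columns left")
    cand_weights = [weights[i] for i in candidates]
    return rng.choices(candidates, weights=cand_weights, k=1)[0]

weights = [5, 2, 3, 4, 7]

# All five columns splittable: each weight is divided by 5+2+3+4+7 = 21.
all_ok = pick_probabilities(weights, [True] * 5)

# Hypothetical case where the fifth column can no longer be split
# (e.g. it has a single distinct value in this node): divide by 14 instead.
fifth_out = pick_probabilities(weights, [True, True, True, True, False])
```

Under these assumptions, the fifth column is picked with probability 7/21 while it is splittable, and the other columns' probabilities are renormalized once it drops out.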

Columns are picked by a deterministic criterion when passing ndim=1 and using a pooled/averaged gain criterion.

tararae7 commented 4 years ago

Thank you so much for responding, David. I have questions regarding your answer here. If I set weights such as (1,5,1), then from your explanation the second feature would have a probability of being randomly picked of 5/2 = 2.5, i.e. 2.5*100 = 250%. Is that correct? Does that only apply to the first feature split in each tree? Please help me understand. If there is documentation explaining this, please let me know and I can go there.

david-cortes commented 4 years ago

If you pass weights (1,5,1), then the probabilities are: (1/7, 5/7, 1/7).
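The arithmetic can be checked with a few lines of Python. This is only an illustration of the normalization (each weight divided by the sum of all weights, including its own), assuming all three columns are splittable; `column_probabilities` is a hypothetical helper, not an isotree function.

```python
from fractions import Fraction

def column_probabilities(weights):
    """Normalize integer weights into exact probabilities that sum to 1."""
    total = sum(weights)  # includes the column's own weight: 1 + 5 + 1 = 7
    return [Fraction(w, total) for w in weights]

probs = column_probabilities([1, 5, 1])
print(probs)  # [Fraction(1, 7), Fraction(5, 7), Fraction(1, 7)]
```

The denominator is 7 rather than 2 because the second column's own weight is part of the sum, so the result is about 71%, never above 100%.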