Issue with sklearn isolation Forest

bansalism2 commented 4 years ago

Originally, when predicting with sklearn, the scores include values less than 0, while the PMML generated by Nyoka converts everything to positive values. As a result, the two sets of scores are badly mismatched.

Nirmal-Neel commented 4 years ago

Hi @bansalism2 , could you provide some more information with examples?

bansalism2 commented 4 years ago

Hi Nirmal, definitely. The current PMML implementation predicts anomalies from the average anomaly score across all the trees, but the actual sklearn implementation takes more into account when flagging an observation as an anomaly. Here are the lines from the sklearn class that calculate the scores:

    scores = 2 ** (
        -depths
        / (len(self.estimators_) * _average_path_length([self.max_samples_]))
    )

where _average_path_length is in turn calculated as:

    average_path_length[not_mask] = (
        2.0 * (np.log(n_samples_leaf[not_mask] - 1.0) + np.euler_gamma)
        - 2.0 * (n_samples_leaf[not_mask] - 1.0) / n_samples_leaf[not_mask]
    )

Hence the outputs from sklearn will never match the PMML implementation. I think the PMML implementation should follow the same scoring logic as sklearn itself.

P.S. sklearn scores can go negative (because of the offset), while PMML scores never go negative by design.
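For reference, the scoring formula quoted above can be sketched for a single observation in plain Python. This is a minimal illustration, not sklearn's actual code: the function names and the `mean_depth` argument are hypothetical, and `mean_depth` is assumed to be the path length already averaged over all trees (in sklearn, `depths` divided by `len(self.estimators_)`).

```python
import math

# Euler-Mascheroni constant (the same value as np.euler_gamma).
EULER_GAMMA = 0.5772156649015329


def average_path_length(n_samples: int) -> float:
    """Expected path length of an unsuccessful BST search over n_samples points."""
    if n_samples <= 1:
        return 0.0
    if n_samples == 2:
        return 1.0
    return (
        2.0 * (math.log(n_samples - 1.0) + EULER_GAMMA)
        - 2.0 * (n_samples - 1.0) / n_samples
    )


def anomaly_score(mean_depth: float, max_samples: int) -> float:
    """Normalized score in (0, 1]; values near 1 indicate likely anomalies."""
    return 2.0 ** (-mean_depth / average_path_length(max_samples))


# A point whose mean depth equals the expected path length scores exactly 0.5.
c = average_path_length(256)
print(round(c, 4), anomaly_score(c, 256))
```

Because the exponent is never positive, this score always stays in (0, 1], which is why a value computed this way cannot be negative.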

Nirmal-Neel commented 4 years ago

Hi @bansalism2, PMML follows the same scoring procedure as scikit-learn. Please refer to the official PMML documentation for the Anomaly Detection Model, where you will find the anomaly calculation formula that uses the average anomaly score and sampleDataSize. I hope this answers your question.

Note - The average anomaly score that you get from the trees in PMML is not the final result. It always goes through the calculation you have already mentioned.
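To compare the two outputs directly, the sign difference can be accounted for as follows. This is a hedged sketch that assumes the PMML output is the positive normalized score; the helper names are hypothetical. It relies on two relations documented by scikit-learn: `score_samples` returns the opposite of that score, and `decision_function = score_samples - offset_`, with `offset_` equal to -0.5 when `contamination="auto"`.

```python
def to_score_samples(positive_score: float) -> float:
    """sklearn's score_samples is the negated normalized anomaly score."""
    return -positive_score


def to_decision_function(positive_score: float, offset: float = -0.5) -> float:
    """decision_function = score_samples - offset_ (offset_ is -0.5 for 'auto')."""
    return to_score_samples(positive_score) - offset


def is_outlier(positive_score: float, offset: float = -0.5) -> bool:
    """sklearn's predict labels a sample as an outlier when decision_function < 0."""
    return to_decision_function(positive_score, offset) < 0.0


# A positive score of 0.7 maps to a negative decision value, hence an outlier.
print(to_decision_function(0.7), is_outlier(0.7), is_outlier(0.4))
```

With a fitted model, the `offset` argument would be taken from the model's `offset_` attribute rather than the default assumed here.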

bansalism2 commented 4 years ago

Thanks @Nirmal-Neel

SoftwareAG / nyoka

Issue with sklearn isolation Forest #30