Closed bansalism2 closed 4 years ago
Hi @bansalism2 , could you provide some more information with examples?
Hi Nirmal, definitely. The current PMML implementation predicts anomalies based on the average anomaly score from all the trees, but the actual implementation in sklearn takes more into consideration when scoring an observation as an anomaly. Citing the score calculation from the sklearn class:
```python
scores = 2 ** (
    -depths
    / (len(self.estimators_)
       * _average_path_length([self._max_samples]))
)
```

where `_average_path_length` is in turn calculated as:

```python
average_path_length[not_mask] = (
    2.0 * (np.log(n_samples_leaf[not_mask] - 1.0) + np.euler_gamma)
    - 2.0 * (n_samples_leaf[not_mask] - 1.0) / n_samples_leaf[not_mask]
)
```
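To make that normalization concrete, here is a minimal self-contained sketch of the same math (the helper names `average_path_length` and `anomaly_score` are illustrative, not sklearn's API; only the formulas come from the lines quoted above):

```python
import math

# Euler-Mascheroni constant, as used by sklearn via np.euler_gamma
EULER_GAMMA = 0.5772156649015329

def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search,
    the normalizer from the Isolation Forest paper."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1.0) + EULER_GAMMA) - 2.0 * (n - 1.0) / n

def anomaly_score(mean_depth, n_samples):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); always in (0, 1]."""
    return 2.0 ** (-mean_depth / average_path_length(n_samples))

# A point whose mean depth equals c(n) gets exactly 0.5, the
# "no distinct anomaly" boundary from the original paper.
print(anomaly_score(average_path_length(256), 256))  # 0.5
```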
Hence the outputs we see from sklearn will never match the PMML implementation. I think the PMML implementation should follow the same scoring logic as sklearn itself.
P.S. - sklearn scores can go negative (due to the offset), while PMML scores will never go negative by design.
Hi @bansalism2 , PMML also follows the same scoring procedure as scikit-learn does. Please refer to the official PMML documentation of the Anomaly Detection Model; you will find the anomaly calculation formula using the average anomaly score and `sampleDataSize`. I hope this answers your question.
Note - the average anomaly score that you get from the trees in PMML is not the final result. It always goes through the calculation you have already mentioned.
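As a sketch of that note: assuming a PMML consumer has the average path length from the trees and the model's `sampleDataSize` attribute, the final value is produced by the same `2 ** (-E[h] / c(n))` transformation, not the raw tree average (the numbers and the helper `c` below are illustrative):

```python
import math

EULER_GAMMA = 0.5772156649015329  # np.euler_gamma

def c(n):
    """Average path length normalizer shared by PMML and sklearn."""
    return 2.0 * (math.log(n - 1.0) + EULER_GAMMA) - 2.0 * (n - 1.0) / n

# Hypothetical values: the average path length returned by the PMML
# trees, and the sampleDataSize attribute of the model element.
avg_tree_score = 6.5
sample_data_size = 256

final_score = 2.0 ** (-avg_tree_score / c(sample_data_size))
print(final_score)  # the normalized result, not the raw average 6.5
```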
Thanks @Nirmal-Neel
While predicting with sklearn, the original scores contain values less than 0 as well, whereas the PMML generated by Nyoka converts everything to positive. Hence the values are highly mismatched.
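A hedged illustration of the sign mismatch: with the default `contamination="auto"`, sklearn's `decision_function` is `score_samples` (the negated normalized score) minus an `offset_` of -0.5, so any point whose normalized score exceeds 0.5 comes out negative, while a PMML consumer reports the positive normalized score directly (the score values below are made up):

```python
# Illustrative normalized scores s(x, n) in (0, 1]; 0.72 ~ anomaly.
normalized_scores = [0.35, 0.50, 0.72]

OFFSET = -0.5  # sklearn's offset_ when contamination="auto"

for s in normalized_scores:
    score_samples = -s                 # sklearn score_samples: in [-1, 0)
    decision = score_samples - OFFSET  # sklearn decision_function
    pmml_value = s                     # PMML reports the positive score
    print(s, decision, pmml_value)
# decision is negative exactly when s > 0.5, hence the sign mismatch.
```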