Open ClaudioSalvatoreArcidiacono opened 1 day ago
I have submitted a PR with my suggestions. Let me know if you would like me to make changes :)
Hi @ClaudioSalvatoreArcidiacono
Thanks for the issue.
If you set the `random_state` in the `LogisticRegression`, do you still see this behaviour?
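For reference, pinning the seed would look roughly like the sketch below. It assumes the estimator in question is scikit-learn's `LogisticRegression`; the value `0` is an arbitrary choice, not one taken from the issue.

```python
from sklearn.linear_model import LogisticRegression

# Assumption: any fixed integer works as a seed; 0 is an arbitrary pick.
# Per the scikit-learn docs, random_state is only used by the solvers that
# shuffle the data ("sag", "saga", "liblinear").
estimator = LogisticRegression(random_state=0)

print(estimator.get_params()["random_state"])
```

If the results still differ across machines with the seed fixed, the estimator's randomness is unlikely to be the cause.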
Describe the bug The behavior of SmartCorrelatedSelector is unpredictable when features are so similar that they tie on the selection criterion, for example when they have the same number of missing values or the same single-feature model performance.
In particular, I have noticed that running the exact same code, with the same Python version and package versions, on my development machine (an M1 Mac) and on a different machine (a Linux-based remote node) gives different results.
To Reproduce Run the code below on two different machines:
On one machine I get:
Whereas on another machine I get:
Notice how the dropped features differ between the two runs.
For the tests above I used the following package versions:
Expected behavior I would expect the two runs to give the exact same results.
Screenshots N/A
Additional context I have already implemented an alternative version of `SmartCorrelatedSelector` that does not have this issue, and I would like to contribute to the project by sharing my version.

There are 2 reasons for the issue above:

1. `pd.Series.sort_values` uses `quicksort` by default. The `quicksort` sorting algorithm is not stable (see pandas doc), meaning that in case of ties it does not keep the original feature order. To solve that, I have added the parameter `kind="mergesort"` to every call to `pd.Series.sort_values`. `mergesort` is a stable sorting algorithm, so it ensures the same ordering even in case of ties.
2. With `selection_method='model_performance'`, `temp_set` is defined as a set, which is a collection that does not preserve order. When this value is returned and used later on, the original order of the features is no longer preserved, so when the features are finally sorted, even with `mergesort` as the sorting algorithm, the result will differ in case of ties. To solve this second point, I have changed the `temp_set` variable to be a `list` instead of a `set()`.
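
The two fixes described above can be illustrated with a minimal sketch. This is not the code from the PR: the feature names, the missing-value counts, and the helper `collect_in_order` are hypothetical and only serve to show the idea.

```python
import pandas as pd

# Hypothetical missing-value counts: f_b ties with f_c, and f_a ties with f_d.
nan_counts = pd.Series({"f_a": 3, "f_b": 0, "f_c": 0, "f_d": 3})

# A stable sort keeps tied entries in their original relative order,
# so the ranking no longer depends on how quicksort breaks ties.
ranked = nan_counts.sort_values(kind="mergesort")
ranked_features = list(ranked.index)


def collect_in_order(groups):
    """Collect features from several groups into a list, in first-seen order.

    Hypothetical helper: a list is used instead of a set so that the result
    is the same on every run and every machine.
    """
    collected = []
    for group in groups:
        for feature in group:
            if feature not in collected:
                collected.append(feature)
    return collected


ordered = collect_in_order([["f_b", "f_c"], ["f_c", "f_a"]])
print(ranked_features)
print(ordered)
```

The membership check on a list is linear in its length, which should be acceptable for the small groups of correlated features involved here. For very large groups, a `dict` used as an ordered set would be an alternative.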