EpistasisLab / scikit-rebate

A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.
https://EpistasisLab.github.io/scikit-rebate/
MIT License

Weights are different for different runs #76

Closed. moumitam28 closed this issue 3 years ago.

moumitam28 commented 3 years ago

Hi,

I'm new to ML and was running this code to see how it works. The feature scores were constant at first, but now they change on every run. As I understand it, they should always be the same, since they don't depend on a classifier. Could you please confirm whether this is expected?

Code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from skrebate import ReliefF

# train_test_split shuffles the rows by default
features, classes = df.drop('Class', axis=1).values, df['Class'].values
X_train, X_test, y_train, y_test = train_test_split(features, classes)

arr = X_train.astype('float64')
fs = ReliefF()
fs.fit(arr, y_train)

# pair each column name with its ReliefF score
top_n = []
names = []
for feature_name, feature_score in zip(df.drop('Class', axis=1).columns,
                                       fs.feature_importances_):
    top_n.append(feature_score)
    names.append(feature_name)

a = pd.DataFrame(top_n)
b = pd.DataFrame(names)

info = pd.concat([a, b], axis=1)
info.columns = ['Score', 'Features']

# keep the 50 highest-scoring features
top = info.nlargest(50, 'Score')
ft = np.array(top['Features'])
ft
```

ryanurbs commented 3 years ago

I'm pretty sure you are getting different scores because the train_test_split() function shuffles the instances before splitting, so you are training on a different set of instances each run. The order of the instances in the dataset can also matter: ReliefF uses that order to break distance ties, which can likewise change the scores. Otherwise the scoring is deterministic, i.e. the same on every run.
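
A quick way to see the determinism (a minimal sketch; `arr` and `y_train` are the arrays from the snippet above) is to fit ReliefF twice on the same data in the same row order and compare the scores:

```python
import numpy as np
from skrebate import ReliefF

# Same data, same row order -> identical scores on every run.
fs1 = ReliefF().fit(arr, y_train)
fs2 = ReliefF().fit(arr, y_train)
print(np.allclose(fs1.feature_importances_, fs2.feature_importances_))  # True
```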

moumitam28 commented 3 years ago

Thanks a lot for pointing that out. I just checked: if train_size/test_size is not explicitly passed to train_test_split(), the test size defaults to 0.25. Somehow the split happened to be the same for the first few runs, but after that it differed, so the scores varied. Fixed the code; it is working as expected.
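
For reference, one way to make the split (and therefore the scores) reproducible across runs is to pin the shuffle with random_state. A sketch (the exact fix used above isn't shown; `random_state=0` is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# test_size=0.25 is the default; random_state pins the shuffle so every
# run trains on the same instances and ReliefF returns the same scores.
X_train, X_test, y_train, y_test = train_test_split(
    features, classes, test_size=0.25, random_state=0
)
```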