chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/

Why are my results so different on identical runs? #118

Open · AylaRT opened this issue 3 years ago

AylaRT commented 3 years ago

Hi, I apologise if this is a stupid question, but I am using CRFsuite for IOB labelling, and when I run the same experiment three times under identical conditions, the results are sometimes (not always) very different between runs. In some instances, the standard deviation of the F1 scores over the three runs is more than 5 percentage points.

For each run, I use the exact same training and test set (which are completely separate). I do use cross-validation for hyperparameter optimisation, but I set the random_state there to avoid changes between runs. So basically, I do the following with identical data 3 times:

    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn_crfsuite import metrics

    grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer,
                               verbose=True,
                               # random_state only takes effect on KFold when shuffle=True
                               cv=KFold(nr_folds, shuffle=True, random_state=42))
    grid_search.fit(x_train, y_train)
    optimised_crf = grid_search.best_estimator_
    y_pred = optimised_crf.predict(x_test)
    final_score = metrics.flat_f1_score(y_test, y_pred,
                                        average='macro', labels=["I", "O", "B"])
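Concretely, the three trials amount to something like this (a rough sketch, not my full script; `crf`, `scorer`, `hyperparam_search_space`, `nr_folds` and the data splits are the same objects as above):

```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

macro_f1_scores = []
for trial in range(3):
    # identical model, search space, data and CV seed in every trial
    gs = GridSearchCV(crf, hyperparam_search_space, scoring=scorer,
                      cv=KFold(nr_folds, shuffle=True, random_state=42))
    gs.fit(x_train, y_train)
    y_pred = gs.best_estimator_.predict(x_test)
    macro_f1_scores.append(metrics.flat_f1_score(
        y_test, y_pred, average='macro', labels=["I", "O", "B"]))

print(macro_f1_scores)  # I would expect three (near-)identical values here
```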

To illustrate, these are results from 3 identical runs on identical data:

Example 1: F1 (micro) 83.2%, 81.6%, 66.2%; F1 (macro) 71.8%, 71.6%, 57.5%
Example 2: F1 (micro) 81.1%, 77.6%, 66.7%; F1 (macro) 53.5%, 57.3%, 47.1%
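To quantify the spread, here is a quick check of Example 1's micro F1 scores with Python's statistics module:

```python
from statistics import mean, stdev

runs = [83.2, 81.6, 66.2]  # Example 1, micro F1 (%) over the three runs
print(f"mean = {mean(runs):.1f}, stdev = {stdev(runs):.1f}")
# prints: mean = 77.0, stdev = 9.4 -- well over 5 percentage points
```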

The differences are not always this large (and when they are, it is usually because one of the runs has a much lower score than the others). The micro F1 scores are also more stable than the macro F1 scores; the data is imbalanced, so a set sometimes contains only about 10% "I" labels, for instance.
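To illustrate the imbalance point with a toy example (made-up labels, not my actual data): if only 10% of the tokens are "I" and the model misses half of them, micro F1 barely moves while macro F1 drops sharply:

```python
from sklearn.metrics import f1_score

# 100 tokens, only 10 of which are "I"; the model misses 5 of those 10
y_true = ["O"] * 80 + ["B"] * 10 + ["I"] * 10
y_pred = ["O"] * 80 + ["B"] * 10 + ["I"] * 5 + ["O"] * 5

print(f1_score(y_true, y_pred, average='micro', labels=["I", "O", "B"]))  # 0.95
print(f1_score(y_true, y_pred, average='macro', labels=["I", "O", "B"]))  # ~0.88
```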

So my questions are: why are the results so different on identical runs, and is there a way to make them reproducible?

Thank you!