UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Unexpected Randomness in new SentenceLabelDataset (or BatchHardLoss)? #743

Open datistiquo opened 3 years ago

datistiquo commented 3 years ago

Hi,

I tried the SentenceLabelDataset together with BatchHardTripletLoss and optuna. I then ran the best parameters found by optuna separately, but I cannot reproduce the same results, although I should. I did the same comparison for several other setups, for example with ContrastiveLoss, and there I always get identical numbers. I wonder where this "new" randomness comes from? I suspect it has something to do with the dataset above, although it only uses np.random, which should be covered by the seed below, so I don't see why...

So I hope you can have a look, @nreimers?

So my optuna example looks like this:

import math

import numpy as np
import optuna
import torch
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import SentenceLabelDataset
from sentence_transformers.losses import BatchHardTripletLossDistanceFunction

# train_samples (list of InputExample) and seq_evaluator (my custom evaluator
# that collects its scores in .bests) are defined elsewhere in my script.

def objective(trial):
    SEED = 50
    torch.manual_seed(SEED)
    np.random.seed(SEED)

    model_name = 'bert-base-german-cased'

    train_batch_size = trial.suggest_categorical("Batch", [8, 16, 32, 64, 128])
    num_epochs = trial.suggest_categorical("Epochs", [1, 2, 3])
    warm = trial.suggest_uniform("warm", 0.0, 1.0)
    margin = trial.suggest_uniform("margin", 0.0, 2.0)

    train_data_sampler = SentenceLabelDataset(train_samples)
    train_dataloader = DataLoader(train_data_sampler, batch_size=train_batch_size)

    # warmup as a fraction of the total number of training steps
    warmup_steps = math.ceil(len(train_dataloader) * num_epochs * warm)

    model = SentenceTransformer(model_name)

    distance_metric = BatchHardTripletLossDistanceFunction.cosine_distance
    train_loss = losses.BatchHardTripletLoss(model=model, margin=margin, distance_metric=distance_metric)

    # Train the model
    model.fit(train_objectives=[(train_dataloader, train_loss)],
              evaluator=seq_evaluator,
              epochs=num_epochs,
              warmup_steps=warmup_steps)

    # best dev score seen by the evaluator during this trial
    return max(seq_evaluator.bests)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=500, timeout=None)

Then I run training with the found parameters independently, which should give the same results:


import math

import numpy as np
import torch
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import SentenceLabelDataset
from sentence_transformers.losses import BatchHardTripletLossDistanceFunction

SEED = 50
torch.manual_seed(SEED)
np.random.seed(SEED)

model_name = 'bert-base-german-cased'

# Best trial found by optuna:
# MRR@10: 0.5241730279898219
# PARAMS: {'Batch': 32, 'Epochs': 1, 'warm': 0.8444353672269882, 'margin': 1.9155052686691523}

num_epochs = 1
train_batch_size = 32
margin = 1.9155052686691523
warm = 0.8444353672269882

# train_samples and seq_evaluator are the same objects as in the optuna run
train_data_sampler = SentenceLabelDataset(train_samples)
train_dataloader = DataLoader(train_data_sampler, batch_size=train_batch_size)

model = SentenceTransformer(model_name)

distance_metric = BatchHardTripletLossDistanceFunction.cosine_distance
train_loss = losses.BatchHardTripletLoss(model=model, margin=margin, distance_metric=distance_metric)

# same rounding as in the objective above
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * warm)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=seq_evaluator,
    epochs=num_epochs,
    warmup_steps=warmup_steps
)

# This should also give MRR@10: 0.5241730279898219
datistiquo commented 3 years ago

I have been trying for a long time to figure out why these two code snippets give different results. The random numbers are the same (?) and the dataset seems to always yield the same sequence of examples... Do you see anything, @nreimers?

nreimers commented 3 years ago

Depending on the Python version, this line might be a non-deterministic operation: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/datasets/SentenceLabelDataset.py#L54
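For illustration, a minimal sketch of what I mean (my assumption: an older Python, roughly < 3.6, where the iteration order of a str-keyed dict depends on per-process hash randomization and can therefore change between interpreter runs, even with the np.random and torch seeds fixed):

# Run this script twice (e.g. on Python 3.5 with hash randomization enabled):
# the iteration order of the str-keyed dict can differ between the two
# processes, even though the insertion order is identical and no RNG is used.
label2ex = {"duplicate": [0, 1], "unique": [2], "related": [3, 4]}

for label in label2ex:   # the same kind of loop as in SentenceLabelDataset
    print(label, label2ex[label])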

Note: Fixing the seed is a bad scientific setup. Results and conclusions should not be drawn just because 50 happened to be a good seed and 49 a bad one. A proper experimental setup does not require fixing the seeds to arrive at the same conclusions.

datistiquo commented 3 years ago

Depending on the Python version, this line might be a non-deterministic operation: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/datasets/SentenceLabelDataset.py#L54

Which part exactly is non-deterministic? I checked, and the sequences are the same for a specific seed... How could this be fixed? Otherwise the results are not usable in practice and also not reproducible.

Note: Fixing the seed is a bad scientific setup.

What about reproducibility? If you do not fix the seed, you get very different results on every fit. Also, I fixed it just for the comparison and ran other seeds too, like in that one paper about seeds and weights! I cannot really follow, because seeds do matter for these tasks: results can vary strongly, and without a fixed seed they are not reproducible. I do not see why this is bad. Say you have a downstream task like question answering; you would then tune your parameters under a specific seed. Otherwise you cannot verify the found result, because it comes out differently and might not be as good as reported.

That paper also runs 20 seeds and compares the results. So that is not scientific? ;-)

nreimers commented 3 years ago

What about reproducibility? If you do not fix the seed, you get very different results on every fit.

That is the issue: from a single seed, you cannot say whether e.g. LossFunction1 or LossFunction2 is better. The difference might just be due to the seed.

Hence, to check for this, you must train both with a large number of random seeds, e.g. 10 different random seeds, average the results, and check whether the difference is statistically significant.

But in that case, you don't need to fix the seeds (e.g. seeds 1 ... 10). You can just train with 10 random seeds.

If your setup is scientifically sound, you will come to the same conclusion independent of which seeds you use. You could use the seeds 1, 2, 3, ..., 10, or 20, 30, ..., or 540, 541, ...

You should come to the same conclusion (+/- what you compute for your statistical significance).

As the concrete seeds are not important, fixing the seeds is not necessary.
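For instance, a minimal sketch of that kind of comparison (scores_over_random_seeds and the commented-out training functions are hypothetical helpers, not part of the library; Welch's t-test is just one possible significance check):

import numpy as np
from scipy import stats

def scores_over_random_seeds(train_and_eval_once, n_runs=10):
    # train_and_eval_once is a hypothetical callable: it trains one model
    # WITHOUT a fixed seed and returns its dev score (e.g. MRR@10).
    return np.array([train_and_eval_once() for _ in range(n_runs)])

# scores_loss1 = scores_over_random_seeds(train_with_batch_hard_loss)
# scores_loss2 = scores_over_random_seeds(train_with_contrastive_loss)
#
# print("Loss1: %.4f +/- %.4f" % (scores_loss1.mean(), scores_loss1.std()))
# print("Loss2: %.4f +/- %.4f" % (scores_loss2.mean(), scores_loss2.std()))
#
# Welch's t-test as one way to check whether the difference is significant:
# print(stats.ttest_ind(scores_loss1, scores_loss2, equal_var=False))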

datistiquo commented 3 years ago

You mean this in a scientific context? I mean it in a practical production context.

What about reproducibility? If you do not fix the seed, you get very different results on every fit.

That is the issue: from a single seed, you cannot say whether e.g. LossFunction1 or LossFunction2 is better. The difference might just be due to the seed.

What I mean is: I want a model, for example for identifying duplicates, that yields the same performance (measured by some metric) as the one found during parameter tuning. But this is not possible without fixing the seed, because the performance changes and mostly gets worse.

And again, the paper about the seeds... I think they also fix the seeds, run 20 of them, and find that some seeds yield better performance...

PaulForInvent commented 3 years ago

@nreimers Is iterating through a dict via

for e in label2ex: ...

the same non-deterministic operation as iterating via items()?

nreimers commented 3 years ago

@PaulForInvent Yes, in old Python versions dicts are not ordered. In newer versions they are, but you should not rely on it.

Also note that PyTorch uses non-deterministic algorithms, which can be disabled: https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
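A minimal sketch of such a setup, based on that documentation page (my assumption: PyTorch >= 1.8; older versions expose torch.set_deterministic instead, and some CUDA ops simply have no deterministic implementation):

import os
import random

import numpy as np
import torch

SEED = 50
# Note: PYTHONHASHSEED only affects str hashing if it is set before the
# interpreter starts, e.g. in the shell: PYTHONHASHSEED=50 python train.py
os.environ["PYTHONHASHSEED"] = str(SEED)
# Needed for deterministic cuBLAS behaviour on CUDA >= 10.2
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Raises an error as soon as an op without a deterministic implementation is used
torch.use_deterministic_algorithms(True)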

PaulForInvent commented 3 years ago

By old versions, do you mean < 3.7?

I experienced something very strange. I set all seeds and verified that the results were the same. But the next day, after restarting everything, I wanted to check again and all results were different. So there has to be some randomness. I train on a CUDA GPU with a dataset that uses this dict loop. Still, setting the seeds always gives the same results, at least within the same session.

Not using this dict loop seems to always give the same results, even with CUDA. Do you know when and where in your framework these PyTorch non-deterministic algorithms are used?

nreimers commented 3 years ago

Sadly don't know.

As mentioned, fixing seeds is a bad scientific setup so I never do it.

PaulForInvent commented 3 years ago

Anyway thanks.

Sadly don't know.

You mean about the non-deterministic algorithms? Is it safe to use your framework with PyTorch set to deterministic mode?

nreimers commented 3 years ago

Try it and see if you get a warning.
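If you only want to see which ops are affected instead of failing hard, something like this sketch might help (assumption: PyTorch >= 1.11, which added the warn_only flag; older versions raise an error instead):

import torch

# Emit a warning (instead of raising) whenever an operation without a
# deterministic implementation is executed.
torch.use_deterministic_algorithms(True, warn_only=True)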

PaulForInvent commented 3 years ago

As mentioned, fixing seeds is a bad scientific setup

I understand what you mean by scientific. But deterministic results are still useful for confirming specific results and for experimentation, debugging, etc., right?

If you cannot use a newer Python version (I am limited to 3.6), do you have any suggestion for an alternative to this dict loop? Maybe I will try replacing it and check whether that helps...

nreimers commented 3 years ago

You can try something like for e in sorted(list(label2ex.keys()))
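Applied to the grouping loop in SentenceLabelDataset, that could look roughly like the sketch below (the variable names and the "at least 2 examples per label" check are my assumptions about the surrounding code, not a verified copy of it):

grouped_inputs = []
groups_right_border = []

# Iterate over the labels in a fixed, sorted order instead of the dict's own
# order, so the grouping is identical across interpreter runs.
for label in sorted(label2ex.keys()):
    examples = label2ex[label]
    if len(examples) >= 2:   # a label needs at least 2 examples to form pairs/triplets
        grouped_inputs.extend(examples)
        groups_right_border.append(len(grouped_inputs))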

PaulForInvent commented 3 years ago

@nreimers I trained a model and saved it via model.save(). Loading it now does not reproduce the same results as the run in which it was trained and saved. Is this possible? Why does loading a saved model not give exactly the same results?

nreimers commented 3 years ago

This should not happen. Are you sure you saved the right model / got the scores for the right model?

The fit method saves the model with the highest dev score. If you call save after fit, the latest model is saved.
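In code, my understanding of that difference (assuming the classic fit() signature with output_path and save_best_model) is roughly:

# The checkpoint with the best dev score is written to output_path during fit()
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=seq_evaluator,
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path="output/best-model",
    save_best_model=True,
)

# This, in contrast, stores the weights from the *last* training step,
# which is not necessarily the best checkpoint.
model.save("output/final-model")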

PaulForInvent commented 3 years ago

I think this was the problem: I saved the trained model afterwards, which is the model from the last epoch, but that does not need to be the best model from a previous epoch...

PaulForInvent commented 3 years ago

@nreimers Do you have any suggestion for saving only the best model, without saving all models by default? This comes up in hyperparameter tuning. Without customizing your framework to keep an instance variable for the current best model, I would somehow have to save all models...

nreimers commented 3 years ago

@PaulForInvent Sadly I don't have a good solution. When doing hyperparameter tuning, you can check whether the performance is above a certain threshold; if not, you could delete the model again.
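One possible way to combine that with the optuna objective from above, just as a sketch (the per-trial output paths and the "keep only if it beats the best finished trial" rule are my own choices, not library features):

import shutil

def objective(trial):
    # ... build train_dataloader, train_loss, model, seq_evaluator and the
    #     hyperparameters exactly as in the objective further up ...
    output_path = "output/trial-%d" % trial.number

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              evaluator=seq_evaluator,
              epochs=num_epochs,
              warmup_steps=warmup_steps,
              output_path=output_path,
              save_best_model=True)

    score = max(seq_evaluator.bests)

    try:
        best_so_far = trial.study.best_value   # best value among finished trials
    except ValueError:                         # no trial has finished yet
        best_so_far = None

    if best_so_far is not None and score <= best_so_far:
        # This trial did not improve on the best one so far: drop its checkpoint.
        shutil.rmtree(output_path, ignore_errors=True)

    return score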