dwadden / multivers

Code and model checkpoints for the MultiVerS model for scientific claim verification.
MIT License

About reproducibility #9

Closed · helena-1115 closed this issue 1 year ago

helena-1115 commented 1 year ago

Hi, David:) Thanks a lot for sharing your great work!

I am currently struggling to reproduce the results. I tried fine-tuning on the covidfact and scifact_20 datasets with one GPU (RTX 3090), but the results are different each time.

Looking at the metrics.csv file (in the checkpoints folder) generated during fine-tuning, the label_loss, rationale_loss, and loss values are recorded differently each time, so it seems that randomness is not controlled during training.
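As a quick way to confirm the divergence, the loss columns from two runs can be compared directly. A minimal sketch; the run directory names here are hypothetical, adjust them to wherever Lightning wrote each run's metrics.csv:

```python
import pandas as pd

# Load the logged metrics from two fine-tuning runs.
# "run_a" / "run_b" are placeholder directory names.
run_a = pd.read_csv("checkpoints/run_a/metrics.csv")
run_b = pd.read_csv("checkpoints/run_b/metrics.csv")

for col in ["label_loss", "rationale_loss", "loss"]:
    # Rows that don't log a given metric are NaN; .max() skips them.
    diff = (run_a[col] - run_b[col]).abs().max()
    print(f"{col}: max absolute difference across steps = {diff:.6f}")
```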

When I looked at the code, I don't think the dataloader is the problem, since its seed is fixed. I tried adding seed-setting code (in anaconda3/envs/multivers/lib/python3.8/site-packages/pytorch_lightning/utilities/seed.py) as shown below, but training is still not reproducible.

[Screenshot: seed-setting code added to seed.py]
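In text form, the kind of seed-setting tried there might look roughly like this. A sketch, not the exact code from the screenshot; the seed value is an arbitrary assumption, and `workers=True` requires a recent-enough Lightning version:

```python
import os

import pytorch_lightning as pl
import torch

def make_deterministic(seed: int = 76) -> None:
    """Best-effort determinism for GPU training; the seed is arbitrary."""
    # Seeds Python's random, NumPy, and torch (CPU + CUDA); with
    # workers=True it also seeds dataloader worker processes.
    pl.seed_everything(seed, workers=True)

    # Make cuDNN pick deterministic kernels and disable autotuning,
    # which can otherwise select different algorithms on each run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Raise an error on ops that have no deterministic implementation
    # (torch >= 1.8) instead of letting them vary silently.
    torch.use_deterministic_algorithms(True)

    # Needed for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```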

I wonder if there are any other parts that need modification, or if some parts are simply difficult to reproduce perfectly. Thank you~~

dwadden commented 1 year ago

Thanks for raising this issue. I tried to control for randomness by calling pl.seed_everything in the training script. Unfortunately, I think that GPU training is inherently nondeterministic for reasons that I don't understand. Are you getting results within the same ballpark (within an F1 point or two) across runs, or are you getting wildly different results?
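(Note: Lightning's Trainer also exposes a `deterministic` flag that routes through torch's deterministic-algorithms machinery. A minimal sketch, where every argument except `deterministic` is an assumed placeholder rather than a MultiVerS setting:)

```python
import pytorch_lightning as pl

# With deterministic=True, ops lacking a deterministic CUDA kernel
# raise an error instead of silently varying between runs.
trainer = pl.Trainer(
    deterministic=True,
    gpus=1,          # single GPU, as in the issue
    max_epochs=20,   # illustrative value only
)
```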

helena-1115 commented 1 year ago

In the case of the covidfact dataset, I'm getting results within the same ballpark (within an F1 point or two), so that seems fine. For the scifact_20 dataset, the results are quite different, so I'm thinking of averaging results over multiple runs. Thanks for the quick response:)

dwadden commented 1 year ago

Hmm, interesting. Can you give me a sense of how different they are for scifact_20? I can try a few training runs and see if I get the same level of variation.

helena-1115 commented 1 year ago

Yes:) For the scifact_20 dataset, the results differ across runs, as shown in the two figures below!

[Screenshots: scifact_20 evaluation results from two training runs]

dwadden commented 1 year ago

Hmm so the results seem to vary by a few F1 points. Interesting. I'm traveling for the next couple weeks, but when I get back I'll try to take a look.

dwadden commented 1 year ago

Unfortunately I think I'm not going to have the bandwidth to run more experiments on this. I agree it's strange that there's so much variance between runs. If you have results on this, feel free to submit a Markdown document to the doc folder describing your findings and I can merge it in; that way we'll at least have this issue recorded.

helena-1115 commented 1 year ago

Thank you for taking a look! Perhaps trying multiple experiments and averaging the results could be a potential solution.
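For example, a minimal sketch of that averaging (the F1 values below are placeholders, not real scifact_20 numbers):

```python
from statistics import mean, stdev

# F1 scores from repeated fine-tuning runs with different seeds.
# These numbers are illustrative placeholders only.
f1_scores = [55.2, 58.9, 53.7, 57.1]

print(f"F1 over {len(f1_scores)} runs: "
      f"{mean(f1_scores):.1f} +/- {stdev(f1_scores):.1f}")
```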