SainsburyWellcomeCentre / crabs-exploration

A toolkit for detecting and tracking crabs in the field.
BSD 3-Clause "New" or "Revised" License

Fix validation and test split not being reproducible #218

Closed · sfmig closed this 4 weeks ago

sfmig commented 4 months ago

Why is this PR needed?

Currently, when we create the test and validation splits, we don't pass a generator; we do pass one when we create the training split.

This means that, given a seed, the splitting of the dataset into train and test-val sets is reproducible, but the subsequent splitting of the test-val set into a test set and a validation set is not.
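For illustration, a minimal sketch of the problem (the dataset, seed, and helper are made-up stand-ins, not the repo's actual datamodule code): the first `random_split` receives a seeded generator, the second does not, so the second split draws from the global RNG and changes between runs.

```python
import torch
from torch.utils.data import random_split

dataset = torch.arange(100)  # stand-in for the full dataset
seed = 42

def make_splits(seed):
    # Reproducible: the train vs. test-val split uses a seeded generator.
    # (Fractional lengths require torch >= 1.13.)
    train, test_val = random_split(
        dataset, [0.8, 0.2], generator=torch.Generator().manual_seed(seed)
    )
    # Not reproducible: with no generator argument, random_split falls
    # back on the global RNG, whose state varies between runs and calls.
    test, val = random_split(test_val, [0.5, 0.5])
    return train, test, val

a = make_splits(seed)
b = make_splits(seed)
print(a[0].indices == b[0].indices)  # True: train split is repeatable
print(a[1].indices == b[1].indices)  # typically False: test split is not
```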

What does this PR do?

This PR passes a generator to each call to `random_split`, so that the splitting of the test-val set into test and validation sets is also reproducible given a seed.

Smaller bits

Notes

I decided to pass a different generator for each call to `random_split` to try to make it a bit "future-proof". That way we guarantee the splits are repeatable even if some randomisation code is added in between the two calls to `random_split`.
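A minimal sketch of that approach (again with stand-in names, not the exact datamodule code), where each call gets its own freshly constructed, seeded generator:

```python
import torch
from torch.utils.data import random_split

dataset = torch.arange(100)  # stand-in for the full dataset
seed = 42

# Split 1: dataset -> train + test_val, with its own generator.
train, test_val = random_split(
    dataset, [0.8, 0.2], generator=torch.Generator().manual_seed(seed)
)

# ... any randomisation code added here cannot shift the next split,
# because the generator below is constructed and seeded at the call site.

# Split 2: test_val -> test + val, with a second generator.
test, val = random_split(
    test_val, [0.5, 0.5], generator=torch.Generator().manual_seed(seed)
)
```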

codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.

Project coverage is 47.75%. Comparing base (a21d4f1) to head (c678df6).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crabs/detector/datamodules.py | 75.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #218      +/-   ##
==========================================
+ Coverage   46.68%   47.75%   +1.07%
==========================================
  Files          24       24
  Lines        1476     1493      +17
==========================================
+ Hits          689      713      +24
+ Misses        787      780       -7
```

:umbrella: View full report in Codecov by Sentry.

sfmig commented 4 months ago

> Other option, can we just use `seed_everything`?

Cool, I didn't know about this!

I think for now I'd prefer to constrain the seeding to the dataset creation, because that is the part I need to be reproducible. But good to have this on the radar.
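For contrast, a sketch of the global alternative mentioned above (assuming a Lightning version that exposes it under `lightning.pytorch`; older versions import it from `pytorch_lightning`):

```python
# seed_everything seeds Python's `random`, NumPy, and PyTorch globally,
# so even an un-seeded random_split becomes deterministic. The trade-off
# is that it pins all randomness in the process, not just the part that
# creates the dataset splits.
from lightning.pytorch import seed_everything

seed_everything(42, workers=True)
```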

sfmig commented 4 weeks ago

thanks for the help Nik! 🌟