fastai / fastai_dev

fast.ai early development experiments
Apache License 2.0

_get_files doesn't return files in a deterministic order across OSes #239

Open dcato98 opened 4 years ago

dcato98 commented 4 years ago

_get_files in local.data.transforms.py doesn't return files in a deterministic order across OSes.

This is an issue when getting files, then splitting using a fixed seed. For example, in 08_pets_tutorial.ipynb (I added the seed parameter):

items = get_image_files(source)
split_idx = RandomSplitter(seed=42)(items)

In this case, two users on different OSes would get the same split_idx but different train/validation sets, because the same indices point at different files when items is ordered differently.

It would be straightforward for a user to correct this by sorting items before passing this list into the splitter, but I wouldn't expect that many people would know to do this.
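To make the workaround concrete, here is a minimal sketch using only the standard library. `seeded_split` is a hypothetical stand-in for a seeded splitter, not fastai's `RandomSplitter`, and the file names are made up. It shows that the indices depend only on the seed and the list length, so the actual sets only agree once both lists are sorted first.

```python
import random
from pathlib import Path

def seeded_split(items, valid_pct=0.2, seed=42):
    """Hypothetical stand-in for a seeded splitter: returns (train_idx, valid_idx)."""
    rng = random.Random(seed)          # private RNG, so global random state is untouched
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(items))
    return idxs[cut:], idxs[:cut]

# The same five (made-up) files, listed in two different orders,
# as two machines might return them.
machine_a = [Path(n) for n in ["img_b.jpg", "img_a.jpg", "img_d.jpg", "img_c.jpg", "img_e.jpg"]]
machine_b = list(reversed(machine_a))

def valid_set(items):
    """Resolve the validation indices to the actual files they point at."""
    _, valid_idx = seeded_split(items)
    return {items[i] for i in valid_idx}

# The indices are identical on both machines...
same_indices = seeded_split(machine_a) == seeded_split(machine_b)
# ...and the files they select are guaranteed to match once the input is sorted.
sorted_match = valid_set(sorted(machine_a)) == valid_set(sorted(machine_b))
```

With the real API the equivalent would presumably be sorting `items` before calling the splitter, as described above.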

rmkn85 commented 4 years ago

I ran into this when matching a sorted CSV file of labels against the files in a folder, only to find that get_image_files was returning them in an arbitrary, unsorted order. I also think it's good practice to sort by default so the result is deterministic (and then shuffle when needed).
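For the CSV case specifically, one way to avoid depending on order at all is to look labels up by file name instead of by position. This is only a sketch: the `name` and `label` column names, the CSV contents, and the file paths are all assumptions for illustration.

```python
import csv
import io
from pathlib import Path

# Hypothetical label file; in practice this would be read from disk.
csv_text = "name,label\ncat_1.jpg,cat\ndog_1.jpg,dog\ncat_2.jpg,cat\n"
labels = {row["name"]: row["label"] for row in csv.DictReader(io.StringIO(csv_text))}

# Files in whatever order the directory listing happened to return them.
files = [Path("images/dog_1.jpg"), Path("images/cat_2.jpg"), Path("images/cat_1.jpg")]

# Pair each file with its label by name, so listing order no longer matters.
labelled = [(f, labels[f.name]) for f in files]
```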

tacchinotacchi commented 4 years ago

It may be possible to introduce some sorting criteria, but I'm wondering whether sorting everything alphabetically could be a problem. Maybe they can be sorted based on a hash function on the filename?

EDIT: it would also be possible to sort, then shuffle randomly. The order would be deterministic if the random seed is set.
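Both ideas can be sketched with the standard library. The function names here are hypothetical. One caveat for the hash variant: Python's built-in `hash()` on strings is salted per process by default, so a stable digest such as `hashlib.sha1` is assumed instead.

```python
import hashlib
import random

def hash_key(name):
    """Stable sort key: a SHA-1 digest of the file name (unlike built-in hash(), not salted)."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()

def hash_ordered(names):
    """Deterministic but non-alphabetical order."""
    return sorted(names, key=hash_key)

def sorted_then_shuffled(names, seed=42):
    """Sort first for a canonical starting point, then shuffle with a fixed seed."""
    out = sorted(names)
    random.Random(seed).shuffle(out)
    return out

# Two arbitrary listings of the same (made-up) files.
listing_a = ["c.jpg", "a.jpg", "d.jpg", "b.jpg"]
listing_b = ["b.jpg", "d.jpg", "a.jpg", "c.jpg"]

hash_orders_agree = hash_ordered(listing_a) == hash_ordered(listing_b)
shuffles_agree = sorted_then_shuffled(listing_a) == sorted_then_shuffled(listing_b)
```

Either way the result no longer depends on the order the files arrive in, only on the names (and the seed, in the second case).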

dcato98 commented 4 years ago

I originally thought this was only an issue across different OSes, but I have now noticed that it returns a different order on two different Ubuntu systems.
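That would be consistent with Python's `os.listdir` and `os.walk` returning entries in arbitrary order, which can vary between machines on the same OS. A minimal sketch of a deterministic recursive listing follows. `list_files_deterministic` is a hypothetical helper, not the actual _get_files implementation.

```python
import os
import tempfile
from pathlib import Path

def list_files_deterministic(root):
    """Hypothetical helper: recursively list files under root in a reproducible order."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort fixes the order in which subdirectories are visited
        found.extend(Path(dirpath) / name for name in sorted(filenames))
    return found

# Build a small throwaway tree to try it on.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["b/2.jpg", "a/1.jpg", "a/0.jpg", "c.jpg"]:
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    listing = [p.relative_to(tmp).as_posix() for p in list_files_deterministic(tmp)]
```

Files in the top-level directory come first, then each subdirectory in sorted order. Calling `sorted()` on the final list would give a fully sorted result instead.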