psteinb commented 4 months ago

As just discussed here, it might be worth considering replacing CIFAR-10 with some other dataset.

psteinb commented 4 months ago

I'd propose fashionMNIST. This should be as simple as

keras.datasets.fashion_mnist.load_data()

tobyhodges commented 4 months ago

Thanks so much for the suggestion, @psteinb. As I mentioned on the review thread, I plan to spend some time trying to replace CIFAR-10 in the episode soon, hopefully next week. Your proposal seems like a great place to start. I will report back here on my progress...

svenvanderburg commented 4 months ago

+1 for fashionMNIST. @tobyhodges, let me know if you don't have time, then I will pick this up!

svenvanderburg commented 4 months ago

@tobyhodges I would suggest first testing things out in one or more Jupyter notebooks and having someone review them before changing the actual lesson material.

psteinb commented 4 months ago

Re: fashionMNIST, while this is a nice dataset (easy to communicate about), there are two differences to CIFAR10:

different shape: 28x28 instead of 32x32
fashionMNIST is greyscale instead of RGB

We should keep this in mind as it will affect teaching.
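To make the teaching impact of these two differences concrete, here is a minimal sketch that compares the size of a first convolutional layer for both input types. The layer configuration (50 filters of size 3x3) is an assumption chosen for illustration and is not taken from this thread; the formula is the standard parameter count for a 2D convolution with a bias term.

```python
def conv2d_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Number of trainable parameters in a standard 2D convolution layer."""
    kernel_h, kernel_w = kernel_size
    weights_per_filter = kernel_h * kernel_w * in_channels
    bias_per_filter = 1 if use_bias else 0
    return (weights_per_filter + bias_per_filter) * filters


# Per-image shapes in (height, width, channels) order.
CIFAR10_SHAPE = (32, 32, 3)        # RGB
FASHION_MNIST_SHAPE = (28, 28, 1)  # greyscale, once a channel axis is added

# Hypothetical first layer, assumed for illustration only.
FILTERS = 50
KERNEL = (3, 3)

rgb_params = conv2d_param_count(KERNEL, CIFAR10_SHAPE[2], FILTERS)
grey_params = conv2d_param_count(KERNEL, FASHION_MNIST_SHAPE[2], FILTERS)

print(f"first conv layer on RGB input:       {rgb_params} parameters")
print(f"first conv layer on greyscale input: {grey_params} parameters")
```

Any parameter counts or input shapes quoted in the lesson text would need updating accordingly if the dataset changes.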

tobyhodges commented 3 months ago

I tried to work through the episode using fashion-MNIST as suggested by @psteinb. You can see my process and results in https://github.com/carpentries-incubator/deep-learning-intro/blob/testing_fashion_MNIST/fashion_MNIST.ipynb

A summary of my observations:

Please check my working. I might have implemented the convolutional layers incorrectly, in which case my results can be ignored! I needed to adjust the code to account for the single channel of the images, and I am not certain I did that correctly.
The differences between the results of the different kinds of NN do not appear to be as stark when applied to this dataset. As such I think there are a few statements, plus possibly the final challenge where we vary the dropout rate, that might no longer hold/be interesting (again, assuming my implementation was correct!)
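For anyone checking the single-channel adjustment mentioned above, one common approach is to add an explicit channel axis so that the images have the four-dimensional `(batch, height, width, channels)` layout that convolutional layers expect by default in Keras. This is a sketch using a zero-filled placeholder array instead of the real data, and it is not necessarily what the linked notebook does.

```python
import numpy as np

# Placeholder for a batch of greyscale images shaped (n_images, height, width).
# Real Fashion-MNIST arrays loaded through Keras have the same layout,
# just with 60000 training images instead of 5.
images = np.zeros((5, 28, 28), dtype=np.uint8)

# Add a trailing channel axis: (5, 28, 28) -> (5, 28, 28, 1).
images_with_channel = images[..., np.newaxis]

# Scale pixel values to the range [0, 1], as is commonly done before training.
images_scaled = images_with_channel.astype("float32") / 255.0

print(images.shape, "->", images_with_channel.shape)
```

The model's input shape would then be `(28, 28, 1)` instead of the `(32, 32, 3)` used for CIFAR-10.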

svenvanderburg commented 3 months ago

@tobyhodges thank you for working on this! This looks great actually 👍 Seems like you are having fun with keras :)

That looks good
Yes, you are right that the differences are not very strong. It seems like the first simple model is actually a really good model that is hard to improve. The val accuracy of 0.856 of the first model is actually not beaten in the rest of the episode. I will see if I can tweak things a little bit to make this fit better in the storyline. Or we can try other datasets, for example the original MNIST (but this is an even simpler problem, so I don't think it will help) or https://data.caltech.edu/records/mzrjq-6wc02 (seems to have a CC-BY 4.0 license but I suspect that the images are crawled).

svenvanderburg commented 3 months ago

OK, it seems like the fashionMNIST dataset is a bit of a weird ML problem. It is very hard to overfit, and regularization only results in worse performance in the end ;) I tried some approaches to force it into overfitting (bigger models, smaller dataset), but without good results.
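One simple way to quantify "how much a model overfits" when comparing datasets is to look at the gap between training and validation accuracy per epoch. The sketch below assumes accuracy histories are available as plain lists; the numbers are invented for illustration and are not results from any notebook in this thread, and the 0.05 threshold is an arbitrary assumption.

```python
def generalization_gap(train_acc, val_acc):
    """Per-epoch difference between training and validation accuracy."""
    if len(train_acc) != len(val_acc):
        raise ValueError("histories must cover the same number of epochs")
    return [t - v for t, v in zip(train_acc, val_acc)]


def first_overfitting_epoch(train_acc, val_acc, threshold=0.05):
    """Return the 1-based epoch where the gap first exceeds threshold, else None."""
    gaps = generalization_gap(train_acc, val_acc)
    for epoch, gap in enumerate(gaps, start=1):
        if gap > threshold:
            return epoch
    return None


# Invented accuracy curves, for illustration only.
train_history = [0.60, 0.72, 0.80, 0.88, 0.94]
val_history = [0.58, 0.69, 0.73, 0.74, 0.73]

epoch = first_overfitting_epoch(train_history, val_history)
print("gap first exceeds threshold at epoch:", epoch)
```

A dataset that is hard to overfit would show a gap that stays small for every epoch, in which case the function returns `None`.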

I am looking into other datasets, but oh my god there are so many CC-BY licensed datasets that are actually just crawled from the web.

tobyhodges commented 3 months ago

> there are so many CC-BY licensed datasets that are actually just crawled from the web

I think we should put a callout into the episode that addresses this TBH

svenvanderburg commented 3 months ago

I am now looking into the dollar street dataset. It is from Gapminder, so it fits nicely with the Carpentries philosophy!

colinsauze commented 3 months ago

The dollar street dataset is 101GB! Is there a subset of it available?

svenvanderburg commented 3 months ago

@colinsauze I haven't found a subset, but I just downloaded it and if it works well with the lesson I will make a subset available with low-res pictures.

tobyhodges commented 3 months ago

If the license permits, we can publish a subset on FigShare, Zenodo, or similar.

svenvanderburg commented 3 months ago

Dollar street dataset

Check out my notebook using the dollar street dataset for episode 4.

Results

Simple CNN, val accuracy: 0.26
Simple CNN with dropout, val accuracy: 0.33 and overfitting is reduced a bit
Pretrained SOTA CNN, val accuracy: 0.67 and barely overfitting
Dense neural network, val accuracy: 0.17
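As a quick arithmetic check on the figures listed above, the following sketch ranks the four models and computes how the pretrained model compares with the dropout CNN. The dictionary keys are shorthand labels introduced here; the accuracies are the ones reported in this comment.

```python
# Validation accuracies as reported in the results above.
val_accuracy = {
    "dense": 0.17,
    "simple_cnn": 0.26,
    "simple_cnn_dropout": 0.33,
    "pretrained_cnn": 0.67,
}

# Rank the models from worst to best validation accuracy.
ranked = sorted(val_accuracy.items(), key=lambda item: item[1])
for name, accuracy in ranked:
    print(f"{name:20s} {accuracy:.2f}")

# Ratio between the pretrained model and the best model trained from scratch.
ratio = val_accuracy["pretrained_cnn"] / val_accuracy["simple_cnn_dropout"]
print(f"pretrained vs dropout CNN: {ratio:.2f}x")
```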

Conclusion

This dataset really allows us to demonstrate all our points:

CNNs work better on image data than dense networks
Dropout reduces overfitting
Pretrained models with a large, established CNN architecture work really well on image data
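For readers unfamiliar with the second point, here is a minimal pure-Python sketch of what a dropout layer does during training, using the "inverted dropout" formulation in which surviving activations are scaled by `1 / (1 - rate)` so that the expected output is unchanged. The function name and the example values are hypothetical; this is an illustration of the idea, not the Keras implementation.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations and rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time dropout is a no-op.
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)      # seeded so the example is repeatable
layer_output = [1.0] * 8    # placeholder activations

dropped = dropout(layer_output, rate=0.5, rng=rng)
unchanged = dropout(layer_output, rate=0.5, rng=rng, training=False)

print("training: ", dropped)
print("inference:", unchanged)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which is the usual intuition for why dropout reduces overfitting.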

Next steps

I think that with a few small adaptations to the story in episode 4 we can use this dataset.
The episode will not end very satisfactorily: even with dropout we only get around 30% accuracy. This would be a nice bridge to episode 5: transfer learning. There we show that a pretrained neural network can more than double the accuracy in this case.
I will upload the data to Zenodo or Figshare in a format that loads directly as NumPy arrays of train and validation images and labels (or should the images be stored as JPEG?).
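One possible shape for the NumPy option is a single compressed `.npz` archive holding all four arrays. The sketch below uses tiny zero-filled placeholders and an in-memory buffer; the array names, image size, and label values are assumptions for illustration and do not describe the file that was eventually published.

```python
import io

import numpy as np

# Tiny placeholder arrays; names, shapes and labels are assumptions.
train_images = np.zeros((4, 64, 64, 3), dtype=np.uint8)
train_labels = np.array([0, 1, 2, 1], dtype=np.int64)
val_images = np.zeros((2, 64, 64, 3), dtype=np.uint8)
val_labels = np.array([2, 0], dtype=np.int64)

# An in-memory buffer stands in for a file; a real workflow would pass a
# filename such as "dataset.npz" (hypothetical) instead.
buffer = io.BytesIO()
np.savez_compressed(
    buffer,
    train_images=train_images,
    train_labels=train_labels,
    val_images=val_images,
    val_labels=val_labels,
)

# Loading side: this is all a learner would need to run in the episode.
buffer.seek(0)
with np.load(buffer) as data:
    loaded_train_images = data["train_images"]
    loaded_val_labels = data["val_labels"]

print(loaded_train_images.shape, loaded_val_labels)
```

Compared with shipping JPEG files, this keeps the loading step to a couple of lines and avoids any image-decoding dependency in the lesson, at the cost of a less inspectable download.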

@colinsauze @psteinb @dsmits @tobyhodges what do you think? I think we should choose between FashionMNIST and dollar street

psteinb commented 3 months ago

Use the dollar street dataset. Looking back at this conversation, this choice would not require too much adaptation of the text.

Best, P

tobyhodges commented 3 months ago

This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.

One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?

tobyhodges commented 3 months ago

Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?

svenvanderburg commented 3 months ago

> This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.
>
> One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?

The data preprocessing will not be done in the course. The starting point of the episode will be to load the data in its preprocessed form, so that we can focus on deep learning instead of image data wrangling.

svenvanderburg commented 3 months ago

> Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?

I want to get this through, it has been hanging for so long now. So I will draft something today. But I will organise a new sprint soon to pick up the remaining maintenance issues. I hope that's OK?

svenvanderburg commented 3 months ago

I'm working on it here: https://github.com/carpentries-incubator/deep-learning-intro/pull/448 The data is located here: https://zenodo.org/records/10837090

I hope to continue this next week. If anyone wants to pick this up in the meantime, you are welcome! (Or start on the transfer learning episode, for example, which would be really nice to add now.)

svenvanderburg commented 3 months ago

Argh... I have very little time for this now. I plan to pick this up again 15th and 16th of April.

tobyhodges commented 3 months ago

If you are happy for me to commit to your branch, @svenvanderburg, I can try to step in and make some further changes?

carpentries-incubator / deep-learning-intro

CIFAR10 not having a license #445
