Closed by psteinb 2 months ago
I'd propose fashionMNIST. This should be as simple as
keras.datasets.fashion_mnist.load_data()
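For anyone wanting to try the suggestion above, here is a minimal sketch of what loading could look like. The helper names `load_fashion_mnist` and `label_name` are hypothetical and used only for illustration; the `from tensorflow import keras` import assumes a TensorFlow-bundled Keras install (with standalone Keras, `import keras` would be the equivalent). The first call downloads the data, so it needs network access.

```python
# The ten Fashion-MNIST classes, indexed by their integer label (0-9).
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def load_fashion_mnist():
    """Return ((x_train, y_train), (x_test, y_test)) as NumPy arrays.

    Hypothetical wrapper: the import is done lazily so that the rest of
    this sketch can be used without TensorFlow installed.
    """
    from tensorflow import keras  # assumption: TensorFlow-bundled Keras

    return keras.datasets.fashion_mnist.load_data()


def label_name(label):
    """Map an integer label (plain int or NumPy integer) to its class name."""
    return CLASS_NAMES[int(label)]
```

A possible use would be `(x_train, y_train), (x_test, y_test) = load_fashion_mnist()` followed by `label_name(y_train[0])`. The images are 28x28 grayscale, which is one of the ways the dataset differs from the 32x32 colour images in CIFAR-10.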
Thanks so much for the suggestion, @psteinb. As I mentioned on the review thread, I plan to spend some time trying to replace CIFAR-10 in the episode soon, hopefully next week. Your proposal seems like a great place to start. I will report back here on my progress...
+1 for fashionMNIST. @tobyhodges, let me know if you don't have time, then I will pick this up!
@tobyhodges I would suggest first testing things out in one or more Jupyter notebooks and having someone review them, before changing the actual lesson material.
Re: fashionMNIST, while this is a nice dataset (easy to communicate about), there are two differences from CIFAR10:
I tried to work through the episode using fashion-MNIST as suggested by @psteinb. You can see my process and results in https://github.com/carpentries-incubator/deep-learning-intro/blob/testing_fashion_MNIST/fashion_MNIST.ipynb
A summary of my observations:
@tobyhodges thank you for working on this! This looks great actually 👍 Seems like you are having fun with keras :)
OK, it seems like the fashionMNIST dataset is a bit of a weird ML problem. It is very hard to overfit, and regularization only results in worse performance in the end ;) I tried some approaches to force it into overfitting (bigger models, a smaller dataset) but without good results.
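As an illustration of the "smaller dataset" approach mentioned above, here is a sketch of one way to pick a reproducible random subset of the training set. The function name `subsample_indices` and the subset size are hypothetical; this is not necessarily how the experiment in the notebook was done, and the sampling is not stratified by class.

```python
import random


def subsample_indices(n_total, n_keep, seed=0):
    """Return n_keep distinct, sorted indices drawn from range(n_total).

    A fixed seed makes the subset reproducible between runs.
    """
    if not 0 < n_keep <= n_total:
        raise ValueError("n_keep must be between 1 and n_total")
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_keep))
```

With NumPy arrays such as those returned by Keras, the subset could then be taken as `x_small = x_train[idx]` and `y_small = y_train[idx]`, where `idx = subsample_indices(len(x_train), 500)`.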
I am looking into other datasets, but oh my god there are so many CC-BY licensed datasets that are actually just crawled from the web.
> there are so many CC-BY licensed datasets that are actually just crawled from the web
I think we should put a callout into the episode that addresses this TBH
I am now looking into the Dollar Street dataset. It is from Gapminder, so it fits nicely with the Carpentries philosophy!
The dollar street dataset is 101GB! Is there a subset of it available?
@colinsauze I haven't found a subset, but I just downloaded it and if it works well with the lesson I will make a subset available with low-res pictures.
If the license permits, we can publish a subset on FigShare, Zenodo, or similar.
Check out my notebook using the Dollar Street dataset for episode 4.
This dataset really allows us to demonstrate all our points:
@colinsauze @psteinb @dsmits @tobyhodges what do you think? I think we should choose between FashionMNIST and dollar street
Use the Dollar Street dataset. Looking back at this conversation, this choice would not require too much adaptation of the lesson text.
Best, P
This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.
One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?
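A rough sketch of what loading and downsizing with imageio and scikit-image could look like, under the assumption that images are shrunk so their longer side is at most 64 pixels. The function names `target_shape` and `load_low_res` and the 64-pixel default are hypothetical, not taken from the notebook.

```python
def target_shape(height, width, max_side=64):
    """Return (new_height, new_width) with the longer side at most max_side.

    The aspect ratio is preserved and images are never upscaled.
    """
    scale = min(1.0, max_side / max(height, width))
    return max(1, round(height * scale)), max(1, round(width * scale))


def load_low_res(path, max_side=64):
    """Load an image with imageio and shrink it with scikit-image."""
    # Imported lazily so target_shape can be used without these packages.
    import imageio.v3 as iio
    from skimage.transform import resize

    image = iio.imread(path)  # array of shape (H, W) or (H, W, C)
    new_h, new_w = target_shape(image.shape[0], image.shape[1], max_side)
    # Giving only (rows, cols) leaves the number of channels unchanged.
    # resize returns a floating-point array.
    return resize(image, (new_h, new_w), anti_aliasing=True)
```

For example, `target_shape(100, 200, 50)` gives `(25, 50)`, so a 100x200 image would be reduced to 25x50 pixels.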
Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?
> This dollar street dataset sounds great to me, @svenvanderburg. Thanks so much for taking the time to explore it.
> One note: I saw you used PIL for loading and re-sizing the images. Would it be possible to switch over to using scikit-image (and imageio for the loading part)? That way we can point people to DC Image Processing if they want to learn more about handling image data in Python?
The data preprocessing will not be done in the course; the starting point of the episode will be to load the data in its preprocessed form, so that we can focus on deep learning instead of image data wrangling.
> Would you like to hold a coworking session/sprint to prepare the updated episode? Or prefer to draft something yourself then ask others to review?
I want to get this through, it has been hanging for so long now. So I will draft something today. But I will organise a new sprint soon to pick up the remaining maintenance issues. I hope that's OK?
I'm working on it here: https://github.com/carpentries-incubator/deep-learning-intro/pull/448 The data is located here: https://zenodo.org/records/10837090
I hope to continue this next week. If anyone wants to pick this up in the meantime, you are welcome! (Or start on a transfer learning episode, for example, which would be really nice to add now.)
Argh... I have very little time for this now. I plan to pick this up again on the 15th and 16th of April.
If you are happy for me to commit to your branch, @svenvanderburg, I can try to step in and make some further changes?
As just discussed here, it might be worth considering replacing CIFAR-10 with some other dataset.