Switch to a better dataset

colinsauze commented 5 years ago

The lesson ideally needs to use one dataset throughout. It's currently a bit of a mixture of gapminder, World Bank, handwritten digits and randomly generated data.

Suggestions from Carpentry Connect Manchester include:

Edinburgh cycle data: https://edinburghcyclehire.com/open-data
Possibly coupled with weather data: https://www.metoffice.gov.uk/research/climate/maps-and-data/data/haduk-grid/data-formats
Seattle cycling data: https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/
Wine data https://archive.ics.uci.edu/ml/datasets/Wine
Titanic https://www.kaggle.com/c/titanic
Breast cancer data from sklearn
Kaggle competition datasets

vinisalazar commented 3 years ago

I visited some of these links, here are some quick impressions:

Edinburgh cycle data: https://edinburghcyclehire.com/open-data

Looks interesting, but it's mainly time- and location-based. I would probably favor a dataset with mostly count data, and maybe some categorical variables.

Possibly coupled with weather data: https://www.metoffice.gov.uk/research/climate/maps-and-data/data/haduk-grid/data-formats

These seem to be available in netCDF only. Although Python has excellent tools for dealing with the format, it seems like an unnecessary cognitive load.

Seattle cycling data: https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/

Same as the Edinburgh data. The URL does not seem very maintainable. Also, I'd avoid presenting a third-party analysis (although the post is a really good one) at the start / setup of the lesson, as it may distract learners. It could perhaps be presented afterwards.

Wine data https://archive.ics.uci.edu/ml/datasets/Wine

I quite like this one, especially the fact that it is deposited in the UCI Machine Learning Repository, which is very well known and seems very stable. The only downsides are the lack of categorical variables and the lack of a header with column names in the raw data file.
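
One possible way around the missing header, sketched here under the assumption that scikit-learn (0.23 or later, for `as_frame`) and pandas are installed: scikit-learn bundles a copy of the UCI Wine data with named columns, so no header has to be typed in by hand. This is only an illustration, not something the lesson currently does.

```python
from sklearn.datasets import load_wine

# scikit-learn ships a copy of the UCI Wine data, including column names
wine = load_wine(as_frame=True)
df = wine.frame  # feature columns plus a "target" column with the cultivar

print(df.shape)
print(list(df.columns))
```

The bundled copy also avoids any dependency on an external URL staying available.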

Titanic https://www.kaggle.com/c/titanic

This dataset seems very appropriate, but I dislike having to accept the Kaggle Terms of Service in order to download it. It would be much nicer to simply have a URL or repository that can be downloaded with wget or some other tool. A second disadvantage is that it is already split into training and test datasets. I guess it would be nicer to have a 'full' dataset and introduce the concept of splitting it later in the lesson.
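
As a rough sketch of what splitting a 'full' dataset inside the lesson could look like, assuming scikit-learn is available. The Wine data bundled with scikit-learn is used purely as a stand-in, and the 80/20 ratio and `random_state` value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Stand-in for a "full" dataset that has not been split in advance
X, y = load_wine(return_X_y=True)

# Hold back 20% of the rows for testing; stratify keeps class
# proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_train), len(X_test))
```

Starting from one unsplit file lets learners see why the split exists before they rely on it.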

Breast cancer data from sklearn

This is the dataset I like the most from that list. Being a biologist, I am inevitably biased towards using it :) I also really like the fact that it is already built into scikit-learn.
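
For reference, a minimal sketch of loading the built-in copy, assuming scikit-learn 0.23 or later (for `as_frame`) and pandas are installed. No download step or account is needed.

```python
from sklearn.datasets import load_breast_cancer

# The dataset ships with scikit-learn, so this works offline
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

print(X.shape)
print(list(data.target_names))
```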

Kaggle competition datasets

Same comment as the Titanic dataset. One that I really like is Palmer Penguins!

colinsauze commented 3 years ago

Thinking about the requirements for a dataset, it ideally needs to work with all of the following:

linear regression
logarithmic regression
clustering
(non deep learning) neural networks
unsupervised dimensionality reduction such as PCA or t-SNE
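
One way to screen a candidate dataset against most of these requirements is to run each technique on it once. The sketch below does that with the Wine data bundled with scikit-learn, which is only a stand-in and not a decision about the lesson. The column choices (`alcohol`, `proline`), cluster count and network size are arbitrary assumptions, and logarithmic regression is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target
X_scaled = StandardScaler().fit_transform(X)

# Linear regression: predict one numeric column from another
reg = LinearRegression().fit(X[["alcohol"]], X["proline"])

# Clustering: group samples without using the labels
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Dimensionality reduction: project 13 features down to 2
embedding = PCA(n_components=2).fit_transform(X_scaled)

# Small (non deep learning) neural network classifier
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_scaled, y)

# Accuracy on the training data itself, so only a smoke test
print(clf.score(X_scaled, y))
```

A dataset that gets through all four steps without special handling is at least a plausible candidate; whether the results are interesting to teach with is a separate question.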

Assuming the licensing permits we can always redistribute the dataset along with this lesson (as is currently being done). This still lets us use the wget/curl method to download while having a stable URL.
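
A Python equivalent of the wget/curl step might look like the sketch below. The URL is a hypothetical placeholder and `fetch_dataset` is a name made up for this example; neither refers to anything that exists in the lesson repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical URL: a placeholder for wherever the lesson hosts its copy
DATA_URL = "https://example.org/lesson/data/dataset.csv"


def fetch_dataset(url: str, dest: Path) -> Path:
    """Download url to dest unless a local copy already exists."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest


# Usage (not run here because DATA_URL is a placeholder):
# path = fetch_dataset(DATA_URL, Path("data/dataset.csv"))
```

Skipping the download when the file is already present keeps repeated runs quick and lets a workshop continue if the network drops.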

I also like the idea of the Palmer Penguins. It's being used in the introduction to deep learning lesson too, and I envisage that these two lessons should be complementary.

bkmgit commented 3 years ago

An interesting data set:

https://github.com/MedMNIST/MedMNIST

This could be used in the dimensionality reduction lesson to complement the MNIST dataset: https://github.com/carpentries-incubator/machine-learning-novice-sklearn/pull/15

vinisalazar commented 3 years ago

That MedMNIST dataset is quite interesting indeed. However, after reflecting and some conversations with other members of the community, I would tend to avoid medical datasets (including the Breast Cancer data that I endorsed in a previous comment), as people can be sensitive to them.

colinsauze commented 3 years ago

In the long term, I do wonder whether we could have custom versions of this lesson that use different datasets. Then a medical group could use a version with medical data and another group could use their own dataset. But this would add a lot of complexity, and I think we've got many more basic problems to solve first.

Just to add another dataset into the list, there is a weather prediction dataset (https://github.com/florian-huber/weather_prediction_dataset) which is being used by the Deep Learning incubator lesson.

bkmgit commented 3 years ago

There are a number of example datasets used for educational purposes. Assuming the lesson becomes part of Data Carpentry, one should expect at least a social science track, an ecology track, a genomics track and possibly a geospatial track. Astronomy, economics and image processing tracks are also in development.

Minor changes can be accommodated by selecting options when forking the repository to prepare a lesson, in the same way options are chosen to create a workshop website.

carpentries-incubator / machine-learning-novice-sklearn

Switch to a better dataset #2