Here are some of my notes and possible revisions from the pilot workshop. We can discuss these in person before implementing any changes.
Agenda and slides:
[x] Look for places to add more interactivity in the initial slides. Could ask about ML examples in their area after showing the examples in the slides.
[ ] Discuss the relationship between the classifiers we present and their regression analogs.
[x] Could expand the setup guide and ask participants to try installing the software before the workshop.
[x] Use the same dataset in the initial slides, notebook, and software example to avoid having to explain multiple datasets early on.
[x] Show the notebook at the end of the workshop and illustrate the ML pipeline in the software.
[x] Define folds on a slide.
[x] Add more examples of why train/validate/test split is needed. Move the data splitting and cross-validation discussion even earlier. Perhaps introduce overfitting at this point.
[ ] Note the other cross-validation strategies in the slides and link to the vocabulary guide.
[x] Add discussion in GitHub Issues as another next step in the slides.
[x] Annotate the y and y hat notation in logistic regression.
[x] Consider showing an example of how a trained logistic regression model makes a prediction y hat.
[ ] Add useful discussion points to the notes section of the slides to help new instructors lead the workshop. (#20)
[x] Update gender in decision tree example.
[x] Provide hints at which datasets and settings to use to explore the questions.
[x] After the example ML papers, go into more detail for one: features, class labels, classifier, what was learned and why it matters.
[ ] Work on a correspondence between a real biological problem and a 2d toy example.
[x] Reference Google crash course for a possible ordering.
[x] Another possible ordering: ML motivation with examples, test out 1-2 classifiers in the software, learn about them in more depth in the slides, revisit classifiers in software with knowledge of the hyperparameters, overfitting/underfitting and cross-validation, compare selecting on the training set only (hold out 0%) versus cross-validation, then finish with data loading and other classifiers.
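The hold-out versus cross-validation comparison (and the y hat prediction from a trained logistic regression model) could be demonstrated live in a notebook. A minimal sketch, assuming scikit-learn (which the ml4bio software wraps) and a synthetic dataset in place of the workshop data:

```python
# Sketch: compare a single hold-out split with 5-fold cross-validation,
# then show the prediction y hat from a trained logistic regression model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the workshop dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Single train/test split (analogous to holding out a fixed fraction)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"hold-out accuracy: {model.score(X_test, y_test):.2f}")

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(f"cross-validation accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# The predicted label y hat and class probabilities for one test point
y_hat = model.predict(X_test[:1])
p_hat = model.predict_proba(X_test[:1])
print("y hat:", y_hat[0], "probabilities:", p_hat[0].round(2))
```

The fold-to-fold spread in the cross-validation scores also motivates the "define folds on a slide" item above.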
Software:
[x] Mac OS opens the wrapper script in an editor instead of executing it. Need alternative instructions for launching the software.
[ ] Need more guidance for running the software on Windows when Anaconda is not on the PATH. (#21)
[ ] Determine why Windows does not launch the GUI the first time the batch script is run. (#21)
[ ] Add a note about the warning Windows shows about running a batch script from an unknown publisher. (#21)
[x] Add note about common NumPy or other warnings that can be safely ignored.
[x] Clear the unlabeled data after loading a new labeled dataset.
[ ] Neural networks do not provide a class weight. This is because scikit-learn does not implement it yet. See pull request https://github.com/scikit-learn/scikit-learn/pull/11723 for progress. (https://github.com/gitter-lab/ml4bio/issues/8)
[x] Create an issue with the error message a student received on Mac OS. (#15)
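The class weight limitation noted above can be confirmed directly from the estimator signatures. A small sketch, assuming scikit-learn:

```python
# Sketch: check which scikit-learn classifiers expose a class_weight
# constructor parameter; MLPClassifier (neural network) does not,
# pending the upstream pull request linked above.
import inspect

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


def has_class_weight(estimator_cls):
    """Return True if the estimator's constructor accepts class_weight."""
    params = inspect.signature(estimator_cls.__init__).parameters
    return "class_weight" in params


print("LogisticRegression:", has_class_weight(LogisticRegression))
print("MLPClassifier:", has_class_weight(MLPClassifier))
```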
Data and guides:
[x] Document what pre-processing was done in the neurotoxicity dataset to reduce the features to 1000 genes. This is in the paper.
[ ] Update the performance guide to explain the performance of a random classifier and how the area depends on the class imbalance. (#22)
[x] Consider adding a toy dataset that is imbalanced and non-linearly separable to help explore different performance measures.
[x] Create a data cleaning and pre-processing guide with examples (Data Carpentry resources?).
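For the performance guide item above, the random-classifier baseline and its dependence on class imbalance could be shown numerically. A sketch assuming scikit-learn and a synthetic imbalanced labeling (exact values vary slightly with the random seed):

```python
# Sketch: a random classifier's ROC AUC stays near 0.5 regardless of
# class imbalance, while its precision-recall area tracks the fraction
# of positive examples (the class prevalence).
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n, prevalence = 100_000, 0.1          # 10% positive class
y_true = (rng.random(n) < prevalence).astype(int)
y_score = rng.random(n)               # random scores, no signal

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)
print(f"AUROC (expect ~0.5): {auroc:.3f}")
print(f"AUPRC (expect ~{prevalence}): {auprc:.3f}")
```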
@fmu2 completed almost all of these suggestions from the 2018-08-23 workshop. I created new specific issues for the remaining comments we may want to address. The others can be safely ignored in my opinion.