Closed ttimbers closed 2 years ago
textbook feels more like course notes than a stand-alone text e.g.:
book does not take a very opinionated stance on some of the real-world skills needed to succeed in data science (e.g. Ch 4 took a strong and opinionated stance on the right way to effectively explain a visualization). would find the whole book significantly more valuable if this voice were present throughout.
some aspects of `dplyr` and `ggplot2` that should be updated to the "latest and greatest" approach (following the recent release of dplyr 1.0.0) over older alternatives
tying together an intro to the `tidyverse` (like R for Data Science) and a classic introduction to modelling (like ISLR) and showing the complete workflow. The book can be crisper on each step of the process and provide a view on the algorithmic, technical, and creative sides of data science. My favorite parts of the book came when the authors took a strong and opinionated stance (e.g. on the right way to effectively explain a visualization), and I would find the whole book significantly more valuable if this voice were present throughout.
[ ] We need to include exercises (that's easy -- just include tut + wksht) (+1 Reviewer E and B)
[ ] Additional resources (in all chapters) should be annotated to explain what students should hope to learn from reading them. Beyond annotating what is in each resource, also explain what topic they should focus on next. Add a "where to go from here" chapter? Rebut: we will instead do this better at the end of each chapter.
[ ] Expand the brief introduction paragraph in Chapter 1 to be more clear about what the book covers. Rev D suggests a whole chapter for this; rebuttal needed here. Rev A suggests just expanding the paragraph. (Reviewer C) Introduction seems most problematic: lots of things are covered at high speed and then repeated (reiterated?) in later chapters. E.g. for the grammar of graphics, at least some understanding of basic PRINCIPLES, rather than "these are the steps to follow," would be good.
[ ] Move `select` & `filter` to wrangling (chapter 3) and make sure this doesn't negatively affect chapter 2. [ ? ] We could put tables throughout the text to highlight the overall set of "options" for a given task (see the detailed comment on Ch 4 Viz for an example of what Rev D means).
The book makes a lot of reference to doing things on JHub / JLab. How should the reader follow these things? Some reviewers like the focus on JHub/JLab, some really don't; some want more reasoning behind it. (Reviewer C) Move Jupyter-related content to an appendix on Jupyter, and add an appendix section on RStudio? These are the two most common platforms for using R (would address Reviewer F's first global concern). This aligns with the comment from C: the version control sections rely on the Jupyter Lab Git extension, but RStudio has an extension too. Move the collaboration section to the end of the book. (+1 Reviewer E: consider removing references to JupyterHub or moving them to an appendix. All of the book content is highly relevant apart from any specific analytical platform, so this could increase the book's appeal to a wider audience.)
There are a lot of links to external material (videos, worksheets, etc.) that won't make sense for the general reader (i.e., readers who are not dsci100 students), and won't translate to print format. This will be helped a bit if we include tutorials and worksheets along with the book, but the videos, 3D plots, etc. still need fixing for the print format.
[ ] make a better title (it's too dry) - revisit title discussion we had earlier
rename "chapter learning objectives" to "skill building objectives"? (Reviewer E) I would move the point-form learning objectives to an appendix: they really are intended for other instructors rather than for learners. We will rebut
[ ] need to update to the newest dplyr and ggplot2 - Tiffany to check (reviewer A/E?)
[ ] fix the index (a glossary of functions and terms would be good) - also check whether this can be autogenerated by bookdown? Talk to Laura (CRC Press) for help on this? (+1 Reviewer E add glossary of main functions discussed to the appendix.)
[ ] (Reviewer C) Just teach one way to do things (although we can acknowledge there are many ways to do things): to do: find where we have shown alternatives and make sure we are clear the way we want them to do it with tidyverse
[ ] (Reviewer C) Datasets may need a bit more introduction. For the future, it may be good to stick with fewer datasets, to demonstrate how the same data can answer different questions. Rebuttal: generally, we like the richness of the data sets, but we need more information/context about them. Point out where things have been simplified? Maybe move chapter 2 to using only canlang. Go through each chapter and find where we can use just one data set. Chapter 4 needs multiple data sets by the way we have written it. Think about the clustering chapter: can we use a data set that we are already using?
(Reviewer C) Machine learning principles need a chapter of their own (e.g. training/testing datasets, metrics and evaluation of models, bias/variance, etc.). Explaining the logic behind tidymodels also would be great; at this stage the authors just "jump" into it.
(Reviewer C) Introduction to statistical inference feels out of place; after the cluster analysis chapter it feels a bit too late. Rebuttal: this is a beginner-level textbook, and the idea is that this is the next thing we would study in detail. We made a pedagogical decision for this book and audience to have a very low bar for math/stat/etc.
(Reviewer C) The book feels more like a reference manual or workshop materials. It is assumed that "theory" and foundations are covered somewhere else. Some fundamental knowledge and principles are missing. For example, the Regression I: K-nearest neighbours section misses an explanation of the method, and it is the first part on regression methods; it just starts with a working example, assuming that readers "get" what the authors are referring to.
(Reviewer C) Some basic introduction of data types is needed; e.g. the introduction refers to a data frame, but "what is this?????" Why should your readers know? Aligns with this comment from E: the "What is a data frame" section feels like it should have happened at least one chapter earlier.
(from Advanced R: https://adv-r.hadley.nz/vectors-chap.html#s3-atomic-vectors)
(Reviewer C) Pace: I think the material is too fast and too dense for someone to work through on their own. I think that in a lab setting, with a tutor there to answer questions and elaborate on the provided examples, it would be fine, but as a stand-alone learning resource, it would be a real challenge for most of the learners I’ve had. (This is particularly true of the chapter on Git, which is famously difficult to teach.) Rebut and point to the movement of Git chapter to end, and added explanations and conceptual diagrams throughout, also point out worksheet questions.
[ ] (Reviewer C) Consolidation: there are many places where it feels like the chapters were written independently and then stitched together – some things are covered twice, some things are used before they are defined, etc. Sometimes this is OK (I don't think the authors need to explain what the `min()` function does before using it), but in general, I think that going through the first five chapters and making a point-form list of every new concept or function introduced, then checking the ordering, would help a lot. Making the index will help us address this.
[ ] (Reviewer C) There are also places where it feels like the book is written for other instructors rather than for students – I have noted these in the PDF. Go to individual chapters and address where this has been highlighted by the reviewer.
[ ] (Reviewer C) I think the book has tremendous potential, but as I've said above, many readers (particularly those without a strong programming background) will find it extremely dense. More examples, more diagrams, less jargon, and fewer forward references or unexplained concepts and terms will help a lot. Go through the book and look for forward references or unexplained concepts; also, we are adding more diagrams (e.g. version control chapter). Rebut: worksheets allow them to revisit concepts in a more interactive and interesting way.
(Reviewer C) I recommend moving the material on web scraping much later in the book – learners are going to be struggling with the tidyverse at this point, and now you're adding HTML and CSS etc. to their cognitive load. I agree it needs to be covered, but not here. (+ Reviewer E) Given that understanding of students' backgrounds, the only section that felt very out of place to me is the part on web scraping. As I elaborate in 11 (see #102), the prevalence of Javascript pages makes this harder to do with rvest alone, and the increasing prevalence of APIs makes this less necessary. This also requires a lot of extra knowledge (e.g. of HTML and CSS) that seems beyond the presumed background. Rebut: this is extra/bonus material, and we will make this more clear; also, students love this content, so we don't want to remove it.
(Reviewer E) textbook feels more like course notes than a stand-alone text, e.g.:
[ ] (Reviewer E) book does not take a very opinionated stance on some of the real-world skills needed to succeed in data science (e.g. Ch 4 took a strong and opinionated stance on the right way to effectively explain a visualization). would find the whole book significantly more valuable if this voice were present throughout.
(Reviewer E) tying together an intro to the `tidyverse` (like R for Data Science) and a classic introduction to modelling (like ISLR) and showing the complete workflow. The book can be crisper on each step of the process and provide a view on the algorithmic, technical, and creative sides of data science. My favorite parts of the book came when the authors took a strong and opinionated stance (e.g. on the right way to effectively explain a visualization), and I would find the whole book significantly more valuable if this voice were present throughout.
(Reviewer E) I think the book could benefit by discussing feature engineering. I see a common misconception with early-stage data scientists that they can simply model off of whatever fields happen to appear in their table. This leaves a lot of power on the table. Critically thinking about what features to include in a model is particularly important with algorithms like KNN (a focus of the book), for which performance can be significantly degraded by irrelevant predictors. Additionally, this is a way to link the skills learned in the first and second halves of the book.
(Reviewer E) While the book may not explain many different algorithms, could do more to acknowledge their existence. readers might come away thinking that classification is KNN, regression is linear modeling, clustering is K-means, and inference is comparison of means. It might be good to list other common algorithms in each chapter and even briefly mention when alternatives might be preferred. Will be addressed by one of the action items above
(Reviewer E) Chapter 5 felt somewhat misplaced in terms of placement and content.
(Reviewer E) Some of the examples do a great job of “spiraling” and iteratively mixing new and previously learned concepts step-by-step. This is especially true in the ggplot2 section. This is a fantastic approach, and I would love to see it applied more consistently. Unfortunately, many parts of the book function in silos; for example, none of the modeling sections rely much on previous teaching of dplyr for EDA which may leave students wondering why they even learned that content. Rebuttal: we don't do this to focus on the modelling stuff, but this happens in the worksheets.
[ ] Create a putting it all together chapter (build off this: https://github.com/ttimbers/breast_cancer_predictor), minimum for classification, maximum for all - important to put it after the modelling chapters (leave some complexities in these examples, like cleaning data, etc)
(Reviewer E) I also do notice that all examples are definitely “toy” examples and dodge a good bit of the real world complexity of data analysis. This is likely by design because it helps students focus on key concepts, but it might be worth acknowledging some of the simplifications (e.g. data cleaning, feature engineering, checking for missing data, constraining assumptions of different algorithms) so students are not blindsided when they encounter these in practice. Will be addressed by one of the action items/rebuttals above
(Reviewer E) This book is definitely on the shorter side, but that may be particularly appealing to students and make it seem very manageable for use in either a course or self-study. That said, throughout this review, I do suggest some other potential types of content, and I do not think making the book longer to cover more ground would be a problem either. If the book does grow in length, the author could mitigate any downside by laying out different “learning paths” or highlighting required versus optional chapters in the introduction. Will be addressed by one of the action items/rebuttals above
[ ] Reviewer D comment re: inductive vs deductive approach - to do: add a summary of where we are going for each walkthrough example in the book (especially 6, 7, 8).
Revision of synthesis into action items. We might want to pull these off into individual issues we can close as we address them, and assign folks to them.
Revise chapter 1 (introduction): Expand the brief introduction paragraph in chapter 1 to be more clear about what the book covers. Specifically: move `select` & `filter` to wrangling (chapter 3) and make sure this doesn't negatively affect chapter 2. This addresses comments by Rev D, Rev A & C. Need to rebut D's ask for a whole new chapter here. MAJOR
Move version control chapter to the end of the book and revise it to be more conceptual. Add more conceptual content to the version control chapter, and diagrams (like these), and move the screenshots to a screencast with a stable link. Might want to consider doing both a Jupyter Git extension demo and an RStudio one. This addresses comments made by Rev C & F. MAJOR
Simplify and better explain data sets. Where we can, provide more information/context about the data sets (maybe in a call-out box or something?). Also, make it clear where things have been simplified and why (so we can focus on the data science method we are teaching). At a minimum, we need to explicitly state that data science cannot be done without a deep understanding of data and domain, and that we are approaching things the way we are to teach data science; IRL, data science should not be done without a domain expert, or alternatively, it is common to practice data science in your own domain of expertise. Go through each chapter and find where we can use just one data set. Idea: see if we can have chapter 2 only use the canlang data sets (might not work for web scraping, but maybe there's a more related data set?). Note: chapter 4 needs multiple data sets by the way we have written it. Question: think about the clustering chapter – can we use a data set we are already using? This addresses comments made by Rev C & E, but we do need to also generate a rebuttal here stating why we have chosen a rich set of data sets for this book. MAJOR
Draft a putting-it-all-together chapter. Create a putting-it-all-together chapter, where we demonstrate an entire DS workflow, from reading data, to EDA, to modelling, and communicating the results. We can build off a project Tiffany has created for MDS: https://github.com/ttimbers/breast_cancer_predictor. At a minimum we do this for a classification example; at a maximum we do this for all modelling methods in the book. Or some intermediate goal. MAJOR
Move Jupyter-related content to a system setup chapter: Rename "Moving to your own machine" to "System setup" (or something related, like "Setting up your computer"?) and move any Jupyter-related content there. We can then link to it from other chapters if needed. Bonus: can we also explain how to get set up and use Rmd with RStudio, so our book can support both major DS literate code document platforms? Or at a minimum link out to other good resources on this (risk: they don't come back to us...). UI (how to use Jupyter & Rmd) stuff becomes videos with stable links. Make sure videos are general enough for the book, and not specific to this course. This addresses comments made by Rev C & E. MAJOR
Revise supervised learning chapters.
This will address comments made by Rev C & E. MAJOR
Fix/improve index. We need a robust index for this book. Check whether this can be autogenerated by `bookdown`? Talk to Laura (CRC Press) for help on this if we need to. Also, once we create the index, we want to create a glossary of the main terms and functions. Let's consider using the `glossario` R package for this, and borrowing from the Carpentries English glossary? This addresses comments made by Rev E. MINOR/MAJOR?
Ensure book is written for intended audience. Read through reviewer C's annotations and address highlighted parts where book appears to be written for other instructors rather than for students. This will address comments made by Rev C. MINOR/MAJOR?
Summaries of where we are going. Read through the book, and ensure there is a summary of where we are going for each walkthrough example in the book (especially 6, 7, 8). This will address comments made by Rev D. MINOR/MAJOR
Clarify examples from the universe of possibilities. Read through the book, and clarify where what we are discussing is meant to illustrate an example versus the universe of possibilities – it is worth being more specific about things that are just high-level examples (e.g. KNN versus all classification algorithms). For example, make it clear which parts of chapter 7 are relevant to classification in general, and which are relevant to just K-NN. MINOR/MAJOR
Stable domain and links. Get a domain, that we can use to come up with stable links for linking to videos and worksheets from the textbook. MINOR
Exercises/worksheets. Create repository for just worksheets to be associated with the textbook. (Bonus 1: Perhaps we can use Binder or a Public JupyterHub to make them interactive? Bonus 2: Add GitHub Actions to the repo to use Jupytext to autogenerate Rmd's of the worksheets for folks who'd prefer to work with Rmd instead of Jupyter?) Point to the relevant worksheet at the end of each chapter using a stable link. Remember, when we point to the worksheets, to also point to the system setup chapter so they can first follow the instructions of setting up Jupyter on their own machine. (+1 Reviewer E and B) MINOR
Improve additional resources sections. Add a few sentences to give context to each additional resource we share. This means going beyond just annotating them (we should also do that) to also explain what topic readers should focus on next. Rebut: we will not add a where-to-go-from-here chapter, asked for by reviewer ?, but will instead do a better job at the end of each chapter, on a topic-by-topic basis. MINOR
Update to the newest `dplyr` and `ggplot2`. Check the most recent updates to `dplyr` and `ggplot2` and make sure we are using the most up-to-date syntax in the book. If this needs to change, also change it in the worksheets. This addresses comments made by Rev A & E. MINOR
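For concreteness, one of the dplyr 1.0.0 changes this item likely covers is the move from the scoped verbs (`summarise_at()`, `mutate_if()`, etc., now superseded) to `across()`. A minimal sketch using the built-in `mtcars` data; the actual verbs to update in the book still need to be audited:

```r
library(dplyr)

# Pre-1.0.0 style: scoped verbs like summarise_at() (superseded)
old_way <- mtcars %>%
  summarise_at(vars(mpg, hp), mean)

# dplyr >= 1.0.0 style: across() inside a regular verb
new_way <- mtcars %>%
  summarise(across(c(mpg, hp), mean))
```

Both return the same one-row summary; `across()` is the "latest and greatest" form the reviewers are pointing at.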
Ensure we just teach one way to do things. Read through the book and check that we only teach one way to do things (although we can acknowledge there are many ways): specifically, find where we have shown alternatives and make sure we are clear that the way we want them to do it is with the `tidyverse`. This addresses comments made by Rev C. MINOR
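As a hypothetical example of the kind of duplication to hunt for (the specific passages still need to be found): wherever the book shows both base R subsetting and the equivalent tidyverse verb for the same task, we would keep only the tidyverse version and merely acknowledge the alternative. Sketch with the built-in `mtcars` data:

```r
library(dplyr)

# Base R alternative (acknowledge it exists, but don't teach it):
base_way <- mtcars[mtcars$mpg > 30, ]

# The one tidyverse way the book standardizes on:
tidy_way <- mtcars %>% filter(mpg > 30)

nrow(tidy_way)  # 4 cars have mpg above 30
```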
Improve regression introduction. Lead into regression more gently and rely less on past knowledge about classification (8.3 is a bit bizarre; it should be combined with 8.4). This will address comments made by Rev C. MINOR
Add vector data types explanations. Add base R vector types (logical, integer, double, character) introduction and explanation to section 3.3.2 ("What is a vector?") in chapter 3. Add factor vector type explanation in chapter 4 (visualization chapter), when we need it. This will address comments made by Rev C & E. MINOR
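A sketch of the minimal content this addition might cover (following the Advanced R chapter linked above); the exact examples for the book are still to be written:

```r
# The four common atomic vector types in R
lgl <- c(TRUE, FALSE)        # logical
int <- c(1L, 2L, 3L)         # integer
dbl <- c(1.5, 2.5)           # double
chr <- c("data", "science")  # character

typeof(lgl)  # "logical"
typeof(dbl)  # "double"

# For chapter 4: factors are built on integer vectors,
# with a levels attribute holding the categories
f <- factor(c("low", "high", "low"))
typeof(f)  # "integer"
levels(f)  # "high" "low" (sorted alphabetically by default)
```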
Forward references & unexplained concepts. Go through book and look for forward references or unexplained concepts. MINOR
Make it clear web scraping is optional / add APIs. Add a note to make it clear that web scraping is optional. Fix the wrong definition of web scraping. Add a subsection on web APIs (we could use this one: https://cran.r-project.org/web/packages/cancensus/vignettes/cancensus.html). MINOR
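For the new web API subsection, the key idea to convey is that APIs return structured data (usually JSON) that can be parsed directly, with no HTML/CSS knowledge needed. A self-contained sketch with a made-up payload and toy numbers (the real example would presumably follow the cancensus vignette linked above):

```r
library(jsonlite)

# A made-up JSON payload of the kind a web API might return
json <- '{"languages": [
  {"name": "English", "speakers": 100},
  {"name": "French",  "speakers": 50}
]}'

# fromJSON() simplifies the array of objects into a data frame
parsed <- fromJSON(json)
parsed$languages$name  # "English" "French"
```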
These things we are purely rebutting, not changing.
Tying together an intro to the `tidyverse` (like R for Data Science) and a classic introduction to modelling (like ISLR) and showing the complete workflow. The book can be crisper on each step of the process and provide a view on the algorithmic, technical, and creative sides of data science. My favorite parts of the book came when the authors took a strong and opinionated stance (e.g. on the right way to effectively explain a visualization), and I would find the whole book significantly more valuable if this voice were present throughout.
@ttimbers @leem44 this is done now, right?
Closing; we can re-open if needed.
@trevorcampbell @leem44 - let's list the global/big picture content revisions asked for in review here. When adding please do not duplicate things, just edit the list noting that it was asked for by X number of reviewers.