Closed ttimbers closed 2 years ago
textbook feels more like course notes than a stand-alone text e.g.:
book does not take a very opinionated stance on some of the real-world skills needed to succeed in data science (e.g. Ch 4 took a strong and opinionated stance on the right way to effectively explain a visualization). would find the whole book significantly more valuable if this voice were present throughout.
some aspects of `dplyr` and `ggplot2` that should be updated to the "latest and greatest" approach (following the recent release of dplyr 1.0.0) over older alternatives
tying together an intro to the `tidyverse` (like R for Data Science) and a classic introduction to modelling (like ISLR) and showing the complete workflow. The book can be crisper on each step of the process and provide a view on the algorithmic, technical, and creative sides of data science. My favorite parts of the book came when the authors took a strong and opinionated stance (e.g. on the right way to effectively explain a visualization), and I would find the whole book significantly more valuable if this voice were present throughout.
[ ] We need to include exercises (that's easy -- just include tut + wksht) (+1 Reviewer E and B)
[ ] Additional resources (in all chapters) should be annotated to explain what students should hope to learn from reading them. Beyond annotating what is in each resource, also explain what topic they should focus on next. Add a "where to go from here" chapter? Rebut: we will instead do this better at the end of each chapter.
[ ] Expand the brief introduction paragraph in Chapter 1 to be more clear about what the book covers. Rev D suggests a whole chapter for this; rebuttal needed here. Rev A suggests just expanding the paragraph. (Reviewer C) Introduction seems most problematic: lots of things are covered at high speed and then repeated (reiterated?) in later chapters. E.g. for the grammar of graphics, at least some understanding of basic PRINCIPLES, rather than "these are the steps to follow," would be good.
[ ] Move `select` & `filter` to wrangling (chapter 3) and make sure this doesn't negatively affect chapter 2. [ ? ] We could put tables throughout the text to highlight the overall set of "options" for a given task (see the detailed comment on Ch 4 Viz for an example of what Rev D means).
The book makes a lot of reference to doing things on JHub / JLab. How should the reader follow these things? Some reviewers like the focus on JHub/JLab, some really don't; some want more reasoning behind it. (Reviewer C) Move Jupyter-related content to an appendix on Jupyter, and add an appendix section on RStudio? These are the two most common platforms for using R (would address Reviewer F's first global concern). This aligns with the comment from C: the version control sections rely on the Jupyter Lab Git extension, but RStudio has an extension too. Move the collaboration section to the end of the book. (+1 Reviewer E: consider removing references to JupyterHub or moving them to an appendix. All of the book content is highly relevant apart from any specific analytical platform, so this could increase the book's appeal to a wider audience.)
There are a lot of links to external material (videos, worksheets, etc.) that won't make sense for the general reader (i.e., readers who are not dsci100 students), and won't translate to print format. This will be helped a bit if we include tutorials and worksheets along with the book, but the videos, 3D plots, etc. still need fixing for the print format.
[ ] make a better title (it's too dry) - revisit title discussion we had earlier
rename "chapter learning objectives" to "skill building objectives"? (Reviewer E) I would move the point-form learning objectives to an appendix: they really are intended for other instructors rather than for learners. We will rebut
[ ] need to update to the newest dplyr and ggplot2 - Tiffany to check (reviewer A/E?)
[ ] fix the index (a glossary of functions and terms would be good) - also check whether this can be autogenerated by bookdown? Talk to Laura (CRC Press) for help on this? (+1 Reviewer E add glossary of main functions discussed to the appendix.)
[ ] (Reviewer C) Just teach one way to do things (although we can acknowledge there are many ways to do things): to do: find where we have shown alternatives and make sure we are clear the way we want them to do it with tidyverse
[ ] (Reviewer C) Datasets may need a bit more introduction. For the future, it may be good to stick with fewer datasets, to demonstrate how the same data can answer different questions. Rebuttal: generally, we like the richness of the data sets, but we need more information/context about them. Point out where things have been simplified? Maybe move chapter 2 to using only canlang. Go through each chapter and find where we can use just one data set. Chapter 4 needs multiple data sets by the way we have written it. Think about the clustering chapter: can we use a data set that we are already using?
(Reviewer C) Machine learning principles need a chapter of their own (e.g. training/testing datasets, metrics and evaluation of models, bias/variance, etc.). Explaining the logic behind tidymodels also would be great; at this stage the authors just "jump" into it.
(Reviewer C) Introduction to statistical inference feels out of place; after the cluster analysis chapter it feels a bit too late. Rebuttal: this is a beginner-level textbook, and the idea is that this is the next thing we would study in detail. We made a pedagogical decision for this book and audience to have a very low bar for math/stat/etc.
(Reviewer C) The book feels more like a reference manual or workshop materials. It is assumed that "theory" and foundations are covered somewhere else. Some fundamental knowledge and principles are missing. For example, the Regression I: K-nearest neighbours section misses an explanation of the method, and it is the first part on regression methods; it just starts with a working example, assuming that readers "get" what the authors are referring to.
(Reviewer C) Some basic introduction of data types is needed; e.g. the introduction refers to a data frame, but "what is this?????" Why should your readers know? Aligns with this comment from E: the "What is a data frame" section feels like it should have happened at least one chapter earlier.
(from Advanced R: https://adv-r.hadley.nz/vectors-chap.html#s3-atomic-vectors)
(Reviewer C) Pace: I think the material is too fast and too dense for someone to work through on their own. I think that in a lab setting, with a tutor there to answer questions and elaborate on the provided examples, it would be fine, but as a stand-alone learning resource, it would be a real challenge for most of the learners I’ve had. (This is particularly true of the chapter on Git, which is famously difficult to teach.) Rebut and point to the movement of Git chapter to end, and added explanations and conceptual diagrams throughout, also point out worksheet questions.
[ ] (Reviewer C) Consolidation: there are many places where it feels like the chapters were written independently and then stitched together – some things are covered twice, some things are used before they are defined, etc. Sometimes this is OK (I don't think the authors need to explain what the `min()` function does before using it), but in general, I think that going through the first five chapters and making a point-form list of every new concept or function introduced, then checking the ordering, would help a lot. Making the index will help us address this.
[ ] (Reviewer C) There are also places where it feels like the book is written for other instructors rather than for students – I have noted these in the PDF. Go to individual chapters and address where this has been highlighted by the reviewer.
[ ] (Reviewer C) I think the book has tremendous potential, but as I've said above, many readers (particularly those without a strong programming background) will find it extremely dense. More examples, more diagrams, less jargon, and fewer forward references or unexplained concepts and terms will help a lot. Go through the book and look for forward references or unexplained concepts; also, we are adding more diagrams (e.g. version control chapter). Rebut: worksheets allow them to revisit concepts in a more interactive and interesting way.
(Reviewer C) I recommend moving the material on web scraping much later in the book – learners are going to be struggling with the tidyverse at this point, and now you're adding HTML and CSS etc. to their cognitive load. I agree it needs to be covered, but not here. (+ Reviewer E) Given that understanding of students' backgrounds, the only section that felt very out of place to me is the part on web scraping. As I elaborate in 11 (see #102), the prevalence of Javascript pages makes this harder to do with rvest alone, and the increasing prevalence of APIs makes this less necessary. This also requires a lot of extra knowledge (e.g. of HTML and CSS) that seems beyond the presumed background. Rebut: this is extra/bonus material, and we will make this more clear; also, students love this content, so we don't want to remove it.
(Reviewer E) textbook feels more like course notes than a stand-alone text, e.g.:
[ ] (Reviewer E) book does not take a very opinionated stance on some of the real-world skills needed to succeed in data science (e.g. Ch 4 took a strong and opinionated stance on the right way to effectively explain a visualization). would find the whole book significantly more valuable if this voice were present throughout.
(Reviewer E) tying together an intro to the `tidyverse` (like R for Data Science) and a classic introduction to modelling (like ISLR) and showing the complete workflow. The book can be crisper on each step of the process and provide a view on the algorithmic, technical, and creative sides of data science. My favorite parts of the book came when the authors took a strong and opinionated stance (e.g. on the right way to effectively explain a visualization), and I would find the whole book significantly more valuable if this voice were present throughout.
(Reviewer E) I think the book could benefit by discussing feature engineering. I see a common misconception with early-stage data scientists that they can simply model off of whatever fields happen to appear in their table. This leaves a lot of power on the table. Critically thinking about what features to include in a model is particularly important with algorithms like KNN (a focus of the book), for which performance can be significantly degraded by irrelevant predictors. Additionally, this is a way to link the skills learned in the first and second halves of the book.
(Reviewer E) While the book may not explain many different algorithms, could do more to acknowledge their existence. readers might come away thinking that classification is KNN, regression is linear modeling, clustering is K-means, and inference is comparison of means. It might be good to list other common algorithms in each chapter and even briefly mention when alternatives might be preferred. Will be addressed by one of the action items above
(Reviewer E) Chapter 5 felt somewhat misplaced in terms of placement and content.
(Reviewer E) Some of the examples do a great job of “spiraling” and iteratively mixing new and previously learned concepts step-by-step. This is especially true in the ggplot2 section. This is a fantastic approach, and I would love to see it applied more consistently. Unfortunately, many parts of the book function in silos; for example, none of the modeling sections rely much on previous teaching of dplyr for EDA which may leave students wondering why they even learned that content. Rebuttal: we don't do this to focus on the modelling stuff, but this happens in the worksheets.
[ ] Create a putting it all together chapter (build off this: https://github.com/ttimbers/breast_cancer_predictor), minimum for classification, maximum for all - important to put it after the modelling chapters (leave some complexities in these examples, like cleaning data, etc)
(Reviewer E) I also do notice that all examples are definitely “toy” examples and dodge a good bit of the real world complexity of data analysis. This is likely by design because it helps students focus on key concepts, but it might be worth acknowledging some of the simplifications (e.g. data cleaning, feature engineering, checking for missing data, constraining assumptions of different algorithms) so students are not blindsided when they encounter these in practice. Will be addressed by one of the action items/rebuttals above
(Reviewer E) This book is definitely on the shorter side, but that may be particularly appealing to students and make it seem very manageable for use in either a course or self-study. That said, throughout this review, I do suggest some other potential types of content, and I do not think making the book longer to cover more ground would be a problem either. If the book does grow in length, the author could mitigate any downside by laying out different “learning paths” or highlighting required versus optional chapters in the introduction. Will be addressed by one of the action items/rebuttals above
[ ] Reviewer D comment re: inductive vs deductive approach - to do: add a summary of where we are going for each walkthrough example in the book (especially 6, 7, 8).
Revision of synthesis into action items. We might want to pull these off into individual issues we can close as we address them, and assign folks to them.
Revise chapter 1 (introduction): Expand the brief introduction paragraph in chapter 1 to be more clear about what the book covers. Specifically: move `select` & `filter` to wrangling (chapter 3) and make sure this doesn't negatively affect chapter 2. This addresses comments by Rev D, Rev A & C. Need to rebut D's ask for a whole new chapter here. MAJOR
Move version control chapter to the end of the book and revise it to be more conceptual. Add more conceptual content to the version control chapter, and diagrams (like these), and move the screenshots to a screencast with a stable link. Might want to consider doing both a Jupyter Git extension demo and an RStudio one. This addresses comments made by Rev C & F. MAJOR
Simplify and better explain data sets. Where we can, provide more information/context about the data sets (maybe in a call-out box or something?). Also, make it clear where things have been simplified and why (so we can focus on the data science method we are teaching). At a minimum, we need to explicitly state that data science cannot be done without a deep understanding of data and domain, and that we are approaching things the way we are to teach data science; IRL, data science should not be done without a domain expert, or alternatively, it is common to practice data science in your own domain of expertise. Go through each chapter and find where we can use just one data set. Idea: see if we can have chapter 2 only use the canlang data sets (might not work for web scraping, but maybe there's a more related data set?). Note: chapter 4 needs multiple data sets by the way we have written it. Question: think about the clustering chapter – can we use a data set we are already using? This addresses comments made by Rev C & E, but we do need to also generate a rebuttal here stating why we have chosen a rich set of data sets for this book. MAJOR
Draft a putting-it-all-together chapter. Create a putting-it-all-together chapter, where we demonstrate an entire DS workflow, from reading data, to EDA, to modelling, and communicating the results. We can build off a project Tiffany has created for MDS: https://github.com/ttimbers/breast_cancer_predictor. At a minimum we do this for a classification example; at a maximum we do this for all modelling methods in the book. Or some intermediate goal. MAJOR
Move Jupyter-related content to a system setup chapter: Rename "Moving to your own machine" to "System setup" (or something related, like "Setting up your computer"?) and move any Jupyter-related content there. We can then link to it from other chapters if needed. Bonus: can we also explain how to get set up and use Rmd with RStudio, so our book can support both major DS literate code document platforms? Or at a minimum link out to other good resources on this (risk: they don't come back to us...). UI (how to use Jupyter & Rmd) stuff becomes videos with stable links. Make sure videos are general enough for the book, and not specific to this course. This addresses comments made by Rev C & E. MAJOR
Revise supervised learning chapters.
This will address comments made by Rev C & E. MAJOR
Fix/improve index. We need a robust index for this book. Check whether this can be autogenerated by `bookdown`? Talk to Laura (CRC Press) for help on this if we need to. Also, once we create the index, we want to create a glossary of the main terms and functions. Let's consider using the `glossario` R package for this, and borrowing from the Carpentries English glossary? This addresses comments made by Rev E. MINOR/MAJOR?
Ensure book is written for intended audience. Read through reviewer C's annotations and address highlighted parts where book appears to be written for other instructors rather than for students. This will address comments made by Rev C. MINOR/MAJOR?
Summaries of where we are going. Read through the book, and ensure there is a summary of where we are going for each walkthrough example in the book (especially 6, 7, 8). This will address comments made by Rev D. MINOR/MAJOR
Clarify examples from the universe of possibilities. Read through the book, and clarify where what we are discussing is meant to illustrate an example versus the universe of possibilities – it is worth being more specific about things that are just high-level examples (e.g. KNN versus all classification algorithms). For example, make it clear which parts of chapter 7 are relevant to classification in general, and which are relevant to just K-NN. MINOR/MAJOR
Stable domain and links. Get a domain, that we can use to come up with stable links for linking to videos and worksheets from the textbook. MINOR
Exercises/worksheets. Create repository for just worksheets to be associated with the textbook. (Bonus 1: Perhaps we can use Binder or a Public JupyterHub to make them interactive? Bonus 2: Add GitHub Actions to the repo to use Jupytext to autogenerate Rmd's of the worksheets for folks who'd prefer to work with Rmd instead of Jupyter?) Point to the relevant worksheet at the end of each chapter using a stable link. Remember, when we point to the worksheets, to also point to the system setup chapter so they can first follow the instructions of setting up Jupyter on their own machine. (+1 Reviewer E and B) MINOR
Improve additional resources sections. Add a few sentences to give context to each additional resource we share. This means going beyond just annotating them (we should also do that) to also explain what topic readers should focus on next. Rebut: we will not add a where-to-go-from-here chapter, asked for by reviewer ?, but will instead do a better job at the end of each chapter, on a topic-by-topic basis. MINOR
Update to the newest `dplyr` and `ggplot2`. Check the most recent updates to `dplyr` and `ggplot2` and make sure we are using the most up-to-date syntax in the book. If this needs to change, also change it in the worksheets. This addresses comments made by Rev A & E. MINOR
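For concreteness, one of the dplyr 1.0.0 changes this item likely covers is the move from the scoped verbs (`summarise_at()`, `mutate_if()`, etc., now superseded) to `across()`. A minimal sketch using the built-in `mtcars` data; the actual verbs to update in the book still need to be audited:

```r
library(dplyr)

# Pre-1.0.0 style: scoped verbs like summarise_at() (superseded)
old_way <- mtcars %>%
  summarise_at(vars(mpg, hp), mean)

# dplyr >= 1.0.0 style: across() inside a regular verb
new_way <- mtcars %>%
  summarise(across(c(mpg, hp), mean))
```

Both return the same one-row summary; `across()` is the "latest and greatest" form the reviewers are pointing at.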
Ensure we just teach one way to do things. Read through the book and check that we only teach one way to do things (although we can acknowledge there are many ways): specifically, find where we have shown alternatives and make sure we are clear that the way we want them to do it is with the `tidyverse`. This addresses comments made by Rev C. MINOR
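As a hypothetical example of the kind of duplication to hunt for (the specific passages still need to be found): wherever the book shows both base R subsetting and the equivalent tidyverse verb for the same task, we would keep only the tidyverse version and merely acknowledge the alternative. Sketch with the built-in `mtcars` data:

```r
library(dplyr)

# Base R alternative (acknowledge it exists, but don't teach it):
base_way <- mtcars[mtcars$mpg > 30, ]

# The one tidyverse way the book standardizes on:
tidy_way <- mtcars %>% filter(mpg > 30)

nrow(tidy_way)  # 4 cars have mpg above 30
```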
Improve regression introduction. Lead into regression more gently and rely less on past knowledge about classification (8.3 is a bit bizarre; it should be combined with 8.4). This will address comments made by Rev C. MINOR
Add vector data types explanations. Add base R vector types (logical, integer, double, character) introduction and explanation to section 3.3.2 ("What is a vector?") in chapter 3. Add factor vector type explanation in chapter 4 (visualization chapter), when we need it. This will address comments made by Rev C & E. MINOR
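A sketch of the minimal content this addition might cover (following the Advanced R chapter linked above); the exact examples for the book are still to be written:

```r
# The four common atomic vector types in R
lgl <- c(TRUE, FALSE)        # logical
int <- c(1L, 2L, 3L)         # integer
dbl <- c(1.5, 2.5)           # double
chr <- c("data", "science")  # character

typeof(lgl)  # "logical"
typeof(dbl)  # "double"

# For chapter 4: factors are built on integer vectors,
# with a levels attribute holding the categories
f <- factor(c("low", "high", "low"))
typeof(f)  # "integer"
levels(f)  # "high" "low" (sorted alphabetically by default)
```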
Forward references & unexplained concepts. Go through book and look for forward references or unexplained concepts. MINOR
Make it clear web scraping is optional / add APIs. Add a note to make it clear that web scraping is optional. Fix the wrong definition of web scraping. Add a subsection on web APIs (we could use this one: https://cran.r-project.org/web/packages/cancensus/vignettes/cancensus.html). MINOR
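For the new web API subsection, the key idea to convey is that APIs return structured data (usually JSON) that can be parsed directly, with no HTML/CSS knowledge needed. A self-contained sketch with a made-up payload and toy numbers (the real example would presumably follow the cancensus vignette linked above):

```r
library(jsonlite)

# A made-up JSON payload of the kind a web API might return
json <- '{"languages": [
  {"name": "English", "speakers": 100},
  {"name": "French",  "speakers": 50}
]}'

# fromJSON() simplifies the array of objects into a data frame
parsed <- fromJSON(json)
parsed$languages$name  # "English" "French"
```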
These things we are purely rebutting, not changing.
Tying together an intro to the `tidyverse` (like R for Data Science) and a classic introduction to modelling (like ISLR) and showing the complete workflow. The book can be crisper on each step of the process and provide a view on the algorithmic, technical, and creative sides of data science. My favorite parts of the book came when the authors took a strong and opinionated stance (e.g. on the right way to effectively explain a visualization), and I would find the whole book significantly more valuable if this voice were present throughout.
@ttimbers @leem44 this is done now, right?
Closing; we can re-open if needed.
@trevorcampbell @leem44 - let's list the global/big picture content revisions asked for in review here. When adding please do not duplicate things, just edit the list noting that it was asked for by X number of reviewers.