UBC-DSCI / introduction-to-datascience

Open Source Textbook for DSCI100: Introduction to Data Science in R

Other

50 stars 54 forks source link

Reviewer E:

~- In 2.4.2: when mention data being read in with default names, it might be good to introduce base R names() or dplyr::rename()~ -~In 2.4.3: Long URL spills out of code box~ ~- In 2.4.3.: “This is also demonstrated in the video below” -- consider adding conditional logic so this only shows up in the web version and not the PDF. Alternatively, since video content won’t be available to readers, consider if supplemental explanation is need~

On databases:

~“Thankfully, you don’t need to know SQL” may imply SQL is harder than it is. Consider rephrasing to not scare students off of something they will almost definitely need to know if they pursue data analysis!~
~This section may require more motivation. Students may not be clear when or how they could encounter a database.~
~It might be worth showing an example of dbplyr::show_query() just to illuminate how dbplyr works and why not all R functions work with it. Although students will not know SQL, showing this could illuminate that R is in fact translating every line of code to another language~

On web scrapping: ~- In discussion of checking for legality of scraping, could mention R package {polite}~

~Remove text on p. 45 clearly meant for class notes not book -- “this is optional not required reading” -- and consider how to supplement formerly video content~
I would seriously consider if this section is necessary. There are so many great APIs these days (and many wrapper R packages!), I think that is potentially a more useful skill. Additionally, an increasing number of pages use so much Javascript that {rvest} alone does not work and Selenium is required.
- ML: I think we had a lot of discussion about this already and opted to include

~p.44: just FYI on tiny tipo, double “is is”~

Reviewer B:

I very much like how Timbers gives students a taste of web scraping. Web-scraping is a practice that more and more students are encountering earlier and earlier in their careers, where as “back in my day” it wasn’t. In my experience, students get very excited when they can scrape tables off Wikipedia, imdb, and other popular websites.

SQL!!! I always leave this as a throw-away topic at the end of my course and thus never cover it as well as I would like. Great to see how Timbers introduces it early.

ML: no changes needed here

Reviewer D

Reading in data locally and from the web -- “Locally” is a technical term – can it be replaced with “from your computer”

ML: I think "locally" is OK and we use the terms "remote" and "local" so I don't think we should change this

Can this chapter be split into three smaller chapters: Reading data into R locally; Reading or scraping data from the web; Reading data into R from a database; Exporting data from R?

ML: I don't think splitting this chapter would work well with the flow of the book so didn't change

Current Learning Objectives are too long and detailed – they need to be simplified

ML: I think detailed, specific learning outcomes are good, but happy to discuss if others disagree

~Chapter on Reading data into R locally would benefit from stating upfront what types of spreadsheet-like files will be read in (e.g., csv, .xlsx) and what R functions will be used to read these in. A little table would be handy.~

~~Data File Type~~	R Function	R Package
.csv	read_csv()	readr
.xlsx	read_excel()	readxl

~~Data File Type~~

R Function

R Package

.csv

read_csv()

readr

.xlsx

read_excel()

readxl

~Then the authors can explain that each of these functions has several options which allow the user to specify the name and path of the datafile, etc. (I feel that starting directly by explaining what a path is misses the point that the path is an important option of a data import function.~
~Would the authors consider mentioning the here package of R for specifying paths more robustly across projects?~
~The section on “Reading tabular data from a plain text file into R” should make it clear that what is actually read into R is a comma separated file.~
~Should “Reading data from an Microsoft Excel file” actually be “Reading data from a Microsoft Excel file”?~

Reviewer A

page 25: Should readers have access to this (worksheet_02.ipynb)? If so, I think clearly stating this somewhere is key. I agree that knowing the difference between absolute and relative file paths is critical. ~It might be worth mentioning the here package somewhere, even if it's not used~

ML: mentioned this in another issue #125

~on p31, at the beginning or 2.4.3: I expected to see this discussion (of URLs) after relative vs. absolute file paths. Is it in the right spot here?~

~Full URL doesn't fit on the page, consider using a URL shortener or posting the CSV in another location~

page 33: The video says you can click on the data to open it in the editor, but this doesn't match the text. Be sure to update the video before publication

ML: mentioned this in another issue #189

~p35: "if the xlsx has multiple sheets..." Not much detail here, but a nice pointer. Are there exercises that help the reader explore this?~

p 39: The reader won't be able to connect to this DB, so warn them!

TC: we should just point them to a DB they can connect to... -ML: mentioned this in another issue #187

~footnote on p43: This URL isn't sufficient~

~top of p47 "In the case of this example node": It might be worth slowing down here and having the reader extract this string, then working through it before putting the pieces together. The html is too complicated.~

Reviewer C

~Overview: I’m finding this construction a bit awkward - can you introduce the term “tabular data” earlier (I think it’s more common)?~
section 2.3: I think this is misleading: there’s one location that is described by many paths. (If you say “the path is the place”, then it’s hard to explain multiple relative paths.)
Fig 2.1: Tree diagrams of filesystems make sense to programmers, but I think most normal people are used to indented lists as shown in file browsers…
- ML: I don't quite agree with the above two comments, but happy to discuss if others disagree
~udacity video link (Maybe put these into “further reading and viewing” at the end of the chapter for better findability?)~
~section 2.4: "and here’s the term I’m looking for…" and "1. Have you explained the difference between these? 2. Is it important? 3. If it isn’t, can you use just one term??~
Note in section 2.4: OK, this is the explanation I ask for earlier… Can you please show some of the messages so they know what you’re referring to?
- ML: mentioned this in another issue #188
~section 2.4.1: I’m often asked at this point, “How can I tell how many lines to skip?” The answer is, “You have to read the file yourself and hard-code it,” but I think it needs to be said.~
~section 2.4.2: Is this the right time to introduce this? I think it will confuse learners until they have more practice reading plain old CSV files…~
- ML: added an example on tsv files to familiarize students with more vanilla read_ options yet is different from read_csv*
~section 2.4.2: Do you think your intended learners will understand \t at this point? Escape characters haven’t been introduced and aren’t obvious…~
~section 2.4.3: urlL > URL please (it’s an acronym)~
~section 2.5: Will the intended audience ever have worked with a plain text editor? (I often have learners who haven’t, and who can’t easily open a .xlsx file in any application that doesn’t understand it…)~
~End of section 2.5: “Why is that sad?” asked the learner…~
~section 2.6 where we try to explain the difference between the dbplyr object (before collect) and a data frame: I don’t think your intended audience will understand the difference, since it looks like they just got the data. A diagram showing the copying may help.~ -~Note in 2.6.1.1: "So what am I supposed to do in that case?” asked the learner, feeling slightly frightened by the warning but unsure how to proceed.~
~section 2.6.1.2: I don’t know what “artificial database” means.~
~section 2.6.2: “So what’s the point of dbplyr if I have to bring things into memory most of the time anyway?” asked the learner.~
~"Why should we bother with a database at all?" section: I think this needs to be moved higher, before the disclaimers about “this function doesn’t work with dbplyr, so you have to load the data”.~
- ML: didn't move this but added a line about the benefits since some things in this section may be hard for readers to understand without reading the other sections first
~Section 2.8: Will your intended learners understand the use of ‘*’ here?~
~Section 2.8: I don’t think of using APIs as scraping - in fact, I think it’s the opposite of scraping.~
~Section 2.8.2: All caps and bold is a bit shouty…~
~Note in section 2.8.2: I usually avoid labeling things “interesting” - that’s up to the reader to decide.~
Section 2.8.2: If you’re not going to scrape Craigslist, I suggest making that decision much earlier and using something else to explain selectors rather than getting this far and then pulling the plug.
section 2.8.3: It takes me a hard two hours in class to cover this with people who aren’t already web programmers - I think the authors are being very optimistic squashing it into a few pages like this.
- ML: I think we discussed this comment above and opted to include it