UBC-DSCI / introduction-to-datascience

Open Source Textbook for DSCI100: Introduction to Data Science in R
https://datasciencebook.ca/
Other
50 stars 54 forks source link

Review: Ch 2 (reading) #102

Closed leem44 closed 2 years ago

leem44 commented 3 years ago

Reviewer E:

~- In 2.4.2: when mention data being read in with default names, it might be good to introduce base R names() or dplyr::rename()~ -~In 2.4.3: Long URL spills out of code box~ ~- In 2.4.3.: “This is also demonstrated in the video below” -- consider adding conditional logic so this only shows up in the web version and not the PDF. Alternatively, since video content won’t be available to readers, consider if supplemental explanation is need~

leem44 commented 3 years ago

Reviewer B:

ML: no changes needed here

trevorcampbell commented 3 years ago

Reviewer D

Data File Type R Function R Package
.csv read_csv() readr
.xlsx read_excel() readxl

  • ~Then the authors can explain that each of these functions has several options which allow the user to specify the name and path of the datafile, etc. (I feel that starting directly by explaining what a path is misses the point that the path is an important option of a data import function.~
  • ~Would the authors consider mentioning the here package of R for specifying paths more robustly across projects?~
  • ~The section on “Reading tabular data from a plain text file into R” should make it clear that what is actually read into R is a comma separated file.~
  • ~Should “Reading data from an Microsoft Excel file” actually be “Reading data from a Microsoft Excel file”?~
trevorcampbell commented 3 years ago

Reviewer A

  • page 25: Should readers have access to this (worksheet_02.ipynb)? If so, I think clearly stating this somewhere is key. I agree that knowing the difference between absolute and relative file paths is critical. ~It might be worth mentioning the here package somewhere, even if it's not used~
    • ML: mentioned this in another issue #125
  • ~on p31, at the beginning or 2.4.3: I expected to see this discussion (of URLs) after relative vs. absolute file paths. Is it in the right spot here?~
  • ~Full URL doesn't fit on the page, consider using a URL shortener or posting the CSV in another location~
  • page 33: The video says you can click on the data to open it in the editor, but this doesn't match the text. Be sure to update the video before publication
    • ML: mentioned this in another issue #189
  • ~p35: "if the xlsx has multiple sheets..." Not much detail here, but a nice pointer. Are there exercises that help the reader explore this?~
  • p 39: The reader won't be able to connect to this DB, so warn them!
    • TC: we should just point them to a DB they can connect to... -ML: mentioned this in another issue #187
  • ~footnote on p43: This URL isn't sufficient~
  • ~top of p47 "In the case of this example node": It might be worth slowing down here and having the reader extract this string, then working through it before putting the pieces together. The html is too complicated.~
ttimbers commented 3 years ago

Reviewer C

  • ~Overview: I’m finding this construction a bit awkward - can you introduce the term “tabular data” earlier (I think it’s more common)?~
  • section 2.3: I think this is misleading: there’s one location that is described by many paths. (If you say “the path is the place”, then it’s hard to explain multiple relative paths.)
  • Fig 2.1: Tree diagrams of filesystems make sense to programmers, but I think most normal people are used to indented lists as shown in file browsers…
    • ML: I don't quite agree with the above two comments, but happy to discuss if others disagree
  • ~udacity video link (Maybe put these into “further reading and viewing” at the end of the chapter for better findability?)~
  • ~section 2.4: "and here’s the term I’m looking for…" and "1. Have you explained the difference between these? 2. Is it important? 3. If it isn’t, can you use just one term??~
  • Note in section 2.4: OK, this is the explanation I ask for earlier… Can you please show some of the messages so they know what you’re referring to?
    • ML: mentioned this in another issue #188
  • ~section 2.4.1: I’m often asked at this point, “How can I tell how many lines to skip?” The answer is, “You have to read the file yourself and hard-code it,” but I think it needs to be said.~
  • ~section 2.4.2: Is this the right time to introduce this? I think it will confuse learners until they have more practice reading plain old CSV files…~
    • ML: added an example on tsv files to familiarize students with more vanilla read_ options yet is different from read_csv*
  • ~section 2.4.2: Do you think your intended learners will understand \t at this point? Escape characters haven’t been introduced and aren’t obvious…~
  • ~section 2.4.3: urlL > URL please (it’s an acronym)~
  • ~section 2.5: Will the intended audience ever have worked with a plain text editor? (I often have learners who haven’t, and who can’t easily open a .xlsx file in any application that doesn’t understand it…)~
  • ~End of section 2.5: “Why is that sad?” asked the learner…~
  • ~section 2.6 where we try to explain the difference between the dbplyr object (before collect) and a data frame: I don’t think your intended audience will understand the difference, since it looks like they just got the data. A diagram showing the copying may help.~ -~Note in 2.6.1.1: "So what am I supposed to do in that case?” asked the learner, feeling slightly frightened by the warning but unsure how to proceed.~
  • ~section 2.6.1.2: I don’t know what “artificial database” means.~
  • ~section 2.6.2: “So what’s the point of dbplyr if I have to bring things into memory most of the time anyway?” asked the learner.~
  • ~"Why should we bother with a database at all?" section: I think this needs to be moved higher, before the disclaimers about “this function doesn’t work with dbplyr, so you have to load the data”.~
    • ML: didn't move this but added a line about the benefits since some things in this section may be hard for readers to understand without reading the other sections first
  • ~Section 2.8: Will your intended learners understand the use of ‘*’ here?~
  • ~Section 2.8: I don’t think of using APIs as scraping - in fact, I think it’s the opposite of scraping.~
  • ~Section 2.8.2: All caps and bold is a bit shouty…~
  • ~Note in section 2.8.2: I usually avoid labeling things “interesting” - that’s up to the reader to decide.~
  • Section 2.8.2: If you’re not going to scrape Craigslist, I suggest making that decision much earlier and using something else to explain selectors rather than getting this far and then pulling the plug.
  • section 2.8.3: It takes me a hard two hours in class to cover this with people who aren’t already web programmers - I think the authors are being very optimistic squashing it into a few pages like this.
    • ML: I think we discussed this comment above and opted to include it
leem44 commented 2 years ago

these are all addressed