IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Reporting an interesting problem in the Data Science book? #8573

Open rdstern opened 1 year ago

rdstern commented 1 year ago

@lilyclements could you please check that the error I think I have found in a dataset in the Introduction to Data Science book is real and not my imagination? Perhaps even check it in RStudio?

The problem is a well-known example of Simpson's paradox, described here.

It is a 1973 example and the data are in the datasets package as UCBAdmissions. Nice. Even nicer for me is that they are also in the dslabs package, under admissions, which supplies all the data used in the 2 books - and which we will use on our AIMS course.

So I read the data into R-Instat and here it is:

[image: screenshot of the admissions data as imported into R-Instat]

I claim the 3rd column is very obviously wrong! One value stands out: for women applying to major B, 68 were admitted out of the 25 who applied! And in general, overall about 37% or 44% were admitted, and the numbers in the 3rd column are obviously too small.
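One way to check the numbers independently of R-Instat is to recompute them from the UCBAdmissions table that ships with base R. The sketch below is only an illustration, and the object names `admitted`, `applicants` and `percent_admitted` are invented for it. It assumes that UCBAdmissions and the dslabs admissions data describe the same 1973 applicants. If the percentages computed here match the 3rd column seen in R-Instat, that column may hold the percentage admitted rather than a count, which would be worth confirming before writing to the author.

```r
# UCBAdmissions is a 2 x 2 x 6 table in the datasets package,
# with dimensions Admit, Gender and Dept.

# Number admitted in each Gender x Dept cell.
admitted <- UCBAdmissions["Admitted", , ]

# Total applicants (admitted + rejected) in each Gender x Dept cell.
applicants <- margin.table(UCBAdmissions, c(2, 3))

# Percentage admitted, rounded to whole numbers.
percent_admitted <- round(100 * admitted / applicants)

# Women applying to major (department) B: 17 admitted of 25, i.e. 68%.
c(admitted   = admitted["Female", "B"],
  applicants = applicants["Female", "B"],
  percent    = percent_admitted["Female", "B"])
```

Running the same lines in RStudio and in R-Instat should give identical results, which would also help rule out an import problem.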

Am I importing wrongly into R-Instat? Is there a bug somewhere? If not, we can compose a message to the author, perhaps with a request that he adds the correct data while leaving a copy of these also available. It is a great example of the importance of checking the data, no matter where it comes from!

And we could tell him also how we will be using his book in our forthcoming AIMS course.

David has now checked and I think we could write! On his page for the book it is very easy. If you click on Report an Issue you go straight to GitHub - and we will be the first issue! David agrees that perhaps you could write?

rdstern commented 1 year ago

I think I have another interesting issue, this time with the research funding rates. This is described in Chapter 15, section 15.10 of the book. The program is given here (on my machine it is in C:\Program Files\R-Instat\0.7.16\static\R\library\dslabs\script). This includes the full script for the research funding rates, which is given in sections in the book.

library(tidyverse)
library(tidyr)
library(stringr)
library(readr)

## Download the table
library("pdftools")
temp_file <- tempfile()
url <- "http://www.pnas.org/content/suppl/2015/09/16/1510159112.DCSupplemental/pnas.201510159SI.pdf"
download.file(url, temp_file)
txt <- pdf_text(temp_file)
file.remove(temp_file)

raw_data_research_funding_rates <- txt[2]

save(raw_data_research_funding_rates, file="data/raw_data_research_funding_rates.rda", compress="xz")

## Get the names
tab <- str_split(raw_data_research_funding_rates, "\n")[[1]]
the_names_1 <- tab[3]
the_names_2 <- tab[4]

the_names_1 <- the_names_1 %>%
  str_trim() %>%
  str_replace_all(",\\s[n|%]", "") %>%
  str_split("\\s{2,}", simplify = TRUE)

the_names_2 <- the_names_2 %>%
  str_trim() %>%
  str_split("\\s+", simplify = TRUE)

tmp_names <- str_c(rep(the_names_1, each = 3), the_names_2[-1], sep = "_")
the_names <- c(the_names_2[1], tmp_names) %>%
  str_to_lower() %>%
  str_replace_all("\\s", "_")

## Create the table
research_funding_rates <- tab[6:14] %>%
  str_trim %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>%
  mutate_at(-1, parse_number)

save(research_funding_rates, file="data/research_funding_rates.rda", compress="xz")

However, it doesn't run (at least in R-Instat) and I think the problem is that the URL on line 9 no longer gives the table as a PDF. It is possible to get the table via that URL interactively, but it is now an HTML table instead. I think it will also give an error in RStudio, and I can no longer find a PDF version of the table!

I have just made a nice practical of this problem. I think the code could be adapted, because the data file is supplied anyway. I looked for that file.
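One possible adaptation, sketched here under assumptions, is to wrap the download step so that a failure falls back to the stored copy of the raw text. The helper `with_fallback` and the two stand-in functions are hypothetical names used only for illustration; they are not part of dslabs or R-Instat. The stand-ins replace the real download and the real data file so that the sketch runs without a network connection or extra packages.

```r
# Hypothetical helper (not part of dslabs or R-Instat): run `primary`, and if
# it fails with an error, report the message and run `fallback` instead.
with_fallback <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      message("Primary source failed: ", conditionMessage(e))
      fallback()
    }
  )
}

# Stand-in for the download.file() + pdf_text() step when the URL no longer
# serves a PDF; here it simply raises an error.
broken_download <- function() stop("URL did not return a PDF")

# Stand-in for loading the copy of the raw text supplied with the package.
stored_copy <- function() "raw text from the supplied data file"

raw_text <- with_fallback(broken_download, stored_copy)
raw_text
```

In the real script, `primary` would wrap the download and `pdf_text()` lines, and `fallback` would load the supplied raw data, assuming it is stored under the name used in the script's `save()` call (`raw_data_research_funding_rates`). The rest of the script, from "Get the names" onwards, could then stay unchanged.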

Perhaps this is another item to report?