NickCH-K / causaldata

Packages of Example Data for The Effect

Data Discrepancy between R and Python #4

Status: Closed (s3alfisc closed this issue 1 year ago)

s3alfisc commented 1 year ago

Hi Nick,

I have started to test pyfixest on "real world datasets" and am replicating all code examples in "The Effect".

I have noticed a small discrepancy between the Python and R versions of the data, either in the source data or in how it is processed, for one of the examples in Chapter 13 on regression.

Here is a reproducible example:

Python:

from causaldata import restaurant_inspections
res = restaurant_inspections.load_pandas().data

R:

library(reticulate)
library(tidyverse)

res <- causaldata::restaurant_inspections

res <- res %>%
    # Create NumberofLocations
    group_by(business_name) %>%
    mutate(NumberofLocations = n())

py_res <- py$res

all.equal(res, py_res, check.attributes = FALSE)
# [1] "Component “NumberofLocations”: Mean relative difference: 0.6426152"

So either the source data set or the way the NumberofLocations variable is computed differs.

I have installed the most up-to-date version of all packages.

Best, Alex

NickCH-K commented 1 year ago

The NumberofLocations variable is externally sourced information about the number of locations in the chain. Not all of those locations necessarily appear in the data, and the same location may appear multiple times because of multiple inspections, so a count of the number of rows per business name shouldn't be expected to give the number of locations in the chain. You get the same result when comparing against the NumberofLocations variable in the R version alone:

library(causaldata)
library(dplyr)

data(restaurant_inspections)

restaurant_inspections <- restaurant_inspections %>%
  group_by(business_name) %>%
  mutate(new_num_locations = n())

all.equal(restaurant_inspections$new_num_locations, restaurant_inspections$NumberofLocations, check.attributes = FALSE)
# [1] "Mean relative difference: 0.6426152"
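For Python readers, the same point can be sketched on a tiny made-up table. The rows below are hypothetical and are not taken from restaurant_inspections; they only illustrate how a per-name row count can land below the external figure (not every location appears in the data) or above it (one location inspected several times).

```python
from collections import Counter

# Hypothetical toy rows: one row per inspection, not one row per location.
# "NumberofLocations" stands in for the externally sourced chain size.
inspections = [
    {"business_name": "Chain A", "NumberofLocations": 5},
    {"business_name": "Chain A", "NumberofLocations": 5},
    {"business_name": "Diner B", "NumberofLocations": 1},
    {"business_name": "Diner B", "NumberofLocations": 1},
    {"business_name": "Diner B", "NumberofLocations": 1},
]

# Rows per business name, the analogue of group_by(business_name) + n().
rows_per_name = Counter(row["business_name"] for row in inspections)

# External location count recorded for each business name.
external = {row["business_name"]: row["NumberofLocations"] for row in inspections}

# Names where the row count and the external figure disagree,
# stored as (row count, external count).
mismatches = {
    name: (rows_per_name[name], external[name])
    for name in external
    if rows_per_name[name] != external[name]
}
print(mismatches)
```

On the real pandas data frame, a comparable per-name row count could presumably be built with something like `res.groupby("business_name")["business_name"].transform("size")`, though that was not run as part of this thread.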

So thankfully I think this one is okay! Thanks for checking though.

s3alfisc commented 1 year ago

Ah I see. Makes excellent sense. Thanks for the feedback! =)