fivethirtyeight / data

Data and code behind the articles and graphics at FiveThirtyEight
https://data.fivethirtyeight.com/
Creative Commons Attribution 4.0 International
16.74k stars 10.94k forks source link

Friday the 13th and Holidays #226

Closed hathawayj closed 1 year ago

hathawayj commented 5 years ago

In this article you discuss using this data to calculate the percent change. You say that you remove the data from months that have major holidays.

It excludes any months in which the 6th, 13th or 20th day fell on a major holiday.

I can't quite get my numbers to match, and it is the Mondays that are the numbers that are way off. I am guessing that my major holiday list is different. Could you provide a list of the holidays you used for your filtering?

hathawayj commented 5 years ago

Hello.

dhruv100 commented 3 years ago

I want to try this issue please guide me as I am new

hathawayj commented 3 years ago

It has been a long time since I looked at this. What guidance do you want?

Here is how I started.

pacman::p_load(lubridate, timeDate, stringr, tidyverse, chron, ggthemes)

ssa <- read_csv("https://github.com/fivethirtyeight/data/raw/master/births/US_births_2000-2014_SSA.csv") %>% mutate(data = "ssa")
cdc <- read_csv("https://github.com/fivethirtyeight/data/raw/master/births/US_births_1994-2003_CDC_NCHS.csv") %>% mutate(data = "cdc")

byears <- sort(unique(c(ssa$year, cdc$year)))
hdays <- holidayNYSE(byears)
#hdays <- holiday(byears,c("USNewYearsDay", 1"USInaugurationDay", "USMLKingsBirthday", "USLincolnsBirthday", "USWashingtonsBirthday", "USMemorialDay", "USIndependenceDay", "USLaborDay", "USColumbusDay", "USElectionDay", "USVeteransDay", "USThanksgivingDay", "USChristmasDay", "USCPulaskisBirthday", "USGoodFriday") )

hdays <- tibble(year = year(slot(hdays, "Data")), month = month(slot(hdays, "Data")), date_of_month = mday(slot(hdays, "Data")), day_of_week = wday(slot(hdays, "Data"), week_start = 1, label = FALSE), holiday = TRUE)

dat <- bind_rows(ssa,cdc) %>% 
  left_join(hdays) %>%
  mutate(holiday = ifelse(is.na(holiday), FALSE, TRUE))

Their article says,

Births on the 13th of the month are lower than you’d expect, but especially on Fridays; the effect is smallest when the 13th falls on a weekend, when delivery wards are staffed more thinly and tend to schedule fewer births.x

This calculation is based on births data from the National Center for Health Statistics for 1994 through 2003, and for 2000 through 2014 from the Social Security Administration. It excludes any months in which the 6th, 13th or 20th day fell on a major holiday. It averages the data from both sources, which differ by a percentage point or two on total number of births and by less on the 13th-of-the-month effect, for the years in which they overlap.

pdat <- dat %>%
  filter(date_of_month %in% c(6, 13, 20)) %>%
  group_by(year, month, date_of_month, day_of_week, holiday) %>%
  # It averages the data from both sources,
  summarize(births = mean(births)) %>%
  # exclude months that have a holiday on 6th, 13th, or 20th
  mutate(ym = str_c(year,"_", date_of_month)) %>%
  filter(ym %in% ym[holiday == FALSE]) %>%
  group_by(date_of_month) %>%
  summarize(n = sum(births), Monday = sum(births[day_of_week == 1]), Tuesday = sum(births[day_of_week == 2]),
            Wednesday = sum(births[day_of_week == 3]), Thursday = sum(births[day_of_week == 4]), 
            Friday = sum(births[day_of_week == 5]), Saturday = sum(births[day_of_week == 6]),
            Sunday = sum(births[day_of_week == 7])) %>%
  ungroup() %>%
  mutate(thirteenth = ifelse(date_of_month == 13, "Thirteenth", "Not")) %>%
  group_by(thirteenth) %>%
  summarise_at(vars(n:Sunday),"mean") %>%
  gather(key = "weekday", value = "value", -n, -thirteenth) %>%
  select(-n) %>%
  spread(key = thirteenth, value = value) %>%
  mutate(pdiff = (Thirteenth - Not)/Not, weekday = factor(weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))

pdat %>%
  ggplot(aes(x = weekday, y = pdiff)) +
  geom_col(fill = "hotpink") +
  theme_gray()