BoC-PaymentsResearch / CPMI_stats

gini_coefficient is breaking with "NAs produced by integer overflow" #1

Closed bedantaguru closed 5 years ago

bedantaguru commented 5 years ago

The sample data is here.

The system I'm running this on is equipped with 8 GB of RAM.

> print(gini_coefficient(readRDS("data_masked.rds")))
Warning messages:
1: In df$num_payments[which(df$from == x)] * df$num_payments[which(df$from !=  :
  NAs produced by integer overflow
2: In df$num_payments[which(df$from == x)] * df$num_payments[which(df$from !=  :
  NAs produced by integer overflow
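
For context, the warning above is what R emits when a product of two integer vectors exceeds the 32-bit integer range. A minimal sketch of the effect follows; the values `a` and `b` are made up for illustration and are not taken from the sample data.

```r
# R integers are 32-bit, so products above .Machine$integer.max become NA.
a <- 50000L   # hypothetical payment count
b <- 50000L   # hypothetical payment count

# integer * integer overflows: the result is NA and R raises the warning
overflowed <- suppressWarnings(a * b)
is.na(overflowed)   # TRUE

# Converting to double before multiplying keeps the value (2.5e9, no NA)
safe <- as.numeric(a) * as.numeric(b)
safe
```

Converting the counts with as.numeric() before multiplying is one common way to avoid the NA; whether it suits two_sums depends on how that function is written.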

bedantaguru commented 5 years ago

I debugged the code. It seems to be breaking at the two_sums call in agg_particips <- agg_particips[, .(sums_particip = two_sums(.SD)), by = .(date)]

bedantaguru commented 5 years ago

I modified the gini function as follows (I'm from the Reserve Bank of India):

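library(dplyr)  # distinct, group_by, summarise, inner_join, rename, mutate, arrange, select
library(purrr)  # map_df, map_dbl

gini_coefficient_RBI_Mod <- function(payments_data, max_debit_data){

  # If max_debit_data is not supplied, build it by calling max_liq_prov() for every sender
  if(missing(max_debit_data)){
    max_debit_data <- payments_data %>% distinct(from) %>% .[[1]] %>% map_df(~max_liq_prov(.x, payments_data, TRUE))
  }

  # Per date and sender: total value sent (P_j_s), payment count (m_j_s), ratio l_j_s = L_j_s / P_j_s
  # The count is stored as a double so later sums and products cannot overflow the integer range
  gd <- payments_data %>% group_by(date, from) %>% summarise(P_j_s = sum(value), m_j_s = as.numeric(n())) %>%
    inner_join(max_debit_data, by = c("date" = "date", "from" = "participant")) %>%
    rename(L_j_s = max_net_pos) %>% mutate(l_j_s = L_j_s/P_j_s)

  # Per date: total payment count (M_s) and value-weighted mean of l_j_s (miu_s)
  gdd1 <- gd %>% group_by(date) %>% summarise(M_s = sum(m_j_s), miu_s = sum(P_j_s*l_j_s)/sum(P_j_s))

  # Numerator of the Gini coefficient: weighted sum of pairwise differences
  # x must be sorted in ascending order; wt holds the matching weights
  gini_top <- function(x, wt){
    ts <- seq_along(x)
    n <- length(x)
    ts[-n] %>% map_dbl(~sum((x[seq(.x+1, n)]-x[.x])*wt[seq(.x+1, n)]*wt[.x])) %>% sum
  }

  # Sort by l_j_s within each date before computing the numerator
  gdd2 <- gd %>% arrange(date, l_j_s) %>% group_by(date) %>% summarise(gtop = gini_top(l_j_s, m_j_s))

  gdd <- gdd1 %>% inner_join(gdd2, by = "date")

  # Gini coefficient per date
  gdd %>% mutate(gini = gtop/(M_s^2*miu_s)) %>% select(date, gini)

}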

It is working with our data.
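
To sanity-check the pairwise sum in gini_top outside the full pipeline, the sketch below re-implements it in base R and compares it with a brute-force sum over all pairs. The names gini_top_base and gini_top_ref and the toy vectors are hypothetical, introduced only for this illustration; they are not part of the package.

```r
# Base-R version of the gini_top helper (no purrr dependency).
# x must be sorted in ascending order; wt holds the matching weights.
gini_top_base <- function(x, wt) {
  n <- length(x)
  if (n < 2) return(0)
  wt <- as.numeric(wt)  # doubles avoid integer overflow in wt * wt
  total <- 0
  for (i in seq_len(n - 1)) {
    j <- (i + 1):n
    total <- total + sum((x[j] - x[i]) * wt[j] * wt[i])
  }
  total
}

# Brute-force reference: half the weighted sum of |x_i - x_j| over all ordered pairs.
gini_top_ref <- function(x, wt) {
  wt <- as.numeric(wt)
  sum(abs(outer(x, x, "-")) * outer(wt, wt)) / 2
}

x  <- c(0.1, 0.2, 0.4)  # toy values, already sorted
wt <- c(2L, 1L, 3L)     # toy integer counts

gini_top_base(x, wt)
gini_top_ref(x, wt)
```

Both calls should agree up to floating-point error, since for sorted x each unordered pair contributes the same weighted difference to either sum.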

derekbrito commented 5 years ago

Thank you for finding this error with the two_sums function and providing a solution. I have updated the function following your gini_top function. The code now works with the sample data that you have provided.