Robinlovelace / spatial-microsim-book

Code, data and prose of the book: Spatial Microsimulation with R
https://www.crcpress.com/Spatial-Microsimulation-with-R/Lovelace-Dumont/p/book/9781498711548
MIT License
114 stars 77 forks source link

ipfp function generate NaN weights #141

Closed PJPJPJPJPJ closed 6 years ago

PJPJPJPJPJ commented 6 years ago

I am using the ipfp package to generate a synthetic population for 100 zones, a total of 523,930 people. My sample size is 24,000. I have 4 constraints: age (5 levels), sex (2 levels), race (4 levels) and employment status (3 levels)。

I have made sure that the totals across different constraints are consistent. I basically use the same code that's provided by the code to do the microsimulation.

weights_ga <- array(NA, dim=c(nrow(ind_ga),nrow(cons_ga)))

ga_ind_agg <- matrix(colSums(ga_ind_cat), nrow(cons_ga), ncol(cons_ga), byrow = T)

# Iterative proportional fitting (IPF) stage
library(ipfp) # load the ipfp package -may need install.packages("ipfp")
cons_ga <- apply(cons_ga, 2, as.numeric) # convert the constraints to 'numeric'

ga_ind_catt <- t(ga_ind_cat) # transpose the dummy variables for ipfp

x0 <- rep(1, nrow(ind_ga)) # set the initial weights

weights_ga_f <- apply(cons_ga, 1, function(x) ipfp(x, ga_ind_catt, x0))

Created on 2018-08-16 by the reprex package (v0.2.0).

`` And my weights looks like this

screen shot 2018-08-16 at 6 27 35 pm

What could be possibly going wrong? I have several 0s in the constraints, but there are not many of them. I tried to replace them to 0.0001, but still doesn't work.

Robinlovelace commented 6 years ago

Could you create a minimal reproducible example? Without that it's hard to debug.

PJPJPJPJPJ commented 6 years ago

Hi Dr. Lovelace, here's a reproducible example. My data was too big, so I randomly sampled a very very small portion of it.

age4 <- c(346,81,435)

age5_19 <- c(580,420,1730)

age20_34 <- c(726,354,1321)

age35_64 <- c(1296,823,2567)

age65 <- c(2028,1112,2883)

male <- c(2448,1349,4273)

female <- c(2528,1441,4663)

White <- c(3464,2646,3912)

Black <- c(1443,7,3449)

Hispanic <- c(14,93,310)

Other_race <- c(55,44,1265)

employed <- c(1998,1476,4543)

unemployed <- c(116,51,319)

not_laborf <- c(2862,1263,4074)

cons_gaga <- data.frame(age4,age5_19,age20_34,age35_64,age65,male,female,
                        White,Black,Hispanic,Other_race,employed,unemployed,not_laborf)

age4 <- c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1)

age5_19 <- c(1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0)

age20_34 <- c(0,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,1,0,1,0,0,1,1,0,1,0,0,1,1,1,0,1,0,0,0,0,1,0,1,0,1,1,0,0,0,0,1,0,0,0,1,0,1,1,0,1,1,1,0,1,0,1,0,0,0,0,1,0,1,0,0,1,1,1,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,1,1,0,1,1,1,1,1,1,1,0,1,1,0,0,0,0,1,0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0,1,0,0,1,1,1,0,1,0,0,0,1,0,0,0,1,1,0,1,0,0,1,1,0,1,1,1,0,1,0,0,1,0,0,0,0,1,0)

age35_64 <- c(0,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0)

age65 <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)

male <- c(0,0,1,1,1,1,1,1,0,0,1,0,0,0,0,1,1,0,1,1,0,0,0,1,0,1,1,0,0,1,0,0,1,1,0,1,0,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,1,0,0,0,1,1,1,1,1,0,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,1,0,1,1,0,1,1,1,0,1,0,1,1,0,0,0,0,0,0,0,0,1,1,0,1,1,0,1,0,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,0,1,0,1,1,1,1,0,1,0,0,1,1,0,1,0,1,0,0,0,0,1,1,1,0,0,1,0,0,1,1,0,1,1,0,0,1,1,1,0,1,0,0,0,0,1,0,1,0,0,1,0,1,1,0,0,1,1,0,0,0,0,1,1,1)

female <- c(1,1,0,0,0,0,0,0,1,1,0,1,1,1,1,0,0,1,0,0,1,1,1,0,1,0,0,1,1,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,1,1,1,1,0,0,1,1,1,1,0,1,1,0,1,0,0,1,1,1,0,0,0,0,0,1,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,1,1,1,0,0,1,0,0,1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,0,0,0,1,0,1,0,0,0,0,1,0,1,1,0,0,1,0,1,0,1,1,1,1,0,0,0,1,1,0,1,1,0,0,1,0,0,1,1,0,0,0,1,0,1,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,0,1,1,1,1,0,0,0)

White <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0)

Black <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)

Hispanic <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0)

Other_race <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1)

employed <- c(1,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,0,0,1,0,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,1,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,1,0,1,0,0,0,1,0,1,0,0,0,1,1,0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,0,0,1,0,0,1,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0)

unemployed <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)

not_laborf <- c(0,1,1,0,1,1,0,0,1,1,1,0,1,1,1,1,1,0,1,0,1,0,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,1,1,0,1,0,0,1,1,1,1,1,0,0,0,1,1,1,1,0,1,0,1,1,1,0,1,1,0,0,1,0,1,0,1,1,0,0,1,0,0,0,0,1,1,0,0,1,0,1,1,1,1,1,1,1,0,0,1,1,1,0,1,0,1,0,0,0,0,1,0,1,1,1,0,1,0,1,1,1,0,0,1,1,0,1,0,0,1,0,1,0,0,1,0,1,1,1,1,1,0,1,0,0,1,0,0,1,1,0,0,1,1,1,1,1,1,0,1,1,1,1,0,1,0,1,1,1,1,1,1,0,1,0,1,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,1,0,0,1,0,1,1,1,1,0,1,0,1,1)

gaga_ind_cat <- data.frame(age4,age5_19,age20_34,age35_64,age65,male,female,White,Black,Hispanic,Other_race,employed,unemployed,not_laborf)

weights_gaga <- array(NA, dim=c(100,3))

gaga_ind_agg <- matrix(colSums(gaga_ind_cat), 3, 14, byrow = T)

library(ipfp) # load the ipfp package -may need install.packages("ipfp")

cons_gaga <- apply(cons_gaga, 2, as.numeric) # convert the constraints to 'numeric'

gaga_ind_catt <- t(gaga_ind_cat) # transpose the dummy variables for ipfp

x0 <- rep(1, 100) # set the initial weights

weights_gaga_f <- apply(cons_gaga, 1, function(x) ipfp(x, gaga_ind_catt, x0))

Created on 2018-08-18 by the reprex package (v0.2.0).

Thank you!

Robinlovelace commented 6 years ago

Heads-up: I'm making progress. See reprex based on http://spatial-microsim-book.robinlovelace.net/data-prep.html#flattening :+1:

devtools::install_github("ropensci/gistr")
#> Using GitHub PAT from envvar GITHUB_PAT
#> Skipping install of 'gistr' from a github remote, the SHA1 (61bd7fad) has not changed since last install.
#>   Use `force = TRUE` to force installation
source("https://gist.githubusercontent.com/Robinlovelace/b9c0207ca465885120ccfeaa70380545/raw/a760522c8aaea42c7e7e7b145d4a6849abb5ad6f/ipfp-qestion.R")

library(tidyverse)

# see http://spatial-microsim-book.robinlovelace.net/data-prep.html#flattening
sum(con_age)
#> Error in eval(expr, envir, enclos): object 'con_age' not found
sum(con_sex)
#> Error in eval(expr, envir, enclos): object 'con_sex' not found
sum(con_eth)
#> Error in eval(expr, envir, enclos): object 'con_eth' not found
sum(con_occ)
#> Error in eval(expr, envir, enclos): object 'con_occ' not found
colSums(con_age)
#> Error in is.data.frame(x): object 'con_age' not found
colSums(con_age)
#> Error in is.data.frame(x): object 'con_age' not found

cons = as.data.frame(cons_gaga)
ind_cat = gaga_ind_cat

# update to match names used in http://spatial-microsim-book.robinlovelace.net/smsimr.html#ipfinr
names(cons)
#>  [1] "age4"       "age5_19"    "age20_34"   "age35_64"   "age65"     
#>  [6] "male"       "female"     "White"      "Black"      "Hispanic"  
#> [11] "Other_race" "employed"   "unemployed" "not_laborf"
con_age = select_at(cons, .vars = vars(contains("age")))
con_sex = select_at(cons, .vars = vars(contains("ale")))
con_eth = select(cons, matches("[A-Z]", ignore.case = FALSE))
con_occ = select(cons, matches("emp|lab"))

rowSums(con_age) == rowSums(con_sex)
#> [1] TRUE TRUE TRUE

dim(cons)
#> [1]  3 14

colSums(ind_cat) # view the aggregated version of ind
#>       age4    age5_19   age20_34   age35_64      age65       male 
#>         26         36         88         50          0         95 
#>     female      White      Black   Hispanic Other_race   employed 
#>        105          7        187          3          3         71 
#> unemployed not_laborf 
#>          9        120
ind_agg <- colSums(ind_cat) # save the result
rbind(cons[1,], ind_agg) # test compatibility of ind_agg and cons
#>   age4 age5_19 age20_34 age35_64 age65 male female White Black Hispanic
#> 1  346     580      726     1296  2028 2448   2528  3464  1443       14
#> 2   26      36       88       50     0   95    105     7   187        3
#>   Other_race employed unemployed not_laborf
#> 1         55     1998        116       2862
#> 2          3       71          9        120

# from http://spatial-microsim-book.robinlovelace.net/smsimr.html#ipfinr
# Create intuitive names for the totals
n_zone <- nrow(cons) # number of zones
n_ind <- nrow(ind_cat) # number of individuals
n_age <-ncol(con_age) # number of categories of "age"
n_sex <-ncol(con_sex) # number of categories of "sex"

# Create initial matrix of categorical counts from ind 
weights <- matrix(data = 1, nrow = nrow(ind_cat), ncol = nrow(cons))
# create additional weight objects
weights1 <- weights2 <- weights 
ind_agg0 <- t(apply(cons, 1, function(x) 1 * ind_agg))
colnames(ind_agg0) <- names(cons)

# Assign values to the previously created weight matrix 
# to adapt to age constraint
for(j in 1:n_zone){
  for(i in 1:n_age){
    index <- ind_cat[, i] == 1
    weights1[index, j] <- weights[index, j] * con_age[j, i] 
    weights1[index, j] <- weights1[index, j] / ind_agg0[j, i]
  }
  print(weights1[1:2, ])
}
#>          [,1] [,2] [,3]
#> [1,] 16.11111    1    1
#> [2,] 25.92000    1    1
#>          [,1]     [,2] [,3]
#> [1,] 16.11111 11.66667    1
#> [2,] 25.92000 16.46000    1
#>          [,1]     [,2]     [,3]
#> [1,] 16.11111 11.66667 48.05556
#> [2,] 25.92000 16.46000 51.34000

# Create additional ind_agg objects
ind_agg2 <- ind_agg1 <- ind_agg0 * NA

# Assign values to the aggregated data after con 1
for(i in 1:n_zone){
  ind_agg1[i, ] <- colSums(ind_cat * weights1[, i])
}

rowSums(ind_agg1[, 1:2]) # the simulated populations in each zone
#> [1]  926  501 2165
rowSums(cons[, 1:2]) # the observed populations in each zone
#> [1]  926  501 2165

Created on 2018-08-20 by the reprex package (v0.2.0).

Robinlovelace commented 6 years ago

Update: continuing from where the code left off it continues to work with the 'IPFinR' technique but I can reproduce the issue. I don't think this is a problem with the book. Your input data does not work with ipfp, I do not know why. Closing for now as it's not an issue with the book.

for(j in 1:n_zone){
  for(i in 1:n_sex + n_age){
    index <- ind_cat[, i] == 1
    weights2[index, j] <- weights1[index, j] * cons[j , i] /
      ind_agg1[j, i]
  }
}

for(i in 1:n_zone){
  ind_agg2[i, ] <- colSums(ind_cat * weights2[, i])
}

cor(vec(ind_agg2), vec(cons))

library(ipfp) # load ipfp library after install.packages("ipfp")
cons <- apply(cons, 2, as.numeric) # to 1d numeric data type
ipfp(cons[1,], t(ind_cat), x0 = rep(1, n_ind)) # run IPF - generates NaNs
Robinlovelace commented 6 years ago

To confirm: it works in the book. I just ran all of chapter 5 code up to this point and I get:

> ipfp(cons[1,], t(ind_cat), x0 = rep(1, n_ind)) # run IPF
[1] 1.227998 1.227998 3.544004 1.544004 4.455996
Robinlovelace commented 6 years ago

I suggest you a) try changing the input dataset to identify the problem (it may be an issue of 'empty cells' - certain combinations of individual characteristics may be missing) b) try running IPF manually as illustrated above or c) try another package for reweighting such as mipfp. Hope that helps. I'm confident that some combination of the above suggestions will solve the problem. Let me know how you get on. Any further feedback on the book welcome!

PJPJPJPJPJ commented 6 years ago

Thank you very much, Dr. Lovelace. I am still trying out the suggestions and codes you gave. I will update with you with the progress.