L1 of continuous variables seem to be incorrect?

mkiang commented 8 years ago

When running the following sample code on simulated data:

n_trt = 1000
n_untrt = 1000
prob_trt_male = .75
prob_untrt_male = .15
n_trt_males = sum(rbinom(n_trt, 1, prob_trt_male))
n_trt_females = n_trt - n_trt_males
n_untrt_males = sum(rbinom(n_trt, 1, prob_untrt_male))
n_untrt_females = n_untrt - n_untrt_males

fake <- data.frame(trt = c(rep(0, n_untrt), rep(1, n_trt)), 
                   sex = c(rep(0, n_untrt_females), rep(1, n_untrt_males), 
                           rep(0, n_trt_females), rep(1, n_trt_males)))
fake$old <- NA
for (i in seq_along(fake$trt)) {
    if (fake$trt[i] + fake$sex[i] == 2){
        fake$old[i] <- rbinom(1, 1, .1)
        fake$bp[i] <- rnorm(1, 100, 20)
    } else if (fake$trt[i] + fake$sex[i] == 0) {
        fake$old[i] <- rbinom(1, 1, .9)
        fake$bp[i] <- rnorm(1, 280, 50)
    } else {
        fake$old[i] <- rbinom(1, 1, .5)
        fake$bp[i] <- rnorm(1, 175, 8)
    }
}

imbalance(fake$trt, fake, drop = "trt")

Returns this output:

Multivariate Imbalance Measure: L1=0.954
Percentage of local common support: LCS=13.0%

Univariate Imbalance Measures:

    statistic   type    L1      min     25%      50%      75%      max
sex   -0.5850 (diff) 0.585  0.00000   0.000  -1.0000  -1.0000   0.0000
old    0.6410 (diff) 0.641  0.00000   1.000   1.0000   1.0000   0.0000
bp   143.4229 (diff) 0.000 85.02359 121.272 158.6741 153.2892 233.4842

The L1 of bp is 0.000, which seems impossible. I can replicate it with almost any continuous variable in R, but when running this in Stata, the L1 (both univariate and multivariate) seems appropriately calculated.

siacus commented 8 years ago

hi Mathew, I'll look into it. Can you please join the output of sessionInfo() ? thanks Stefano

Sent from my StrawBerry®

On 05 nov 2015, at 21:48, Mathew Kiang notifications@github.com wrote:

When running the following sample code on simulated data:

n_trt = 1000 n_untrt = 1000 prob_trt_male = .75 prob_untrt_male = .15 n_trt_males = sum(rbinom(n_trt, 1, prob_trt_male)) n_trt_females = n_trt - n_trt_males n_untrt_males = sum(rbinom(n_trt, 1, prob_untrt_male)) n_untrt_females = n_untrt - n_untrt_males

fake <- data.frame(trt = c(rep(0, n_untrt), rep(1, n_trt)), sex = c(rep(0, n_untrt_females), rep(1, n_untrt_males), rep(0, n_trt_females), rep(1, n_trt_males))) fake$old <- NA for (i in seq_along(fake$trt)) { if (fake$trt[i] + fake$sex[i] == 2){ fake$old[i] <- rbinom(1, 1, .1) fake$bp[i] <- rnorm(1, 100, 20) } else if (fake$trt[i] + fake$sex[i] == 0) { fake$old[i] <- rbinom(1, 1, .9) fake$bp[i] <- rnorm(1, 280, 50) } else { fake$old[i] <- rbinom(1, 1, .5) fake$bp[i] <- rnorm(1, 175, 8) } }

imbalance(fake$trt, fake, drop = "trt") Returns this output:

Multivariate Imbalance Measure: L1=0.954 Percentage of local common support: LCS=13.0%

Univariate Imbalance Measures:
statistic   type    L1      min     25%      50%      75%      max
sex -0.5850 (diff) 0.585 0.00000 0.000 -1.0000 -1.0000 0.0000 old 0.6410 (diff) 0.641 0.00000 1.000 1.0000 1.0000 0.0000 bp 143.4229 (diff) 0.000 85.02359 121.272 158.6741 153.2892 233.4842 The L1 of bp is 0.000, which seems impossible. I can replicate it with almost any continuous variable in R, but when running this in Stata, the L1 (both univariate and multivariate) seems appropriately calculated.

— Reply to this email directly or view it on GitHub.

mkiang commented 8 years ago

Sure. Thanks for prompt response:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cem_1.1.17      lattice_0.20-33

loaded via a namespace (and not attached):
[1] MASS_7.3-44         combinat_0.0-8      tools_3.2.2         nlme_3.1-122        grid_3.2.2         
[6] randomForest_4.6-12 MatchIt_2.4-21

mkiang commented 8 years ago

Perhaps this should be a completely separate issue, but it also appears in Stata 13 that imbalance performs complete case analysis and thus results in a different L1 than R if there is any missing data.

mkiang commented 8 years ago

Hi @siacus, just wondering if there was any movement in this area? I'm planning on taking a closer look over the next few weeks and possibly coding my own (slower, less generalized, less efficient) method — don't want to duplicate effort if you've spotted the problem.

siacus commented 8 years ago

Hi Mathew, I’m a bit late on this. I suggest you take a look at the come in CEM package and try to fix it. Then you send me the code and I’ll try to optimize and integrate in the new release of CEM with acknowledgment. What do you think? Stefano

Il giorno 26 feb 2016, alle ore 22:16, Mathew Kiang notifications@github.com ha scritto:

Hi @siacus https://github.com/siacus, just wondering if there was any movement in this area? I'm planning on taking a closer look over the next few weeks and possibly coding my own (slower, less generalized, less efficient) method — don't want to duplicate effort if you've spotted the problem.

— Reply to this email directly or view it on GitHub https://github.com/IQSS/cem/issues/1#issuecomment-189485967.

mkiang commented 8 years ago

Sure. I can give it a go but it'll take a while -- not a lot of bandwidth for the next couple weeks.

Sent from my iPhone

On Fri, Feb 26, 2016 at 6:53 PM -0800, "Stefano Iacus" notifications@github.com wrote:

Hi Mathew,

I’m a bit late on this.

I suggest you take a look at the come in CEM package and try to fix it. Then you send me the code and I’ll try to optimize and integrate in the new release of CEM with acknowledgment.

What do you think?

Stefano

Il giorno 26 feb 2016, alle ore 22:16, Mathew Kiang notifications@github.com ha scritto:

Hi @siacus https://github.com/siacus, just wondering if there was any movement in this area? I'm planning on taking a closer look over the next few weeks and possibly coding my own (slower, less generalized, less efficient) method — don't want to duplicate effort if you've spotted the problem.

—

Reply to this email directly or view it on GitHub https://github.com/IQSS/cem/issues/1#issuecomment-189485967.

— Reply to this email directly or view it on GitHub.

mkiang commented 8 years ago

After a few hours, I can't seem to find the issue here. My sense is it around the xtabs() command and continuous variables, but I don't have the bandwidth to look closer. Just going to leave it open, but noting I won't be able to investigate further.

IQSS / cem

L1 of continuous variables seem to be incorrect? #1