benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).
https://benbhansen-stats.github.io/propertee/

regenerating `simdata` produces different results #143

Closed josherrickson closed 9 months ago

josherrickson commented 1 year ago

Ran into this odd behavior. If I re-run data-raw/simdata.R (without making any modifications), the data changes enough that results are different.

With the currently committed version:

> coefficients(summary(damod))
              Estimate   Std. Error       t value      Pr(>|t|)
(Intercept) -0.6513555 1.420229e-08 -4.586270e+07 5.747644e-296
o_fac2       0.9946219 1.618221e-01  6.146393e+00  2.239085e-07
o_fac3       0.5769457 1.064419e-08  5.420288e+07 4.358043e-299
o_fac4       0.5010489 1.998610e-01  2.506986e+00  1.603533e-02
z._o_fac1    0.7191443           NA            NA            NA
z._o_fac2   -0.2268169 2.988840e-01 -7.588794e-01  4.520647e-01
z._o_fac4    0.2885315 1.998610e-01  1.443661e+00  1.560811e-01
Warning message:
The following subgroups do not have sufficient degrees of freedom for standard error estimates and will be returned as NA: o_fac1, o_fac3 

After re-generating the data (it was last touched last year https://github.com/benbhansen-stats/propertee/commit/f2c2a5a21eb36344215495b81c5655950101c258):

> coefficients(summary(damod))
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -0.6513555        NaN        NaN          NaN
o_fac2       0.9946219  0.1618221  6.1463927 2.239085e-07
o_fac3       0.5769457        NaN        NaN          NaN
o_fac4       0.5010489  0.1998610  2.5069865 1.603533e-02
z._o_fac1    0.7191443         NA         NA           NA
z._o_fac2   -0.2268169  0.2988840 -0.7588794 4.520647e-01
z._o_fac4    0.2885315  0.1998610  1.4436606 1.560811e-01
Warning messages:
1: The following subgroups do not have sufficient degrees of freedom for standard error estimates and will be returned as NA: o_fac1, o_fac3 
2: In sqrt(diag(covmat)) : NaNs produced

The two versions of simdata are not identical:

> head(simdata-simdataold)
  cid1 cid2 bid force z o dose             x             y
1    0    0   0     0 0 0    0  4.440892e-16  0.000000e+00
2    0    0   0     0 0 0    0 -2.220446e-16  1.110223e-16
3    0    0   0     0 0 0    0  4.440892e-16  2.775558e-17
4    0    0   0     0 0 0    0  0.000000e+00 -5.551115e-17
5    0    0   0     0 0 0    0 -8.673617e-19  2.081668e-17
6    0    0   0     0 0 0    0  0.000000e+00  2.220446e-16
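For scale, differences of this size are at or below machine epsilon. A minimal sketch of how such a comparison could be made, using made-up vectors rather than the real `simdata` (the names `old` and `new` are hypothetical stand-ins for one numeric column):

```r
# Hypothetical stand-ins for one numeric column of the old and new data
old <- c(0.1 + 0.2, 1/3, 2)
new <- c(0.3,       1/3, 2)

# Exact, bitwise comparison: fails on any floating-point discrepancy
identical(old, new)

# Comparison within a tolerance (default is about 1.5e-8)
isTRUE(all.equal(old, new))

# Largest absolute difference, expressed in units of machine epsilon
max(abs(new - old)) / .Machine$double.eps
```

If the last value is around 1 or smaller, the two versions differ only by floating-point noise, not by anything like a change in the RNG stream.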

I wonder if this is connected to #136. I looked briefly through R's NEWS but didn't see anything obvious related to a change in RNG.

I think this raises two issues:

  1. I should be rounding simdata to avoid this issue entirely. This may require some (many?) tweaks to tests.
  2. Why are we getting such different results with such a minor change in the input? I assume it's some sort of numerical precision + singularity issue; one version of simdata is slightly askew of the other such that it's not perfectly singular.
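
A rough sketch of what the rounding in item 1 might look like. The helper `round_numeric()` and the choice of 8 digits are assumptions for illustration, not anything in `data-raw/simdata.R`:

```r
# Hypothetical helper: round every numeric column to a fixed number of digits
round_numeric <- function(df, digits = 8) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], round, digits = digits)
  df
}

# Two toy data frames that differ only by floating-point noise
a <- data.frame(id = 1:2, x = c(0.1 + 0.2, 1/3))
b <- data.frame(id = 1:2, x = c(0.3, 1/3))

identical(a, b)                                # noise makes these differ
identical(round_numeric(a), round_numeric(b))  # rounding removes the noise
```

Rounding is not a complete guarantee: two values that straddle a rounding boundary can still round differently, so the number of digits should be well above the noise level but below the precision the tests rely on.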

I'm happy to start addressing 1., @jwasserman2 can you look into 2.? I think this is also connected to the most recent discussion on #119 (or vice-versa, #119's issue is related to this.)

jwasserman2 commented 1 year ago

The issue specifically in the summary call is that since it takes the square root of the diagonals of the covariance matrix, it'll return NaN if an element is negative, even if it's essentially 0. I don't know what people typically round to and at what stage of computations they round things, but some sort of rounding would help the situation.
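
A small illustration of that failure mode, with one possible guard. The matrix `covmat` and the tolerance `tol` are made up for the example; this is a sketch of the idea, not how the package handles it:

```r
# Hypothetical 2x2 covariance matrix whose first diagonal entry is a tiny
# negative number of the kind floating-point error can produce
covmat <- matrix(c(-1e-17, 0, 0, 0.04), nrow = 2)

# The square root of a negative diagonal is NaN, however close to zero it is
suppressWarnings(sqrt(diag(covmat)))

# One possible guard: treat negative diagonals within a small tolerance of
# zero as exactly zero before taking the square root
tol <- sqrt(.Machine$double.eps)
d <- diag(covmat)
d[d < 0 & abs(d) < tol] <- 0
sqrt(d)
```

Diagonals that are negative by more than the tolerance are left alone, so they would still surface as NaN and not be silently hidden.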

For me, the data that's currently committed to the repo gives me NaN's. What are the steps I need to take to get the version of simdata that won't give NA's?

josherrickson commented 1 year ago

The data currently checked in was giving me NA but not NaN.

Once I re-ran "data-raw/simdata.R" (without making any changes to the file), and then ran data(simdata), I get the version that produces both NA and NaN.

Remind me what computer you're using?

josherrickson commented 1 year ago

Here's my machine precision. What differs on yours?

> .Machine
$double.eps
[1] 2.220446e-16

$double.neg.eps
[1] 1.110223e-16

$double.xmin
[1] 2.225074e-308

$double.xmax
[1] 1.797693e+308

$double.base
[1] 2

$double.digits
[1] 53

$double.rounding
[1] 5

$double.guard
[1] 0

$double.ulp.digits
[1] -52

$double.neg.ulp.digits
[1] -53

$double.exponent
[1] 11

$double.min.exp
[1] -1022

$double.max.exp
[1] 1024

$integer.max
[1] 2147483647

$sizeof.long
[1] 8

$sizeof.longlong
[1] 8

$sizeof.longdouble
[1] 8

$sizeof.pointer
[1] 8

$sizeof.time_t
[1] 8

jwasserman2 commented 9 months ago

Address NaN's by converting any slightly negative vcovDA() diagonals to 0

jwasserman2 commented 9 months ago

^didn't do this; instead made .check_df_moderator_estimates do a better job of catching SE estimates without sufficient degrees of freedom