MichaelChirico / r-bugs

A ⚠️read-only⚠️mirror of https://bugs.r-project.org/
20 stars 0 forks source link

[BUGZILLA #6902] Bug in update()? #2134

Closed MichaelChirico closed 4 years ago

MichaelChirico commented 4 years ago

From: Berwin A Turlach <berwin@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>> Dear all,

I noticed the following while playing around with fitting log-linear models to contingency tables using R 1.8.1, but the problem also exists under R 1.9.0.

A reproducible example uses the following contingency table:

library(MASS)
data(quine)
tmp <- with(quine, expand.grid(Eth=levels(Eth), Sex=levels(Sex), 

+ Lrn=levels(Lrn), Age=levels(Age)))

n <- nrow(quine)
quine2 <- with(quine,

+ data.frame(tmp, + Count=as.vector(tapply(rep(1,n), list(Eth, Sex, Lrn, Age), sum))))

First fit a saturated model and see which term we can drop:

fm <- glm(Count ~ .^4, quine2, family=poisson)
drop1(fm, test="Chisq")
Single term deletions

Model:
Count&sim; .^4
                Df  Deviance     AIC     LRT Pr(Chi)
<none>             6.661e-16 148.916                
Eth:Sex:Lrn:Age  2     0.422 145.338   0.422    0.81

Drop the four-way interaction and check which term we can drop next:

fm <- update(fm, . ~ . - Eth:Sex:Lrn:Age)
drop1(fm, test="Chisq")
Single term deletions

Model:
Count&sim; .
       Df Deviance     AIC     LRT  Pr(Chi)   
<none>      35.893 142.810                    
Eth     1   36.332 141.248   0.439 0.507811   
Sex     1   37.238 142.154   1.345 0.246237   
Lrn     1   37.392 142.308   1.499 0.220842   
Age     3   49.818 150.735  13.925 0.003009 **
&#45;&#45;&#45;
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

All of a sudden there are only the main effects left and all interaction terms have been removed. I expected to see at this point a test for all three-way interactions. Since I started of with the saturated model and used update to remove the four-way interaction, all two-way and three-way interactions should still be in the model. Or am I missing some subtlety of the "^" and "." operators in model formulae?

On the other hand, if we specify the saturated model as follows:

fm <- glm(Count ~ Lrn*Eth*Sex*Age, quine2, family=poisson)

then the above steps have the desired effect:

drop1(fm, test="Chisq")
Single term deletions

Model:
Count&sim; Lrn * Eth * Sex * Age
                Df  Deviance     AIC     LRT Pr(Chi)
<none>             4.219e-15 148.916                
Lrn:Eth:Sex:Age  2     0.422 145.338   0.422    0.81
fm <- update(fm, . ~ . - Eth:Sex:Lrn:Age)
drop1(fm, test="Chisq")
Single term deletions

Model:
Count&sim; Lrn + Eth + Sex + Age + Lrn:Eth + Lrn:Sex + Eth:Sex + 
    Lrn:Age + Eth:Age + Sex:Age + Lrn:Eth:Sex + Lrn:Eth:Age + 
    Lrn:Sex:Age + Eth:Sex:Age
            Df Deviance     AIC     LRT  Pr(Chi)   
<none>            0.422 145.338                    
Lrn:Eth:Sex  1    0.493 143.409   0.071 0.789909   
Lrn:Eth:Age  2    0.629 141.545   0.207 0.901471   
Lrn:Sex:Age  2   12.728 153.644  12.306 0.002127 **
Eth:Sex:Age  3    1.019 139.935   0.597 0.897103   
&#45;&#45;&#45;
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

The update command only removed the four-way interaction but left all the two-way and three-way interaction in the model.

I looked at update.formula to see what the problem might be, but after reading the help file of .Internal I stopped looking around.... :)

Cheers,

    Berwin

--please do not edit the information below--

Version: platform = i686-pc-linux-gnu arch = i686 os = linux-gnu system = i686, linux-gnu status = major = 1 minor = 9.0 year = 2004 month = 04 day = 12 language = R

Search Path: .GlobalEnv, package:MASS, package:methods, package:stats, package:graphics, package:utils, Autoloads, package:base


METADATA

MichaelChirico commented 4 years ago

From: Prof Brian Ripley <ripley@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>> Berwin,

It's fortuitous that this works at all.

The problem is not in update but in glm itself: take a look at fm$formula after the first fit. fm$terms has the right formula, but formula() extracts the $formula component first.

I am not currently sure what the right fix is, so will not try to fix this in R-patched/1.9.1.

Brian

On Fri, 21 May 2004 berwin@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::> wrote:

Dear all,

I noticed the following while playing around with fitting log-linear
models to contingency tables using R 1.8.1, but the problem also
exists under R 1.9.0.

A reproducible example uses the following contingency table:

> library(MASS)
> data(quine)
> tmp <- with(quine, expand.grid(Eth=levels(Eth), Sex=levels(Sex), 
+                    Lrn=levels(Lrn), Age=levels(Age)))
> n <- nrow(quine)
> quine2 <- with(quine,
+       data.frame(tmp,
+       Count=as.vector(tapply(rep(1,n), list(Eth, Sex, Lrn, Age), sum))))

First fit a saturated model and see which term we can drop:
> fm <- glm(Count ~ .^4, quine2, family=poisson)
> drop1(fm, test="Chisq")
Single term deletions

Model:
Count ~ .^4
Df  Deviance     AIC     LRT Pr(Chi)
<none>             6.661e-16 148.916                
Eth:Sex:Lrn:Age  2     0.422 145.338   0.422    0.81

Drop the four-way interaction and check which term we can drop next:
> fm <- update(fm, . ~ . - Eth:Sex:Lrn:Age)
> drop1(fm, test="Chisq")
Single term deletions

Model:
Count ~ .
Df Deviance     AIC     LRT  Pr(Chi)   
<none>      35.893 142.810                    
Eth     1   36.332 141.248   0.439 0.507811   
Sex     1   37.238 142.154   1.345 0.246237   
Lrn     1   37.392 142.308   1.499 0.220842   
Age     3   49.818 150.735  13.925 0.003009 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

All of a sudden there are only the main effects left and all
interaction terms have been removed.  I expected to see at this point
a test for all three-way interactions.  Since I started of with the
saturated model and used update to remove the four-way interaction,
all two-way and three-way interactions should still be in the model.
Or am I missing some subtlety of the "^" and "." operators in model
formulae? 

On the other hand, if we specify the saturated model as follows:

> fm <- glm(Count ~ Lrn*Eth*Sex*Age, quine2, family=poisson)

then the above steps have the desired effect:

> drop1(fm, test="Chisq")
Single term deletions

Model:
Count ~ Lrn * Eth * Sex * Age
Df  Deviance     AIC     LRT Pr(Chi)
<none>             4.219e-15 148.916                
Lrn:Eth:Sex:Age  2     0.422 145.338   0.422    0.81
> fm <- update(fm, . ~ . - Eth:Sex:Lrn:Age)
> drop1(fm, test="Chisq")
Single term deletions

Model:
Count ~ Lrn + Eth + Sex + Age + Lrn:Eth + Lrn:Sex + Eth:Sex + 
Lrn:Age + Eth:Age + Sex:Age + Lrn:Eth:Sex + Lrn:Eth:Age + 
Lrn:Sex:Age + Eth:Sex:Age
Df Deviance     AIC     LRT  Pr(Chi)   
<none>            0.422 145.338                    
Lrn:Eth:Sex  1    0.493 143.409   0.071 0.789909   
Lrn:Eth:Age  2    0.629 141.545   0.207 0.901471   
Lrn:Sex:Age  2   12.728 153.644  12.306 0.002127 **
Eth:Sex:Age  3    1.019 139.935   0.597 0.897103   
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

The update command only removed the four-way interaction but left all
the two-way and three-way interaction in the model.  

I looked at update.formula to see what the problem might be, but after
reading the help file of .Internal I stopped looking around.... :)

Cheers,

Berwin

--please do not edit the information below--

Version:
platform = i686-pc-linux-gnu
arch = i686
os = linux-gnu
system = i686, linux-gnu
status = 
major = 1
minor = 9.0
year = 2004
month = 04
day = 12
language = R

Search Path:
.GlobalEnv, package:MASS, package:methods, package:stats, package:graphics,
package:utils, Autoloads, package:base

______________________________________________
R-devel@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::> mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-devel

-- Brian D. Ripley, ripley@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ <CENSORING FROM DETECTED PHONE NUMBER ONWARDS; SEE BUGZILLA>


METADATA

MichaelChirico commented 4 years ago

NOTES: Changed (for lm and glm) in 2.0.0-to-be


METADATA

MichaelChirico commented 4 years ago

Audit (from Jitterbug): Mon May 24 21:54:16 2004 ripley changed notes Mon May 24 21:54:16 2004 ripley moved from incoming to Models-fixed


METADATA

MichaelChirico commented 4 years ago

From: Berwin A Turlach <berwin@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>> Dear Brian,

>>>> "BDR" == Prof Brian Ripley <ripley@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>> writes:
BDR> The problem is not in update but in glm itself: take a look
BDR> at fm$formula after the first fit.  fm$terms has the right
BDR> formula, but formula() extracts the $formula component first.

O.k., I see what you mean.

BDR> I am not currently sure what the right fix is, so will not
BDR> try to fix this in R-patched/1.9.1.

This may be naive, but I noticed the following at the end of the glm function (starting line 76, in src/library/stats/R/glm.R):

fit <- c(fit, list(call = call, formula = formula,
           terms = mt, data = data,
           offset = offset, control = control, method = method,
           contrasts = attr(X, "contrasts"),
                   xlevels = .getXlevels(mt, mf)))

That is, the component "formula" in the object that is returned is explicitly set to the "formula" argument with which glm was called. The function "lm" does not do this. And if I change this `line' to

fit <- c(fit, list(call = call, 
           terms = mt, data = data,
           offset = offset, control = control, method = method,
           contrasts = attr(X, "contrasts"),
                   xlevels = .getXlevels(mt, mf)))

then fm$formula for the first fit in my example looks o.k.

But I don't know if this will break anything else.

Cheers,

    Berwin

PS: I noticed that both `glm' and 'lm' now do essentially the following manipulations early on:

mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action", 
    "offset"), names(mf), 0)
mf <- mf[c(1, m)]
mf$drop.unused.levels <- TRUE
mf[[1]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())

Is this the preferred idiom to be used in R 1.9.x and later for modelling functions?


METADATA

MichaelChirico commented 4 years ago

From: Prof Brian Ripley <ripley@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>> On Wed, 26 May 2004, Berwin A Turlach wrote:

Dear Brian,

>>>>> "BDR" == Prof Brian Ripley <ripley@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>> writes:

BDR> The problem is not in update but in glm itself: take a look
BDR> at fm$formula after the first fit.  fm$terms has the right
BDR> formula, but formula() extracts the $formula component first.
O.k., I see what you mean.

BDR> I am not currently sure what the right fix is, so will not
BDR> try to fix this in R-patched/1.9.1.
This may be naive, but I noticed the following at the end of the glm
function (starting line 76, in src/library/stats/R/glm.R):

fit <- c(fit, list(call = call, formula = formula,
terms = mt, data = data,
offset = offset, control = control, method = method,
contrasts = attr(X, "contrasts"),
xlevels = .getXlevels(mt, mf)))

That is, the component "formula" in the object that is returned is
explicitly set to the "formula" argument with which glm was called.
The function "lm" does not do this.  And if I change this `line' to

fit <- c(fit, list(call = call, 
terms = mt, data = data,
offset = offset, control = control, method = method,
contrasts = attr(X, "contrasts"),
xlevels = .getXlevels(mt, mf)))

then fm$formula for the first fit in my example looks o.k.  

You won't have a formula component at all (lm objects do not). What I have done is alter the formula methods to pick up the formula from the terms component but the environment from the formula componet (if present).


But I don't know if this will break anything else.

Cheers,

Berwin

PS:  I noticed that both `glm' and 'lm' now do essentially the
following manipulations early on:

mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action", 
"offset"), names(mf), 0)
mf <- mf[c(1, m)]
mf$drop.unused.levels <- TRUE
mf[[1]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())

Is this the preferred idiom to be used in R 1.9.x and later for
modelling functions?

Rather than removing components, by setting to NULL, yes.

-- Brian D. Ripley, ripley@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ <CENSORING FROM DETECTED PHONE NUMBER ONWARDS; SEE BUGZILLA>


METADATA