famuvie / breedR

Statistical methods for forest genetic resources analysts
http://famuvie.github.io/breedR/
GNU General Public License v3.0
31 stars 24 forks source link

Using character variables in a model #93

Open eityeity opened 6 years ago

eityeity commented 6 years ago

Hi everyone,

I'm a new user of breedR from Japan.

I have a question about the class of categorical variable in fixed factors.

I have 4 site data set, and the site name is character.

When I set variable 'site' as character, like d$site = as.character(d$site)

the result of res = remlf90(fixed = height ~ site , random = ~ code , data =d) has "Intercept", like

summary(res)

Formula: height ~ 0 + Intercept + site + code Data: d AIC BIC logLik 1106 1115 -551.1

Parameters of special components:

Variance components: Estimated variances S.E. code 0.3804 0.11533 Residual 0.5516 0.04156

Fixed effects: value s.e. Intercept 3.93742 0.1537 site.A -0.93131 0.1300 site.B -1.12276 0.1299 site.C -2.30688 0.1338 site.Z 0.00000 0.0000

When I set variable 'site' as factor, like 'd$site = as.factor(d$site)', the result don't have "Intercept", as follows.

Formula: height ~ 0 + site + code Data: d AIC BIC logLik 1108 1117 -552.1

Parameters of special components:

Variance components: Estimated variances S.E. code 0.3804 0.11533 Residual 0.5516 0.04156

Fixed effects: value s.e. site.A 3.0061 0.1329 site.B 2.8147 0.1243 site.C 1.6305 0.1366 site.D 3.9374 0.1537

In help of remlf90(),

... and there are no other categorical covariates in fixed. The latter condition is actually a limitation of (ai)remlf90 backends, which would in any case return an estimate for each level of the categorical covariates while returning 0 for the intercept.

Can I trust the "Intercept" estimated by charactered categorical variable in fixed effect?

Sincerely

famuvie commented 6 years ago

Hi, yes, you can trust it. It is the same model parameterised differently. Yet, I recommend using the factor specification in general. First, because it is conceptually the right way of declaring a categorical variable. Second, because it avoids the numerical degeneracy of adding one collinear parameter and setting it to zero.

Still, I will leave this issue open as a minor bug. breedR should either not run a model with a character variable, or automatically convert it to factor.

On the other hand, I don't see why one of your levels gets renamed from D to Z. I could not reproduce it:

library(breedR)
#> Loading required package: sp
N <- 1e4
dat <- 
  data.frame(
    site = gl(4, k = N/4, labels = LETTERS[1:4]),
    height = rep(10*(1:4), each = N/4) + rnorm(N)
  )

res_fac <- remlf90(
  fixed = height ~ site,
  data = dat
)
#> Using default initial variances given by default_initial_variance()
#> See ?breedR.getOption.

res_chr <- remlf90(
  fixed = height ~ site,
  data = transform(dat, site = as.character(site))
)
#> Using default initial variances given by default_initial_variance()
#> See ?breedR.getOption.

summary(res_fac)
#> Formula: height ~ 0 + site 
#>    Data: dat 
#>    AIC   BIC logLik
#>  28512 28520 -14255
#> 
#> 
#> Variance components:
#>          Estimated variances    S.E.
#> Residual               1.011 0.01431
#> 
#> Fixed effects:
#>          value   s.e.
#> site.A  9.9987 0.0201
#> site.B 20.0432 0.0201
#> site.C 30.0013 0.0201
#> site.D 39.9651 0.0201
summary(res_chr)
#> Formula: height ~ 0 + Intercept + site 
#>    Data: transform(dat, site = as.character(site)) 
#>    AIC   BIC logLik
#>  28510 28518 -14254
#> 
#> 
#> Variance components:
#>          Estimated variances    S.E.
#> Residual               1.011 0.01431
#> 
#> Fixed effects:
#>             value   s.e.
#> Intercept  0.0000 0.0000
#> site.A     9.9987 0.0201
#> site.B    20.0432 0.0201
#> site.C    30.0013 0.0201
#> site.D    39.9651 0.0201
eityeity commented 6 years ago

Hi,

Thanks. I am relieved. It will a issue on which the zero is assigned, on Intercept or on one of levels in categorical factor.

Still, I will leave this issue open as a minor bug. breedR should either not run a model with a character variable, or automatically convert it to factor.

I prefer later, because it would be similar behavior to other mixed model packages, like lme4.

On the other hand, I don't see why one of your levels gets renamed from D to Z. I could not reproduce it:

Sorry, It was my mistake. I have changed name of site in the output by hand to hide private information.