famuvie / breedR

Statistical methods for forest genetic resources analysts
http://famuvie.github.io/breedR/
GNU General Public License v3.0
31 stars 24 forks source link

real zeros unadvertedly treated as NAs #51

Closed famuvie closed 8 years ago

famuvie commented 9 years ago

The underlying FORTRAN libraries need to code NAs as a number. This number is zero by default.

breedR recodes NAs internally as 0 before running REML. Before so doing, it checks whether there are real zeros that would be confused as missing values. In that case it stops with an error message.

However, if there are real zeros but no NAs to recode, breedR does not warns the user, nor changes the code for missing values.

Here is a demonstration of the problem:

summary(remlf90(y~x, data = data.frame(y = c(1, 0), x = 1:2)))
summary(remlf90(y~x, data = data.frame(y = c(1, 0.001), x = 1:2)))

The results of both should be similar. But the first one fits the data only with the first observation: (1, 1) giving as a result the $y = x$ regression line. The second, actually uses both observations, resulting in the $y = 2 - x$ regression line.

famuvie commented 9 years ago

On the other hand, this seems not to be a problem for covariates:

summary(remlf90(y~x, data = data.frame(y = c(1, 1), x = c(0, 2))))
summary(remlf90(y~x, data = data.frame(y = c(1, 1), x = c(0.1, 2))))

both return the $y = 1$ regression line.

while

summary(remlf90(y~x, data = data.frame(y = c(1), x = c(2))))

returns $y = x / 2$.

famuvie commented 9 years ago

Moreover, NAs in the covariates are being passed directly to FORTRAN, without conversion. This causes bizarre results, but runs. Which is dangerous. Should we forbid missing values in covariates?