IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Prepare > Data Frame > Convert gives an odd result #7885

Open rdstern opened 2 years ago

rdstern commented 2 years ago

I hope @dannyparsons could suggest what to do about this result, but I expect it could become a puzzle for @lilyclements. Here is a data frame called abundant (attached as abundant.zip).

It has character variables that need to be made numeric: [image]

In the Convert dialogue I put the variables split to split22: [image]

The default for making them numeric is shown above. I get the correct answer by using the Simple Convert option. If I leave the default in place, I get the seemingly odd result as follows: [image]

This is the labelled convert, but not a set of data where I would usually want labels. The results are fine up to split4, but then the blanks caused by the split dialogue produce the odd labelled results from then on.

What should we do? This is not a dialogue I have ever liked, but am not sure what should be done - at least as a default for this sort of situation?

We could make the Simple Convert the default when converting to numeric. However, the Labelled Convert then becomes invisible, so you might not know it exists. And do we need those factor options in the dialogue when converting from Character to Numeric?

I would not have noticed this problem if I could have used the right-click Convert to numeric more easily. It works fine, because I can do a Normal Convert, which is the same as the Simple Convert in the dialogue. Should we call it the same in both? And in the right-click option, could there be a new addition to the Normal Convert and Labelled Convert buttons, perhaps just Apply All? Or maybe it could be a checkbox. If a checkbox, then it should probably not remember its state, but be unchecked each time.

lilyclements commented 1 year ago

@rdstern this is coming from the sjlabelled::as_numeric function.

To explain the issue: "8" becomes 28 because it is the 28th "level" as far as R is concerned. When R builds factor levels from text, it does not order them in what we see as the normal numeric order; it sorts them as strings, character by character, so the first digit is compared first.

E.g. 1, 10, 100283, 2, 3, 4, 4444, 5, etc., rather than 1, 2, 3, 4, 5, 10, 4444, 100283.
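The ordering described above can be reproduced in base R. This is only a small sketch: the vector `vals` is made up to match the example values, and it is not taken from the abundant data.

```r
# Hypothetical values, chosen to match the example ordering above
vals <- c("1", "2", "3", "4", "5", "10", "4444", "100283")

# factor() sorts the unique values as strings, not as numbers
f <- factor(vals)
levels(f)

# as.integer() on a factor returns the position of each value in levels(f),
# not the number the label looks like
codes <- as.integer(f)
data.frame(value = vals, level_position = codes)
```

The `level_position` column shows how a value such as "5" ends up with a code that reflects its place in the sorted levels rather than its numeric value.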

However, this is not usually what we would expect to happen in R, because these are character variables, not factor variables.

What I suspect is happening here is that our variable is converted into a factor before it is converted to a numeric variable. I think this is due to the missing values.

# E.g., if we have a character vector with no missing values, then sjlabelled::as_numeric converts it fine,
# but `as.numeric(as.factor(` is not so nice, since it replaces each value with its position in the level order

a <- c("2", "1", "11", "200")
#[1] "2"   "1"   "11"  "200"

as.numeric(a)
#[1]   2   1  11 200

as.numeric(as.factor(a))
#[1] 3 1 2 4

sjlabelled::as_numeric(a, keep.labels = TRUE)
#[1]   2   1  11 200

# E.g., if we have a character vector with missing (blank) values, then sjlabelled::as_numeric converts it as if it were a factor,
# i.e., it replaces each value with its position in the level order

d <- c("2", "1", "11", "200", "")
as.numeric(d)
as.numeric(as.factor(d))
sjlabelled::as_numeric(d, keep.labels = TRUE)
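As a comparison point, here is a minimal base R sketch that treats blanks as missing before converting, so the level order never comes into play. The helper name `blank_to_numeric` is hypothetical and is not part of sjlabelled or R-Instat; it also drops any labels, so it corresponds to a simple convert rather than a labelled one.

```r
# Hypothetical helper: convert text to numeric, treating blanks as NA
blank_to_numeric <- function(x) {
  x <- trimws(as.character(x))   # as.character() also recovers the labels of a factor
  x[x %in% ""] <- NA_character_  # blanks become missing values
  as.numeric(x)
}

d <- c("2", "1", "11", "200", "")

from_character <- blank_to_numeric(d)
from_factor <- blank_to_numeric(as.factor(d))

from_character
from_factor
```

Both calls give the values 2, 1, 11, 200 and NA, because the conversion goes through the text of each value rather than through its level position.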

One workaround would be to replace your blank values with the numeric value you would want to give them, for example 0 or -99. That then works:

x <- readRDS("C:/Users/lclem/Downloads/abundant/abundant.RDS")
x$split5 <- ifelse(x$split5 == "", 0, x$split5)
sjlabelled::as_numeric(x$split5, keep.labels = TRUE)
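The same replacement could be applied to several columns at once. The sketch below uses a small made-up data frame rather than the real abundant data, and it uses base `as.numeric` in place of `sjlabelled::as_numeric`, so no labels are kept. The column names and the fill value of 0 are assumptions for illustration.

```r
# Made-up stand-in for the abundant data frame
x <- data.frame(
  split4 = c("3", "8", "12"),
  split5 = c("8", "", "10"),
  stringsAsFactors = FALSE
)

cols <- c("split4", "split5")  # columns to convert (assumed names)
fill <- 0                      # value to use in place of blanks (assumed)

x[cols] <- lapply(x[cols], function(col) {
  col[col %in% ""] <- as.character(fill)  # replace blanks before converting
  as.numeric(col)
})

str(x)
```

Whether 0, -99 or NA is the right replacement depends on what a blank means in the data, so the fill value is kept as a separate variable here.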