Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.58k stars 976 forks source link

issue with melt.data.table and patterns #1739

Closed steffenheyne closed 5 months ago

steffenheyne commented 8 years ago

Hi,

I stumbled upon this behavior of melt.data.table:

tmp=data.frame(val_a = sample(x=10,replace=F),val_b = sample(x=10,replace=F),val_c = sample(x=10,replace=F))

tmp$rank_b=rank(tmp$val_b)
tmp$rank_a=rank(tmp$val_a)
tmp$rank_c=rank(tmp$val_c)
str(tmp)

tmpT = melt(data.table(tmp),measure.vars=patterns("val_","rank_"),variable.factor=T,value.name=c("myValue","rank"))
str(tmpT)
ggplot(tmpT,aes(x=rank,y=myValue)) + geom_line(size=2) + facet_wrap(~variable) +theme_bw()

test1

whereas I was expecting this plot:

tmp=data.frame(val_a = sample(x=10,replace=F),val_b = sample(x=10,replace=F),val_c = sample(x=10,replace=F))

tmp$rank_a=rank(tmp$val_a)
tmp$rank_b=rank(tmp$val_b)
tmp$rank_c=rank(tmp$val_c)
str(tmp)

tmpT = melt(data.table(tmp),measure.vars=patterns("val_","rank_"),variable.factor=T,value.name=c("myValue","rank"))
str(tmpT)
ggplot(tmpT,aes(x=rank,y=myValue)) + geom_line(size=2) + facet_wrap(~variable) +theme_bw()

test

the reason is the different order of the first two rank columns (rank_b/rank_a/rank_c vs. rank_a/rank_b/rank_c).

melt probably runs over the different patterns and puts the result together and hence the order of the rank column does not anymore match the order of the value column.

I was especially confused why this happens due to the new factor "variable" that seemed correct as it had 3 levels.

Maybe this simply need a simple mention in the documentation?

Would it be possible to get the factor levels correct and "joined" in the correct way?

Thanks for data.table! steffen

arunsrinivasan commented 8 years ago

It'd be nice to have this feature, but I'm not sure how to do it with what is provided to patterns().

pstoyanov commented 8 years ago

I was about to post the same... The way it currently works is unfortunately useless in my case, which I believe is pretty common-place (Eurostat/OECD/etc. data in wide format, with several variables + suffixes denoting time periods, sometimes even multiple frequencies like both monthly and quarterly).

Fortunately it looks fairly easy to get a crude and partial solution (but still useful at least in my case) at the expense of killing some flexibility.

The way it currently works (fmelt.R), is to take whatever is given as patterns(), pass it through grep on the names of the data.table, with grep's default argument value = FALSE. This returns the positions of the matches in the name list. These positions are later passed to the C code to do the actual work. The beauty is that one can put regular expressions in patterns().

The idea implemented in the code below is to sacrifice some flexibility by restricting the search to "fixed" patterns (grep(fixed = TRUE)). This allows to separate the variable names into a stub (the parts provided in patterns()) and a suffix (which in my case would be time; and c(a,b,c) in the OP example). I am also assuming the user cares not about column positions, but wants the results sorted by the suffixes. Given these, it is fairly easy to extract both names & positions in two calls to grep then separate out the suffix and use it to sort the variable positions before passing them to the C code.

I currently have no idea whether a similar approach will work for true regular expressions as variable patterns.

A note on factor levels -- in the current version, the levels of the factor (or the char strings) are just numbers (the positions) if more than one variable is being melted (and with proper levels if only one). I have replaced the levels with the sorted list of suffixes but it also would affect the single-variable case. How do we properly check whether we're melting one or more variables? Needs to be fixed before actual use!

See the sample code below -- it replaces the whole file fmelt.R, and I've marked my additions. It is intended as proof-of-concept, and can likely be improved a lot.

Thanks for making our lives easier with data.table :)

# Add melt generic, don't import reshape2 as it requires R >= 3.0.0.
melt <- function(data, ..., na.rm = FALSE, value.name = "value") {
    if (is.data.table(data))
        UseMethod("melt", data)
    else
        reshape2::melt(data, ..., na.rm=na.rm, value.name=value.name)
}

patterns <- function(...) {
    p = unlist(list(...), use.names = FALSE)
}

melt.data.table <- function(data, id.vars, measure.vars, variable.name = "variable",
                            value.name = "value", ..., na.rm = FALSE, variable.factor = TRUE, value.factor = FALSE,
                            verbose = getOption("datatable.verbose")) {

    if (!is.data.table(data)) stop("'data' must be a data.table")
    if (missing(id.vars)) id.vars=NULL
    if (missing(measure.vars)) measure.vars = NULL

    measure.sub = substitute(measure.vars)

    if (is.call(measure.sub) && measure.sub[[1L]] == "patterns") {

## >>>>>>>>>>>>>>>>>>>> New cose starts here
        # Basic idea -- take both names and positions of the variables that match in patterns().
        # Then remove the stubs (the parts which are given in patterns()), using grep() as before.
        # Assume that whatever remains after the stub is what the user cares about -- e.g. in X1999, X2000, Y2000, Y1999,
        # and patterns("X", "Y), the user would care about properly separating and sorting 1999, 2000, rather than the variable positions.
        # I call these "suffix_" below (e.g. in X1999, "X" is the stub, "1999" is the suffix). Better terminology will be appreciated.

        # !!! NB: Note the addition of "fixed = TRUE" in the call below -- these will be passed on to grep()
        #         This effectively kills using regular expressions in patterns(), i.e. forced a fixed match.
        #         It remains to be seen whether this approach will work with a regular expression in patterns()...
        measure.vars.names <- data.table(simplify2array(x = lapply(eval(measure.sub), grep, names(data), value = TRUE, fixed = TRUE), higher = FALSE))
        measure.vars.positions <- data.table(simplify2array(x = lapply(eval(measure.sub), grep, names(data), value = FALSE, fixed = TRUE), higher = FALSE))

        setnames(measure.vars.names, measure.vars)
        setnames(measure.vars.positions, paste0("pos_", measure.vars))

        measure.vars.DT <- cbind(measure.vars.names, measure.vars.positions)

        # Remove the stubs to create a sort of an index -- it should be the same for all variables in patterns()
        # It is later used to sort the column numbers of the respective variable
        for(i in measure.vars) {
            measure.vars.DT[, paste0("suffix_", i) := gsub(pattern = i, replacement = "", x = measure.vars.DT[[i]], fixed = TRUE)]
        }

        measure.vars.POS <- NULL
        for(i in measure.vars) {
            setkeyv(measure.vars.DT, paste0("suffix_", i))
            measure.vars.POS <- cbind(measure.vars.POS, measure.vars.DT[[paste0("pos_", i)]])
        }

        # set the levels of the "variable" variable. Use the column that is already sorted, i.e. the key
        variable.factor.levels <- measure.vars.DT[[key(measure.vars.DT)]]

        # There must be a better way to populate the list...?
        measure.vars <- NULL
        for(i in 1:ncol(measure.vars.POS)) {
            measure.vars[[i]] <- measure.vars.POS[, i]
        }

## END of new code <<<<<<<<<<<<<<<<<<<<
    }
    if (is.list(measure.vars) && length(measure.vars) > 1L) {
        if (length(value.name) == 1L)
            value.name = paste(value.name, seq_along(measure.vars), sep="")
    }

    ans <- .Call("Cfmelt", data, id.vars, measure.vars,
                 as.logical(variable.factor), as.logical(value.factor),
                 variable.name, value.name, as.logical(na.rm),
                 as.logical(verbose))

    setDT(ans)
    if (any(duplicated(names(ans)))) {
        cat("Duplicate column names found in molten data.table. Setting unique names using 'make.names'\n")
        setnames(ans, make.unique(names(ans)))
    }
    setattr(ans, 'sorted', NULL)

## >>>>>>>>>>>>>>>>>>>> New code starts here
    if(variable.factor) levels(ans[[variable.name]]) <- variable.factor.levels
## END of new code <<<<<<<<<<<<<<<<<<<<

    ans
}
UweBlock commented 7 years ago

I came across this issue just by chance. I would like to add an observation and to suggest a simple solution.

Column order and base R reshape()

It seems that column order is the key. Even reshape() from base R has a similar "issue"

set.seed(123L)
tmp <- data.frame(
  val_a = sample(x = 10, replace = FALSE),
  val_b = sample(x = 10, replace = FALSE),
  val_c = sample(x = 10, replace = FALSE)
)

tmp$rank_b <- rank(tmp$val_b)
tmp$rank_a <- rank(tmp$val_a)
tmp$rank_c <- rank(tmp$val_c)

tmpR <- reshape(tmp, dir = "long", varying = names(tmp), sep = "_")
library(ggplot2)
ggplot(tmpR, aes(x = rank, y = val)) + geom_line(size = 2) + 
  facet_wrap(~ time) + theme_bw()

reshape_unsorted

Apparently, also reshape() is depending on the order of columns as specified by the varying parameter despite it returns the time variable.

By sorting the columns in the varying parameter, we get the expected result:

tmpR <- reshape(tmp, dir = "long", varying = sort(names(tmp)), sep = "_")
ggplot(tmpR,aes(x=rank,y=val)) + geom_line(size=2) + facet_wrap(~time) +theme_bw()

reshape_sorted

Solution / workaround for data.table

For melt.data.table() it is not sufficient to specify measure.vars in the proper order but the columns of the data.table to be melted can be reordered instead:

library(data.table)   # CRAN version 1.10.4 used
setcolorder(tmp, sort(names(tmp)))
tmpT = melt(
  setDT(tmp),
  measure.vars = patterns("val_", "rank_"),
  variable.factor = TRUE,
  value.name = c("myValue", "rank")
)
ggplot(tmpT,aes(x=rank,y=myValue)) + geom_line(size=2) + facet_wrap(~variable) +theme_bw()

melt_sorted

tdhock commented 4 years ago

hi a solution to this would be to use nc::capture_melt_multiple (which uses melt.data.table internally) https://github.com/tdhock/nc

nc::capture_melt_multiple(data.table(tmp), column=".*", "_", letter="[a-z]")

(the first argument is the data table to melt, and the other arguments define a regex; matching column names are melted; values in the "column" group are used for the output column names aka value.name)

tdhock commented 3 years ago

hi there seem to be several work-arounds so I'm closing this issue.

tmp=data.frame(val_a = sample(x=10,replace=F),val_b = sample(x=10,replace=F),val_c = sample(x=10,replace=F))
tmp$rank_b=rank(tmp$val_b)
tmp$rank_a=rank(tmp$val_a)
tmp$rank_c=rank(tmp$val_c)
nc::capture_melt_multiple(tmp, column=".*", "_", letter="[a-z]")
#>     letter rank val
#>  1:      a    1   1
#>  2:      a    7   7
#>  3:      a    9   9
#>  4:      a    5   5
#>  5:      a    2   2
#>  6:      a    3   3
#>  7:      a    8   8
#>  8:      a    4   4
#>  9:      a    6   6
#> 10:      a   10  10
#> 11:      b    6   6
#> 12:      b    9   9
#> 13:      b    4   4
#> 14:      b    1   1
#> 15:      b   10  10
#> 16:      b    2   2
#> 17:      b    3   3
#> 18:      b    5   5
#> 19:      b    8   8
#> 20:      b    7   7
#> 21:      c    8   8
#> 22:      c    6   6
#> 23:      c    9   9
#> 24:      c   10  10
#> 25:      c    3   3
#> 26:      c    5   5
#> 27:      c    1   1
#> 28:      c    7   7
#> 29:      c    4   4
#> 30:      c    2   2
#>     letter rank val
jangorecki commented 3 years ago

We need to agree that it is out of scope and workarounds or other packages are fine.

tdhock commented 3 years ago

pure-data.table solution in #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
tmp=data.table(val_a = sample(x=10,replace=F),val_b = sample(x=10,replace=F),val_c = sample(x=10,replace=F))
tmp$rank_b=rank(tmp$val_b)
tmp$rank_a=rank(tmp$val_a)
tmp$rank_c=rank(tmp$val_c)
melt(tmp, measure.vars=measure(value.name, letter))
#>     letter val rank
#>  1:      a   3    3
#>  2:      a   1    1
#>  3:      a   6    6
#>  4:      a  10   10
#>  5:      a   7    7
#>  6:      a   5    5
#>  7:      a   4    4
#>  8:      a   2    2
#>  9:      a   8    8
#> 10:      a   9    9
#> 11:      b   5    5
#> 12:      b   8    8
#> 13:      b   1    1
#> 14:      b  10   10
#> 15:      b   2    2
#> 16:      b   6    6
#> 17:      b   4    4
#> 18:      b   9    9
#> 19:      b   3    3
#> 20:      b   7    7
#> 21:      c   2    2
#> 22:      c   5    5
#> 23:      c   8    8
#> 24:      c   3    3
#> 25:      c   1    1
#> 26:      c   6    6
#> 27:      c   9    9
#> 28:      c   4    4
#> 29:      c   7    7
#> 30:      c  10   10
#>     letter val rank
tdhock commented 3 years ago

hi @steffenheyne #4731 was merged recently so can you please try the new measure() function? If that works for you, can you please close this issue?