Blocks of variables - Githubissues

aghaynes commented 4 years ago

Hi @arminstroebel

I wanted to implement something to allow blocks of variables to be indented and wanted to check the best place to do that with you...

For instance,

data(mtcars)
atable(mtcars, target_cols = c("mpg", "cyl", "disp", "hp", "am", "gear", "carb"), format_to = "console")
Group                value     
 Observations                   
                      32        
 mpg                            
      Mean (SD)       20 (6)    
      valid (missing) 32 (0)    
 cyl                            
      Mean (SD)       6.2 (1.8) 
      valid (missing) 32 (0)    
 disp                           
      Mean (SD)       231 (124) 
      valid (missing) 32 (0)    
 hp                             
      Mean (SD)       147 (69)  
      valid (missing) 32 (0)    
 am                             
      Mean (SD)       0.41 (0.5)
      valid (missing) 32 (0)    
 gear                           
      Mean (SD)       3.7 (0.74)
      valid (missing) 32 (0)    
 carb                           
      Mean (SD)       2.8 (1.6) 
      valid (missing) 32 (0)

But cyl, disp and mpg are related to the engine and am and gear are related to the gearbox, so I'd like to group them together

atable(mtcars, target_cols = c("mpg", "cyl", "disp", "hp", "am", "gear", "carb"), 
  format_to = "console", blocks = list("Engine" = c("cyl", "disp", "hp"), "Gearbox" = c("am", "gear")))
Group                       value     
 Observations                   
                           32        
 mpg                            
      Mean (SD)            20 (6)    
      valid (missing)      32 (0)    
 Engine
      cyl                            
           Mean (SD)       6.2 (1.8) 
           valid (missing) 32 (0)    
      disp                           
           Mean (SD)       231 (124) 
           valid (missing) 32 (0)    
      hp                             
           Mean (SD)       147 (69)  
           valid (missing) 32 (0)    
 Gearbox 
      am                             
           Mean (SD)       0.41 (0.5)
           valid (missing) 32 (0)    
      gear                           
           Mean (SD)       3.7 (0.74)
           valid (missing) 32 (0)    
 carb                           
      Mean (SD)            2.8 (1.6) 
      valid (missing)      32 (0)

It looks to me like the arrange_statistics and arrange_statistics_and_tests are probably the places to do it - other places would probably result in problems merging the different parts together... Any thoughts? Thanks! Alan

arminstroebel commented 4 years ago

I am considering this...

The functions concerned are atable:::arrange_statistics_and_tests, which calls atable:::replace_consecutive and atable:::indent_data_frame. These are old functions, older than time itself and they are cursed, hence i did not export them. They haunt me in my sleepless nights. Refactoring them is like opening pandoras box, unleashing inexpressible evil to the world. But i documented replace_consecutive and indent_data_frame. in the their .R-files.

The indentation is currently defined implicitely by target_cols and split_cols. The aforementioned functions perhaps need additional arguments, most likely the ellipsis ...

Passing the block-information to atable as you described seems fine.

An issues perhaps will be the order of the target variables: This order is defined by the argument target_cols. Columns in the same block should be consecutive in target_cols.

Also if there are multiple split_cols: whats first: the category of the split col, of the block name?

Perhaps a dirty grepl after indent_data_frame can identifiy the rows to indent and where to add additional lines with the blockname. This grepl should be done before Units or alias are added and before translate_to_LaTeX is called, or else the block names must also be translated/aliased. There is potential confusion for grepl, when a blockname is also a name of the target_cols or other names already present in the data.frame.

Within indent_data_frame there is a function called line_adder, which adds empty lines to data.frames. Copying it (or its parts) may help.

Adding a new test to in test/testthat for this feature is important.

arminstroebel commented 4 years ago

Problem is that only some target_cols are part of a block, and not all of them. Those not in a block should not be (additionally) intended. This breaks the structure of the data.frame tab (see below)

The feature 'Blocks of variables' was not intended when I wrote atable. And now its hard implement with all the constraints of the current code, tests and other features. I think with a modification of atable:::indent_data_frame, blocking could be implemented, but I am not sure how at the moment.

The block variable must be an explicit argument of atable() and indent_data_frame(), and thus all functions in between must pass the argument down, like e.g. Alias_mapping. Default of block should be NULL, which is set in atable.data.frame().

I think, adding the blocking after atable:::indent_data_frame was called is not possible, because all variables are crunched in one coluns and its hard to retrace where to indent and where to add new lines. So a modification of atable:::indent_data_frame is needed

Here is some playaround code; I try to indent with the current version of atable:::indent_data_frame, but it does not work:

add 'broswer()' to line 357 of file helper.R. Line 357 is the first line of function arrange_helper()

target_cols = c("mpg", "cyl", "disp", "hp", "am", "gear", "carb")
split_cols = NULL
blocks = list("Gearbox" = c("am", "gear"), "Engine" = c("cyl", "disp", "hp"))

convert blocks to a data.frame to merge it later with the table

name_adder = function(x, name){data.frame(target_cols=x, block_name=name,
                                          stringsAsFactors = FALSE)}

bb = mapply(name_adder, x=blocks, name=names(blocks),
       SIMPLIFY = FALSE)

bb = do.call(rbind, bb)

bb = doBy::renameCol(bb, "target_cols", atable_options("colname_for_variable"))

bb$block_name = factor(bb$block_name, levels = names(blocks)) # for the order

atable(mtcars,
       target_cols = target_cols,
       split_cols = split_cols,
       format_to = "console")

now 'tab' exists because browser() at the start of arrange_helper() x = merge(tab, bb, all.x=TRUE)

This is a copy the line of arrange_helper() that calls indent_data_frame, but now with the blocks included

a = indent_data_frame(x, keys = c( split_cols, "block_name", atable_options("colname_for_variable"),
            "tag"), indent_character = "    ")

class(a) <- c("atable", class(a)) # set class for print.atable

a

Thats not the intended table, but a bit closer. There is an additional layer of indent for all variables, not just the blocked ones. Also: the order is mixed up; with split_cols this won't work; the Observations are most likely not part of a block.

aghaynes commented 4 years ago

Thanks for thinking about this and looking into it a bit.

An issues perhaps will be the order of the target variables: This order is defined by the argument target_cols. Columns in the same block should be consecutive in target_cols.

Here's a little function I'd written to check the blocks argument

# blocks of variables
check_blocks <- function(){
    # check that variables are in a single block
    ublock <- unlist(blocks)
    if (length(unique(ublock)) < length(ublock)) stop("variables can only be in one block")
    # check that variables are together
    lapply(blocks, function(x){
        # all variables in blocks are targets
        if (!all(x %in% target_cols)){
            y <- x[!x %in% target_cols]
            stop(y, " not in target_cols")
        }
        # consecutive variables
        w <- which(target_cols %in% x)
        y <- min(w):max(w)
        if (length(y) != length(w)) stop("block variables appear to be separated")
        if (any(w != y)) stop("block variables appear to be in a different order to target_cols")
    })
}

Also if there are multiple split_cols: whats first: the category of the split col, of the block name?

I guess the block would be inner most as it would be in each level of the split... I've never needed the split argument though, so I've never really thought about it...

I must confess, the indent_data_frame stuff did get me confused... is it maybe easier to step through blocks and split_cols and paste the indents? I know that's in essence what indent_data_frame does, but maybe it's easier...

for(block in blocks) {
    # add indent
    DD[DD[,"varname___"] %in% block, "varname___"] <- paste0(indent_char, DD[DD[,"varname___"] %in% block, "varname___"])
    w <- min(which(DD[, "varname___"] %in% block))
    DD <- add_new_line(DD, where = w)
    DD[w, "varname___"] <- names(block) # probably doesn't work, but you get the idea
}
for(split in split_cols) {
    # each split needs a new indent
    DD[, "varname___"] <- paste0(indent_char, DD[, "varname___"])
    # etc...
}

I guess that might destroy the aliasing though. Might it be easier if there was a variable containing the variable name from the beginning (e.g. varname that has mpg and varname that has the alias, then you can change the varname without having to worry about destroying any links)?

arminstroebel commented 4 years ago

So I created a test case for blocking. Here is the code and the attachment has the data. The data are created by adding browser() at the first line if function arrange_helper() (this is line 355 of file helper.R) and then calling devtools::load_all("."),

The test case tries to cover the general case of atable useage: that is: with multiple split_cols, a group_col and aliases. The blocking has a block with a single variable and one block with two. Also some target_cols are not part of any block. Further test cases can be created by omitting one or more of split_cols, group_col, alias.

I try to leave indent_data_frame() untouched, but I am not sure if this will work as intended.

attr(mtcars$mpg, "alias") = "Consumption [Miles (US)/ gallon]"

Hmisc::label(mtcars$qsec) = "Quarter Mile Time"
units(mtcars$qsec) = "s"

attr(mtcars$disp, "alias") = "Displacement (cu.in.)"
attr(mtcars$am, "alias") = "Transmission (0 = automatic, 1 = manual)"

attr(mtcars$cyl, "alias") = "cylinder (#)"

atable::atable(mpg + qsec + disp + hp + drat ~ cyl | vs + am,
                            mtcars,
                            format_to = "Console",
                            blocks = list("Block" = c("qsec", "disp"),"Another Block" = "drat" ) )

Now browser() stops the execution and provides the input of arrange_helper(): tab, split_cols, format_to, Alias_mapping

blocks = list("Block" = c("qsec", "disp"),"Another Block" = "drat" )

# block to data.frame
name_adder = function(x, name){data.frame(target_cols=x, block_name=name,
                                          stringsAsFactors = FALSE)}

bb = mapply(name_adder, x=blocks, name=names(blocks),
       SIMPLIFY = FALSE)

bb = do.call(rbind, bb)

bb = doBy::renameCol(bb, "target_cols", atable_options("colname_for_variable"))

bb$block_name = factor(bb$block_name, levels = names(blocks)) # for the order

the_block_name = "block_name" # xxx write the_block_name in atable_options to avoid naming conflicts. "block_name" may already be a split_col or a level of group_col. Also add a stopifnot() for this condition.

# merge with blocks
DD = merge(tab, bb,
           all.x = TRUE)

# order with blocks
keys = c(split_cols, atable_options("colname_for_variable"), the_block_name)
ff <- paste("~", paste(keys, collapse = "+"), collapse = "")

DD_ordered <- doBy::orderBy(stats::as.formula(ff), DD)

Now DD_ordered contains all needed information.

Now indent_data_frame:

keys = c(split_cols, atable_options("colname_for_variable"), the_block_name)

indent_data_frame(DD_ordered,
                  keys = keys, indent_character = "    ")

Problem is that the column block_name contains NA and these are also indented, but thats not intended (sorry, I could not resist the pun).

I am thinking about changing the function line_adder() in indent_data_frame: When there is NA in keys, then do not add a line, something like that. In the current version of atable it is not possible, that keys contain NA.

Also the problem with the alias may be solved by also applying the alias mapping to the blocks, as done in the first step in arrange_helper() on the data.

@

arminstroebel commented 4 years ago

So I am struggling with the test case I proposed on Jul 5, 2020. These split_cols make things nasty.

But I was able to implement the blocking when there are no split_cols!

This needed some modifications of indent_data_frame.

If you are fine with blocking without split_cols then I will update the package.

This could make some mess as I need to find a place where I can insert the new indent_data_frame without too much collateral damage. Also tests must be written and passed.

aghaynes commented 4 years ago

Wow, thanks for working on this!

I would be fine without the split_cols...

For the new version of indent_data_frame, you could do something like the following to define it on the fly (if both versions are necessary)

indent_data_frame  <- ifelse(blocks_yes, 
                             indent_data_frame_original, # assuming you rename the original function
                             indent_data_frame_blocks)

blocks_yes would be an indicator to say whether blocks are present or not. I guess you'd build this to check that blocks and split_cols are not both present anyway.

arminstroebel commented 4 years ago

New version of atable is up on CRAN (0.1.7). Now supports blocks of variables, but only when split_cols are NULL.

I took some lines from your check_blocks() function. Thanks for that.

I just realized that when i can do blocking without splitcols, then I can just merge the split_cols to the blocked data.frame and then do a second indent. But that perhaps for the next version.

I am closing this issue.

aghaynes commented 4 years ago

Wow! Thanks Armin!

arminstroebel / atable

Blocks of variables #5