Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Optimise dcast.data.table to reduce peak memory usage (currently: reaches 4x table size) #1069

Open NoviceProg opened 9 years ago

NoviceProg commented 9 years ago

I am using data.table's melt and dcast functions to melt a wide table, do some transformations, and cast it back into a tidy format.

The Issue

I tested with a small file of ~1GB and was very surprised when R ran out of memory on my i7, 16GB machine. Investigating further, I realised that dcast.data.table hits a peak memory usage of ~4x the size of the long-format table.

Running my sample code below, the memory requirements for the process are as follows:

I added reshape2's dcast for comparison:

My Request

Can memory usage for dcast.data.table be optimised further?

dcast.data.table is certainly faster than its counterpart, but it appears to carry some memory overhead. The 1GB file I originally tested is only ~30% of the full dataset, so a more memory-frugal dcast.data.table would be very much appreciated =)

[Screenshots: memory usage at each step for dcast.data.table vs reshape2::dcast]

Sample Data and Code

library( data.table )
library( reshape2 )
library( stringr )

gc( reset = T )
set.seed(1)
nrow <- 5e5
ColNames <- c( "IDCol1", "IDCol2", "IDCol3", 
               paste0( "A_dd", 1:8 ), paste0( "B_dd", 1:8 ) )

# Create DT
dt <- as.data.table( setNames( c(
  replicate( 2, sample( state.name, nrow, replace = T ), simplify = F ), 
  replicate( 1, 1:nrow, simplify = F ), 
  replicate( 16, round( runif( nrow, 1, 30 ), 2), simplify = F ) ), 
    ColNames ) )
tables()
gc( reset = T )

# Melt dt
L_dt <- melt( dt, id = c( "IDCol1", "IDCol2", "IDCol3" ) )
rm( dt )
gc()

tables()
gc( reset = T )

# Transform dt
L_dt[ , NewCol1 := as.integer( str_extract( L_dt$variable, "[0-9]$" ) ) ]
L_dt[ , variable := str_replace( L_dt$variable, "[0-9]$", "" ) ]
tables()
gc( reset = T )

# Using data.table to re-cast L_dt to W_dt
W_dt1 <- dcast.data.table( L_dt, IDCol1+IDCol2+IDCol3+NewCol1 ~ variable, 
                          value.var = "value" )
gc()
tables()
rm( W_dt1 )
gc( reset = T )

# Using Reshape2 to re-cast L_dt to W_dt
W_dt2 <- dcast( L_dt, IDCol1+IDCol2+IDCol3+NewCol1 ~ variable, 
                          value.var = "value" )
gc()
arunsrinivasan commented 9 years ago

I've added this to the list, but note that melt.data.table can now melt into multiple columns. Your task basically reduces to:

system.time(ans <- melt(dt, id=1:3, measure=list(4:11, 12:19), value=c("A_dd", "B_dd")))
#    user  system elapsed 
#   0.035   0.012   0.047 
head(ans)
#           IDCol1     IDCol2 IDCol3 variable  A_dd  B_dd
# 1:       Indiana New Jersey      1        1  5.06 21.02
# 2:         Maine  Tennessee      2        1 21.17 22.40
# 3: New Hampshire  Louisiana      3        1 22.14 13.83
# 4:      Virginia     Hawaii      4        1  3.66 14.95
# 5:        Hawaii    Montana      5        1  2.93  8.00
# 6:       Vermont     Hawaii      6        1 18.77 13.70

You can order it using setorder() if necessary as follows:

setorder(ans, IDCol1, IDCol2, IDCol3, variable)

PS: I really appreciate the time and effort you take to file a report.

NoviceProg commented 9 years ago

You're most welcome, Arun! And I am very excited that you've taken this up so quickly!

From my little experiment above, it appears that melt is highly memory-efficient: to create the 184MB long table, the peak memory reached was merely 202MB. Is it possible to borrow techniques from melt to bring dcast near this level of efficiency? If yes, it'll be a data analyst's dream come true!
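
(For reference, the peak figures quoted here come from the pattern in my sample code above: reset gc()'s counters just before a step, run it, then read the "max used" column.)

gc( reset = TRUE )   # zero the "max used" counters
L_dt <- melt( dt, id = c( "IDCol1", "IDCol2", "IDCol3" ) )
gc()                 # the "max used" (Mb) column now shows the peak for this step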

Among the many discussions about R on SO (not just about data.table, but R in general), I noticed that a large part talks about speed, while memory usage is less emphasised. In the URL below, for instance, 'efficiency' seems to be equated with speed.

Unfortunately, there is a physical limit to how much RAM a machine can handle. My local machine's motherboard maxes out at 16GB, while the office's shared workstation has 96GB of RAM (I'm not sure what its maximum capacity is, though).

data.table is central to my work involving big data, so I would like to thank you! Indeed, anyone who deals with big data appreciates all the effort that has gone into making data.table such a wonderful package.

https://stackoverflow.com/questions/18802257/efficient-ways-to-reshape-huge-data-from-long-to-wide-format-similar-to-dcast

arunsrinivasan commented 9 years ago

melt is straightforward, but cast is not. In my experiments, reshape2::dcast was the one blowing up memory on my system; this is the first case I've seen where dcast.data.table uses slightly more. But it's always good to know improvements are possible.

On memory, I think we've made attempts in numerous cases to optimise for both. I've detailed some of what the syntax offers in this SO post. That is also the reason we implemented melt on multiple columns directly, so that we don't have to melt the entire data set only to cast again - which is terribly inefficient both in terms of memory and speed.
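
To make the contrast concrete with the sample data from this thread (a sketch using dt and the packages loaded in the sample code above; both calls were already posted earlier):

# (1) melt everything, then transform and cast back: builds a large intermediate long table
L_dt <- melt( dt, id = c( "IDCol1", "IDCol2", "IDCol3" ) )
L_dt[ , NewCol1 := as.integer( str_extract( variable, "[0-9]$" ) ) ]
L_dt[ , variable := str_replace( variable, "[0-9]$", "" ) ]
W_dt <- dcast.data.table( L_dt, IDCol1+IDCol2+IDCol3+NewCol1 ~ variable, 
                          value.var = "value" )

# (2) multi-column melt: the same result (up to column names and row order) in one step,
#     with no 8-million-row intermediate
ans <- melt( dt, id = 1:3, measure = list( 4:11, 12:19 ), 
             value.name = c( "A_dd", "B_dd" ) )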

NoviceProg commented 9 years ago

By golly, Arun, I've finally gotten the new 'multi-column' melt feature to work! We had a similar conversation on Stack Overflow 2 weeks ago when I mentioned this new feature was giving me problems. I tried again on 2 machines running (i) Win 8 and (ii) Win XP SP3, and in both cases RStudio eventually crashed. Then I installed RStudio on Linux Mint running in a VM and all's well! I will file a separate issue on this.

Back to the memory-efficiency issue for 'melt-process-cast': you're right, it's certainly more memory-efficient to melt straight to the final columns using the new feature. Using my sample code above, the peak memory used was only 240MB to generate the 123MB table. Very efficient!

Excited, I then tried to use the feature where the measure columns have 2 levels of encoding in their names, but it didn't work (see the error message below). Is this possible with the new feature, and if so, how should I code it? I need to handle this in my dataset too, so I hope you can advise.

New code with 2 levels of encoding in column-names:

ColNames <- c( "IDCol1", "IDCol2", "IDCol3", 
               paste0( "T1_A_dd", 1:8 ), paste0( "T1_B_dd", 1:8 ), 
               paste0( "T2_A_dd", 1:8 ), paste0( "T2_B_dd", 1:8 )
               )

dt <- as.data.table( setNames( c(
  replicate( 2, sample( state.name, nrow, replace = T ), simplify = F ), 
  replicate( 1, 1:nrow, simplify = F ), 
  replicate( 32, round( runif( nrow, 1, 30 ), 2), simplify = F ) ), 
    ColNames ) )

ans <- melt(dt, id=1:3, measure=list(4:11, 12:19, 20:27, 28:35), value=c("A_dd", "B_dd"))

Error Message:

Error in melt.data.table(dt, id = 1:3, measure = list(4:11, 12:19, 20:27,  : 
  When 'measure.vars' is a list, 'value.name' must be a character vector of length =1 or =length(measure.vars).

After 2 weeks of being stuck on this data manipulation issue, there's some light... thank you so much!

arunsrinivasan commented 9 years ago

You're melting into 4 different columns, but providing only two value.name values. Please check ?melt.data.table (and let me know if the explanation there is not sufficient / unclear).
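
For instance, with four measure groups you would supply four names, something like:

ans <- melt(dt, id=1:3, measure=list(4:11, 12:19, 20:27, 28:35), 
            value.name=c("T1_A_dd", "T1_B_dd", "T2_A_dd", "T2_B_dd"))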

arunsrinivasan commented 9 years ago

Notes:

With the recent update to dcast.data.table, this is what I get:

# Using data.table to re-cast L_dt to W_dt
W_dt1 <- dcast.data.table( L_dt, IDCol1+IDCol2+IDCol3+NewCol1 ~ variable, 
                           value.var = "value" )
gc()
#            used  (Mb) gc trigger  (Mb)  max used  (Mb)
# Ncells   399610  21.4     741108  39.6    411115  22.0
# Vcells 60667030 462.9  108295982 826.3 102697779 783.6

and with reshape2:

# Using Reshape2 to re-cast L_dt to W_dt
W_dt2 <- dcast( L_dt, IDCol1+IDCol2+IDCol3+NewCol1 ~ variable, 
                value.var = "value" )
gc()
#            used  (Mb) gc trigger   (Mb)  max used   (Mb)
# Ncells   402958  21.6    3170326  169.4   4903021  261.9
# Vcells 64797629 494.4  160766414 1226.6 160267965 1222.8

But much bigger improvements are possible by not having to use CJ at all (IIUC), which is where we can save a lot of memory. To revisit and update when I next find time.
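
(For context on the CJ point: a cross join of the cast formula's LHS values materialises every possible combination, which can be far larger than the set of combinations actually present in the data. A rough illustration of the size gap, not of dcast's internals:)

set.seed(1)
d <- data.table( id1 = sample( state.name, 1e5, replace = TRUE ),
                 id2 = sample( 1e4, 1e5, replace = TRUE ) )
uniqueN( d, by = c( "id1", "id2" ) )            # combinations observed: at most 1e5
nrow( CJ( unique( d$id1 ), unique( d$id2 ) ) )  # full cross join: ~50 * 1e4 = ~5e5 rows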

NoviceProg commented 9 years ago

Arun, I have implemented my data transformation using the multi-col melt solution. It works perfectly and is very memory-efficient. My dataset is pretty complicated, but overall the peak memory used is only ~2 to 2.5 times the size of the final melted table. This is in line with what I mentioned earlier. As for column names with 2 levels of encoding, the simple solution is to melt-cast twice, so no further enhancements are needed to what you've already implemented. Thanks again!

arunsrinivasan commented 9 years ago

I don't fully understand why you'd need to melt-cast twice. If you could show an example, that'd help (to identify whether things could be improved).

I'm keeping the FR open as I see further improvements possible for dcast.

NoviceProg commented 9 years ago

Glad to help and I'm always open to possible ways to improve memory efficiency and my procedure.

I needed to melt-cast twice in order to handle wide tables whose column names contain 2 levels of encoding.

Below is a minimal example:

ColNames <- c( "IDCol1", "IDCol2", "IDCol3", 
               paste0( "T1_A_dd", 1:8 ), paste0( "T1_B_dd", 1:8 ), 
               paste0( "T2_A_dd", 1:8 ), paste0( "T2_B_dd", 1:8 )
               )

dt <- as.data.table( setNames( c(
  replicate( 2, sample( state.name, nrow, replace = T ), simplify = F ), 
  replicate( 1, 1:nrow, simplify = F ), 
  replicate( 32, round( runif( nrow, 1, 30 ), 2), simplify = F ) ), 
    ColNames ) )

tmp <- melt(dt, id=1:3, measure=list(4:11, 12:19, 20:27, 28:35), 
            value=c("T1_A_dd", "T1_B_dd", "T2_A_dd", "T2_B_dd"))

final <- melt(tmp, id=1:4, measure=list(c(5,7), c(6,8)), 
              value=c("A_dd", "B_dd"), variable="Type")

In case you're wondering whether such 2-level-encoded data exists, let me give you a theoretical but plausible example (header row only): Country | GDP_2000_H1 | GDP_2000_H2 | GNP_2000_H1 | GNP_2000_H2 | GDP_2001_H1 | GDP_2001_H2 | ....

In the above, GDP and GNP are the 'variables', and if I want the years and H1/H2 to end up as separate columns, I would need to melt-cast twice.
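
A minimal sketch of what I mean, using a tiny made-up table (hypothetical values) and the two melt steps:

toy <- data.table(
  Country     = c( "AA", "BB" ),
  GDP_2000_H1 = 1:2,   GDP_2000_H2 = 3:4,
  GNP_2000_H1 = 5:6,   GNP_2000_H2 = 7:8,
  GDP_2001_H1 = 9:10,  GDP_2001_H2 = 11:12,
  GNP_2001_H1 = 13:14, GNP_2001_H2 = 15:16 )

# melt no. 1: collapse the years ("Year" factor: 1 = 2000, 2 = 2001)
step1 <- melt( toy, id = "Country",
               measure = list( c(2, 6), c(3, 7), c(4, 8), c(5, 9) ),
               value.name = c( "GDP_H1", "GDP_H2", "GNP_H1", "GNP_H2" ),
               variable.name = "Year" )

# melt no. 2: collapse the halves ("Half" factor: 1 = H1, 2 = H2)
step2 <- melt( step1, id = c( "Country", "Year" ),
               measure = list( c( "GDP_H1", "GDP_H2" ), c( "GNP_H1", "GNP_H2" ) ),
               value.name = c( "GDP", "GNP" ),
               variable.name = "Half" )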

Data comes in various shapes and sizes, and how to transform it depends on the analyst's requirements. IMHO, I believe your current 'multi-column' melt in v1.9.5 is flexible enough to handle different situations while concurrently not being overly-complicated in terms of syntax. This is why I suggested previously that no further enhancements in this regard is necessary (of course, if there is a way I can just melt once for my example, please let me know).

arunsrinivasan commented 9 years ago

Thanks, I see now. It requires melt alone (not cast) two times.

IMHO, I believe your current 'multi-column' melt in v1.9.5 is flexible enough to handle different situations while concurrently not being overly-complicated in terms of syntax.

Yes, I agree.

Will close this FR after figuring out if it is possible to get dcast to be slightly more memory efficient (in the way I think it's possible).

NoviceProg commented 9 years ago

melt-cast is the informal 'name' I give your 'multi-col melt' to distinguish it in my mind from the 'normal' melt. You're right, it should just be melt 2 times.

Will take care not to use melt-cast publicly in case it causes more confusion =)

skanskan commented 7 years ago

Hello guys.

I had the same problem two months ago and this was my solution. My dataset has approximately 1 million rows and 5000 variables, and I convert about half of them to long format.

Instead of trying to convert the whole data.table at once, I broke the operation into a loop, reading a few lines at a time and appending the result to a file. I think the wide-to-long conversion can be split up this way (if I'm wrong, please tell me). The only drawback is that it's very slow.

nlines <- 1000000   # number of lines in my file; in truth it was calculated from the file
chunk  <- 1000      # I want to process chunks of this many lines
veces  <- ceiling(nlines/chunk)   # number of chunks to process
mynames <- ....     # contains the column names, the names of my vars
idvars  <- grep("_20[0-9][0-9]$", mynames, invert = TRUE)  # the id columns (those without a year suffix)

for (iter in 0:(veces-1)) {
  # read the next chunk of the wide file (skip the header row plus the rows already read)
  my <- fread("wide.csv", stringsAsFactors = F, integer64 = "character",
              skip = 1 + (iter*chunk), nrows = chunk, na.strings = c("", "NA"))
  colnames(my) <- mynames
  temp <- melt(my, id.vars = idvars)
  # split "variable" into the measure name and the year, then cast back one column per measure
  nuevo <- dcast(
    temp[, `:=`(var = sub('_20[0-9][0-9]$', '', variable),
                year = sub('.*_', '', variable),
                variable = NULL)],
    ... ~ var, value.var = 'value')
  fwrite(nuevo, "long.csv", quote = FALSE, sep = ",", append = T)
  rm(temp); rm(nuevo); rm(my); gc()
}
MichaelChirico commented 5 years ago

On current master:

library( data.table )
library( reshape2 )
# 
# Attaching package: ‘reshape2’
# 
# The following objects are masked from ‘package:data.table’:
# 
#     dcast, melt
# 
library( stringr )

gc( reset = T )
#           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells  731340 39.1    1263735 67.5         NA   731340 39.1
# Vcells 1376095 10.5    8388608 64.0      16384  1376095 10.5
set.seed(1)
nrow <- 5e5
ColNames <- c( "IDCol1", "IDCol2", "IDCol3", 
               paste0( "A_dd", 1:8 ), paste0( "B_dd", 1:8 ) )

# Create DT
dt <- as.data.table( setNames( c(
  replicate( 2, sample( state.name, nrow, replace = T ), simplify = F ), 
  replicate( 1, 1:nrow, simplify = F ), 
  replicate( 16, round( runif( nrow, 1, 30 ), 2), simplify = F ) ), 
    ColNames ) )
tables()
#      NAME    NROW NCOL MB                                                  COLS KEY
# 1:     dt 500,000   19 71            IDCol1,IDCol2,IDCol3,A_dd1,A_dd2,A_dd3,...    
# 2: issues   1,603    8  2 html_url,number,title,labels,milestone,created_at,...    
# Total: 73MB
gc( reset = T )
#            used (Mb) gc trigger  (Mb) limit (Mb) max used (Mb)
# Ncells   735362 39.3    1263735  67.5         NA   735362 39.3
# Vcells 10663749 81.4   23246255 177.4      16384 10663749 81.4

# Melt dt
L_dt <- melt( dt, id = c( "IDCol1", "IDCol2", "IDCol3" ) )
rm( dt )
gc()
#            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
# Ncells   736001  39.4    1263735  67.5         NA   738126  39.5
# Vcells 33390315 254.8   53261711 406.4      16384 42642640 325.4

tables()
#      NAME      NROW NCOL  MB                                                  COLS KEY
# 1: issues     1,603    8   2 html_url,number,title,labels,milestone,created_at,...    
# 2:   L_dt 8,000,000    5 244                   IDCol1,IDCol2,IDCol3,variable,value    
# Total: 246MB
gc( reset = T )
#            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
# Ncells   736078  39.4    1263735  67.5         NA   736078  39.4
# Vcells 33415338 255.0   53261711 406.4      16384 33415338 255.0

# Transform dt
L_dt[ , NewCol1 := as.integer( str_extract( L_dt$variable, "[0-9]$" ) ) ]
L_dt[ , variable := str_replace( L_dt$variable, "[0-9]$", "" ) ]
tables()
#      NAME      NROW NCOL  MB                                                  COLS KEY
# 1: issues     1,603    8   2 html_url,number,title,labels,milestone,created_at,...    
# 2:   L_dt 8,000,000    6 305           IDCol1,IDCol2,IDCol3,variable,value,NewCol1    
# Total: 307MB
gc( reset = T )
#            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
# Ncells   737651  39.4    1263735  67.5         NA   737651  39.4
# Vcells 41420646 316.1   66172966 504.9      16384 41420646 316.1

# Using data.table to re-cast L_dt to W_dt
W_dt1 <- dcast.data.table( L_dt, IDCol1+IDCol2+IDCol3+NewCol1 ~ variable, 
                           value.var = "value" )
gc()
#            used  (Mb) gc trigger  (Mb) limit (Mb)  max used (Mb)
# Ncells   752093  40.2    1263735  67.5         NA    767630   41
# Vcells 61437816 468.8  104623184 798.3      16384 101445610  774
tables()
#      NAME      NROW NCOL  MB                                                  COLS
# 1: issues     1,603    8   2 html_url,number,title,labels,milestone,created_at,...
# 2:   L_dt 8,000,000    6 305           IDCol1,IDCol2,IDCol3,variable,value,NewCol1
# 3:  W_dt1 4,000,000    6 153                IDCol1,IDCol2,IDCol3,NewCol1,A_dd,B_dd
#                             KEY
# 1:                             
# 2:                             
# 3: IDCol1,IDCol2,IDCol3,NewCol1
# Total: 460MB
rm( W_dt1 )
gc( reset = T )
#            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
# Ncells   752116  40.2    1263735  67.5         NA   752116  40.2
# Vcells 41454519 316.3  104623184 798.3      16384 41454519 316.3

# Using Reshape2 to re-cast L_dt to W_dt
W_dt2 <- dcast( L_dt, IDCol1+IDCol2+IDCol3+NewCol1 ~ variable, 
                value.var = "value" )
gc()
#            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
# Ncells   756132  40.4    3192704  170.6         NA   5255230  280.7
# Vcells 65560644 500.2  153948763 1174.6      16384 153948763 1174.6

i.e. similar to Arun's post on 2015-03-16

talegari commented 4 years ago

I am running into this issue repeatedly, with the R session terminating due to memory exhaustion. I have started using this function as a replacement (trading some speed for memory).

dcastLarge = function(dt, chunk_size = 1e5, ...){

  # arguments passed straight through to dcast (formula, value.var, fill, ...)
  arguments = list(...)

  # assign rows round-robin to chunks of roughly chunk_size rows each
  nr = nrow(dt)
  n_chunks = ceiling(nr/chunk_size)
  f = rep_len(1:n_chunks, nr)

  dt_list = split(dt, f)

  # cast each chunk separately, then stack the results
  dt_list = lapply(dt_list
                   , function(chunk){
                       Gmisc::fastDoCall(data.table::dcast
                                         , c(list(data = chunk)
                                             , arguments
                                             )
                                         )
                     }
                   )

  res = data.table::rbindlist(dt_list, use.names = TRUE, fill = TRUE)

  # rbindlist(fill = TRUE) leaves NA in columns missing from a chunk;
  # re-apply the requested fill value, if one was given
  if(!is.null(arguments$fill)){
    col_names = colnames(res)
    filler    = function(x) tidyr::replace_na(x, arguments$fill)

    res[, c(col_names) := lapply(.SD, filler), .SDcols = col_names]
  }
  return(res[])
}

Hoping to see a fast solution.