Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

RStudio and R crashes with fatal error linked to data.table operation #2672

Closed AmyMikhail closed 6 years ago

AmyMikhail commented 6 years ago

I am attempting to summarize some variables cumulatively by group for each week in which there was new activity in that group, with a data.table line listing as input. This process works fine with a toy version of the function and a small input data set; however with larger data sets and the real (longer) function, R crashes with a fatal error. The details of the crash are here:

Problem signature:
Problem Event Name: APPCRASH
Application Name:   rsession.exe
Application Version:    1.1.383.0
Application Timestamp:  59d5818a
Fault Module Name:  datatable.dll
Fault Module Version:   0.0.0.0
Fault Module Timestamp: 5a39aedc
Exception Code: c0000005
Exception Offset:   0000000000029060
OS Version: 6.1.7601.2.1.0.256.48
Locale ID:  2057
Additional Information 1:   0af1
Additional Information 2:   0af12ed5ffbb0678966a6b7fe308a74b
Additional Information 3:   2514
Additional Information 4:   251401539b18800e11472f6b420048c8

Session info is here:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rgisws_0.1.2        devtools_1.13.4     openxlsx_4.0.17     scales_0.5.0        ggplot2_2.2.1       stringdist_0.9.4.6 
 [7] stringr_1.3.0       stringi_1.1.6       ISOweek_0.6-2       lubridate_1.6.0     data.table_1.10.4-3 PKI_0.1-5.1        
[13] base64enc_0.1-3     digest_0.6.13       getPass_0.2-2       RPostgreSQL_0.6-2   DBI_0.7            

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14     magrittr_1.5     munsell_0.4.3    colorspace_1.3-2 rlang_0.1.6      plyr_1.8.4       tools_3.4.2     
 [8] parallel_3.4.2   grid_3.4.2       gtable_0.2.0     withr_2.1.1.9000 yaml_2.1.14      lazyeval_0.2.1   tibble_1.4.2    
[15] memoise_1.1.0    pillar_1.1.0     compiler_3.4.2

Here is some example data:

# Example data:
mydt <- data.table(id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
                   grp = c("G1", "G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2", "G2", "G2"),
                   name = c("Jack", "John", "Jill", "Joe", "Jim", "Julia", "Simran", "Delia", "Aurora", "Daniele", "Joan", "Mary"),
                   sex = c(NA, NA, "f", "m", "m", "f", "m", "f", "f", "f", "f", "f"), 
                   age = c(2,12,29,15,30,75,5,4,7,55,43,39), 
                   reportweek = c("201740", "201750", "201801", "201801", "201801", "201748", "201748", "201749", "201810", "201752", "201752", "201801"))

# Set levels of group ID to loop over:
idlist <- c("id", "grp")

Here is a toy version of the function:

sumclusters <- function(dataset, groupvector) {

  # Load required packages:
  require(data.table)
  require(lubridate)

  #browser()

  # Create empty list to hold the results:
  datalist = list()

  for (i in unique(dataset$reportweek)) {
    # Subset by cumulative report week
    dpart = subset(dataset, reportweek <= i)

    # Calculate summary values within each subset:

    # Cluster size, age, duration:
    dpart[, clustersize := .N, by = eval(groupvector)]
    dpart[, maxweek := max(reportweek, 0, na.rm = TRUE), by = eval(groupvector)]
    dpart[, minweek := min(reportweek, na.rm = TRUE), by = eval(groupvector)]
    dpart[, repdate := as.Date(paste(substring(reportweek, 1, 4), substring(reportweek, 5, 6), 1, sep = "-"), "%Y-%U-%u")]
    dpart[, maxwkdate := as.Date(paste(substring(maxweek, 1, 4), substring(maxweek, 5, 6), 1, sep = "-"), "%Y-%U-%u")]
    dpart[, firstdate := min(repdate, na.rm = TRUE), by = eval(groupvector)]
    dpart[, lastdate := max(repdate, na.rm = TRUE), by = eval(groupvector)]
    dpart[, clusteragemonths := (as.numeric(lastdate - firstdate))/(365.25/12)]

    # Cluster characteristics last month:
    dpart[, clustersize4wk := sum(reportweek >= isoyrwk(maxwkdate - weeks(4))), by = eval(groupvector)]
    dpart[, fourweekgr := round(clustersize4wk/4, 2)]

    # Age range:
    dpart[, agemin := min(age, na.rm = TRUE), by = eval(groupvector)]
    dpart[, agemax := max(age, 0, na.rm = TRUE), by = eval(groupvector)]
    dpart[, agemedian := round(median(age, na.rm = TRUE), 0), by = eval(groupvector)]
    dpart[, agerange := paste(agemin, " - ", agemax, sep = "")]
    dpart[, adult := length(which(age >= 16)), by = eval(groupvector)]
    dpart[, total4age := length(which(!is.na(age))), by = eval(groupvector)]
    dpart[, adultprop := round((adult/total4age)*100, 0)]

    # Sex:
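    # na.rm = TRUE so that a missing sex value does not turn the whole group count into NA:
    dpart[, female := sum(sex == "f", na.rm = TRUE), by = eval(groupvector)]
    dpart[, male := sum(sex == "m", na.rm = TRUE), by = eval(groupvector)]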
    dpart[, total4sex := length(which(!is.na(sex))), by = eval(groupvector)]
    dpart[, maleprop := round((male/total4sex)*100, 0)]
    dpart[, sexratio := paste("F ", female, " : ", "M ", male, sep = "")]

    # Add to list:
    datalist[[i]] = dpart

  }

  # Bind all the summaries together:
  clusterlog = rbindlist(l = datalist, use.names = T, fill = T, idcol = "ReportNo")

  # Create list of report columns:
  reportcols = c(eval(groupvector), "clustersize", "minweek", "maxweek", "clusteragemonths", "clustersize4wk", "fourweekgr",
                 "agemin", "agemax", "agemedian", "agerange", "adultprop", 
                 "maleprop", "sexratio")

  # Create cluster tab:
  clustertab = clusterlog[, reportcols, with = F]

  # Set the key prior to deduplicating:
  setkeyv(clustertab, c(groupvector))

  # Deduplicate to get the summary table:
  clustertab = unique(clustertab)

  # Sort table by ID and report week:
  setorderv(clustertab, c(groupvector, "maxweek"))

  # Create list of line list columns:
  linelistcols = c("id", "grp", "name", "sex", "age", "reportweek")

  # Create cluster line list:
  clusterll = dataset[, linelistcols, with = F]

  # Return the summarised data.tables in a list:
  dtlist <- list("linelist" = clusterll, "clustersum" = clustertab)
  return(dtlist)

}

And dependent function isoyrwk:

isoyrwk <- function(dates){
  require(lubridate)
  require(data.table)

  isoweekcalc = data.table(myisoweek = isoweek(dates),
                            myisoyear = isoyear(dates),
                            calcweek = week(dates),
                            calcyear = lubridate::year(dates))
  isoweekcalc[ myisoweek >= 52 & calcweek == 1, myisoyear := calcyear]
  isoweekcalc[ myisoweek >= 52 & calcweek == 1, myisoweek := 1]
  isoweekcalc[ myisoweek >= 52 & calcweek >= 52, myisoyear := calcyear]
  isoweekcalc[ myisoweek >= 52 & calcweek >= 52, myisoweek := 52]
  isoweekcalc[, myisoweek := as.character(myisoweek)]
  isoweekcalc[ nchar(myisoweek) == 1, myisoweek := paste0("0", myisoweek)]
  isoweekcalc[, isoyearweek := paste0(myisoyear, myisoweek)]
  isoyearweek = isoweekcalc$isoyearweek
  isoyearweek
}

This is how I am applying the sumclusters function to my data:

mytest <- sapply(idlist, sumclusters, data = mydt, simplify = FALSE, USE.NAMES = TRUE)

Unfortunately I am not able to reproduce the fatal error with the toy data set and toy function. The only difference between the toy function and the real one is that there are more conditional counts on different variables, but the strategy for each one is exactly the same as shown above. I was originally getting a RHS / LHS class discrepancy error but this was discussed and resolved in this Stack Overflow post.

My real input data set is relatively large (2103 rows in the line listing, with reports in 155 weeks and four grouping vectors containing 1271, 144, 108 and 94 groups each, respectively).

I think the error might be due to the function timing out because there is too much data (I have 16 GB of RAM and my .Rproj file is on a network drive), but is there any way to confirm this from the above error? Or is it a bug?

Any insights into why this is causing R to crash and how I could prevent this would be much appreciated - hope I have posted this in the right place as the error details did specify that the fault module name is datatable.dll

mrmanojrai commented 6 years ago

Couple of queries/observations:

  1. Do you get the crash with the sample data provided in this issue?
  2. Is there a specific reason to use as.Date when lubridate has already been loaded and used?
  3. Most of the j operations could be grouped and executed in one go, since the by argument is the same for all of them (by = eval(groupvector)).
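
To illustrate point 3, here is a minimal sketch of several grouped columns being added in a single call using the functional form of `:=`. The small table `dt` is a hypothetical stand-in for the line listing, not the reporter's data, and only a few of the summary columns are shown.

```r
library(data.table)

# Hypothetical miniature of the line listing (not the reporter's data).
dt <- data.table(
  grp = c("G1", "G1", "G2"),
  age = c(2, 12, 29),
  sex = c(NA, "f", "m")
)
groupvector <- "grp"

# One grouped pass: the functional form of `:=` adds several columns at once.
dt[, `:=`(
  clustersize = .N,
  agemin      = min(age, na.rm = TRUE),
  agemax      = max(age, na.rm = TRUE),
  female      = sum(sex == "f", na.rm = TRUE)
), by = groupvector]
```

Columns that depend on other newly created columns (agerange built from agemin and agemax, for instance) would still need a second call, because all right-hand sides in one `:=` call are evaluated before any of the new columns exist.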

MichaelChirico commented 6 years ago

Have you reproduced this on the command line?

AmyMikhail commented 6 years ago

@mrmanojrai in response to your questions:

  1. Unfortunately I have not been able to create a small data set that reproduces the error. I'm pretty sure there are no operations in my real function that are not represented in the toy one above, which leaves:
  2. This is not really relevant to the problem at hand (at least I don't think it is), but I needed to get a date from the year week as my ISO year weeks (created with the function which handles year-end dates differently to the equivalent function in lubridate) are stored as numeric and I can't perform mathematical operations directly on them. If there is a way to do this with lubridate I would be happy to change this but I'm not aware of a yw function in lubridate? Essentially I was looking for a way to iteratively define four week periods based on the maximum week by group for each iteration - 4 weeks.
  3. Could you elaborate on this with an example?
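
The four-week window described in the second point can be sketched in base R as below. This is only an illustration under stated assumptions: the `yearweek` values are made up, and the `%U` format counts weeks from the first Sunday of the year (the convention the toy function's `as.Date` call uses), which is not the ISO 8601 week definition and does not reproduce the year-end handling of isoyrwk.

```r
# Hypothetical year-week strings in the same "YYYYWW" layout as reportweek.
yearweek <- c("201740", "201751", "201801")

# Monday of each week: %U is the week of the year (Sunday-based), %u the weekday (1 = Monday).
weekstart <- as.Date(
  paste(substring(yearweek, 1, 4), substring(yearweek, 5, 6), 1, sep = "-"),
  format = "%Y-%U-%u"
)

# Four weeks (28 days) before the latest week in the set.
cutoff <- max(weekstart) - 28

# TRUE for reports that fall inside the last four weeks.
recent <- weekstart >= cutoff
```

Working on Date values keeps the subtraction correct across a year boundary, which plain arithmetic on "YYYYWW" numbers would not.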

@MichaelChirico not sure what you mean by the command line - do you mean what happens if I run this in base R rather than RStudio? I will try this and let you know the outcome.

AmyMikhail commented 6 years ago

Update: base R crashes with the same error details:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: Rgui.exe
  Application Version:  3.42.7832.0
  Application Timestamp:    59ccc2b1
  Fault Module Name:    datatable.dll
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp:   5a39aedc
  Exception Code:   c0000005
  Exception Offset: 0000000000029060
  OS Version:   6.1.7601.2.1.0.256.48
  Locale ID:    2057
  Additional Information 1: cb11
  Additional Information 2: cb11abf51219d08bb34e1d4ff9f1a95b
  Additional Information 3: ff0f
  Additional Information 4: ff0f0ac722dc1c648282a37d7681e735

AmyMikhail commented 6 years ago

Update 2: Rterm.exe (which curiously is what opened when I clicked on R.exe) also crashes:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: Rterm.exe
  Application Version:  3.42.7832.0
  Application Timestamp:    59ccc2b3
  Fault Module Name:    datatable.dll
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp:   5a39aedc
  Exception Code:   c0000005
  Exception Offset: 0000000000029060
  OS Version:   6.1.7601.2.1.0.256.48
  Locale ID:    2057
  Additional Information 1: 56c2
  Additional Information 2: 56c2cbd44ff199678d57bf6e48a1f624
  Additional Information 3: 7c2f
  Additional Information 4: 7c2f3e180b4d57616b407bae15c5c322

AmyMikhail commented 6 years ago

Update 3:

On further investigation, by dumping the summarized tables to .csv, I was able to determine that the problem was not in summarizing the data but in the call to rbindlist: in the CRAN release of data.table, rbindlist cannot handle empty tables and crashes R, as described in issue #2340.

The issue was fixed in data.table 1.10.5 (development version) and I'm happy to report that after upgrading to 1.10.5 my real function runs on my real data and produces the desired output without crashing R.
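
For anyone who cannot upgrade straight away, a generic precaution is to look for zero-row tables in the list and drop them before binding. The sketch below uses a hypothetical three-element `datalist`; it is not confirmed to cover the exact trigger described in #2340, only the "empty table" case mentioned above.

```r
library(data.table)

# Hypothetical per-week results; the middle table has zero rows.
datalist <- list(
  "201740" = data.table(id = "a", clustersize = 1L),
  "201741" = data.table(id = character(), clustersize = integer()),
  "201742" = data.table(id = "c", clustersize = 3L)
)

# Row counts per element make empty tables easy to spot.
rowcounts <- vapply(datalist, nrow, integer(1))

# Keep only the non-empty tables, then bind as in the toy function.
nonempty <- datalist[rowcounts > 0L]
clusterlog <- rbindlist(nonempty, use.names = TRUE, fill = TRUE, idcol = "ReportNo")

# Version of data.table currently installed, for comparison with the fixed version.
installed <- packageVersion("data.table")
```

Inspecting `rowcounts` inside the loop also shows which report week produced the empty subset, without writing each table to .csv.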

Although I was unable to reproduce the problem with a MWE, I think that is just because the MWE didn't sufficiently reflect the complexity of my real data set. I will close the issue.