Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

ERROR with dcast.data.table with big data #1691

Open sjain777 opened 8 years ago

sjain777 commented 8 years ago

Hi, I have data with up to 200K rows and about 12K-20K unique values in value.var which needs to be flattened out using dcast.data.table. I get the following error with such data:

dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) TRUE, fill = FALSE, value.var = "Feature")
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  :
#   long vectors not supported yet: bmerge.c:51
# In addition: Warning message:
# In setattr(l, "row.names", .set_row_names(length(l[[1L]]))) :
#   NAs introduced by coercion to integer range

When I reduce the number of rows to 20K, the same syntax as above works fine. Below is code to generate the dummy data:

library(data.table)

# Define a function that returns a random string of length len
getRandString <- function(len = 12) {
  paste0("T", paste(sample(c(rep(0:9, each = 5), LETTERS, letters),
                           len, replace = TRUE), collapse = ""))
}

# Define parameters for the data.table to generate
nrows <- 200000
nvalues <- 12000

# Generate random character values (collisions are possible, so the
# number of unique values may be slightly below nvalues)
vrandtext <- character(nvalues)

for (i in 1:nvalues) {
  vrandtext[i] <- getRandString(4)
}
length(vrandtext)

# Define a function that generates 1-2 (RowId, Feature) rows for one id
generatetable <- function(x, vrandtext) {
  subtable <- function(x, row, vrandtext) {
    data.table(row, sample(vrandtext, 1))
  }
  numRepeat <- sample(1:2, 1)
  rbindlist(lapply(1:numRepeat, subtable, row = x, vrandtext = vrandtext))
}

# Generate the dummy data
system.time(dt1 <- rbindlist(lapply(1:nrows, generatetable, vrandtext = vrandtext)))

setnames(dt1, c("RowId", "Feature"))
dt1$Feature <- as.factor(dt1$Feature)
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) TRUE, fill = FALSE, value.var = "Feature")
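
(The row-by-row generator above is quite slow at 200K ids because it builds one tiny data.table per id; for anyone reproducing this, a vectorized sketch that yields the same structure is:)

reps <- sample(1:2, nrows, replace = TRUE)   # each id appears 1 or 2 times
dt1  <- data.table(
  RowId   = rep(seq_len(nrows), times = reps),
  Feature = factor(sample(vrandtext, sum(reps), replace = TRUE))
)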

Also, the flattened table's object size increases many times over, because the single "Feature" column is expanded into thousands of columns. I would like to reduce the size of the resulting table by filling the flattened columns with bit values (from the bit package) instead of logicals. But when I use the following syntax (on a table of 20K rows and ~12K unique values in "Feature"), I get the following error:

library(bit)
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) as.bit(1), fill = as.bit(0), value.var = "Feature")
# Error in setDT(ans) :
#   All elements in argument 'x' to 'setDT' must be of same length
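
(One way to sidestep both the memory blow-up and the bit coercion error, sketched here under the assumption that a sparse result is acceptable downstream, is to build the indicator matrix with the Matrix package rather than casting to a dense table:)

library(Matrix)

# Drop duplicate (RowId, Feature) pairs so each cell is set only once
dt1u <- unique(dt1)
rid  <- factor(dt1u$RowId)

# Sparse logical indicator: one row per RowId, one column per Feature
# level; only the TRUE cells are stored, so memory scales with
# nrow(dt1), not with nrows * nvalues
m <- sparseMatrix(
  i        = as.integer(rid),
  j        = as.integer(dt1u$Feature),
  x        = TRUE,
  dims     = c(nlevels(rid), nlevels(dt1u$Feature)),
  dimnames = list(levels(rid), levels(dt1u$Feature))
)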

Could you please let me know of the fix for both the above problems? Thanks!

sjain777 commented 8 years ago

Hi Arun, Did you get a chance to look at the two issues above? Thanks!

arunsrinivasan commented 8 years ago

Just looked at this issue:

#   long vectors not supported yet: bmerge.c:51

This shouldn't be happening for data of these dimensions, and should be fixed. Thanks for spotting this. I'm not sure if I'll be able to invest time on this for this release though :-(. Will see.
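
(For context: the wide result in the original example has roughly nrows * nvalues cells, which already exceeds R's 32-bit indexing limit, so an internal cross of ids by features would plausibly trip the long-vector check:)

nrows * nvalues                          # 200000 * 12000 = 2.4e9
.Machine$integer.max                     # 2147483647, the int32 ceiling
nrows * nvalues > .Machine$integer.max   # TRUE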

niths4u commented 7 years ago

Hi, I am also facing the same issue with:

dcast(dt, A ~ B, fill = 0, value.var = "col_sum")

Error in dim.data.table(x) : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138

This works fine when dt is small, but fails when it is large. Can't we classify this as a bug rather than an enhancement?

ucb commented 6 years ago

I have encountered the same error as @niths4u. I am using the development version and it is not fixed there either!

ljodea commented 6 years ago

This isn't an enhancement, it's a bug. The "enhancement" label is probably what pushes this issue to the back of the queue. I, like many others, have high-cardinality data that I need to cast. I use data.table for its speed, and find that the function is broken.

Please add the "bug" label.

Miachol commented 6 years ago

In the fread function, using skip = 2500000000 also raises the error: NAs introduced by coercion to integer range
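
(That skip value is larger than R's maximum 32-bit integer, so the coercion overflows; a one-line check:)

as.integer(2500000000)   # NA, with warning: NAs introduced by coercion to integer range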

charliekirkwood commented 6 years ago

I am also encountering the same error as @niths4u, and agree with @ljodea: can this be labelled as a bug? data.table is so appealing for its speed on large data sets; if this is a soft limit on the size of data it can handle, it surely makes sense to mark it as something to fix.

arunsrinivasan commented 5 years ago

The issue is that this cannot be fixed with the way dcast is currently implemented; it will need a rewrite. I've added the bug label (I agree that it is technically a bug, although the current error message is much clearer). I'll have a look at this ASAP, but it will be lots of work AFAICT.

iembry commented 11 months ago

Does anyone have any updates on when the size limitation for dcast will be fixed?

I have an example posted online that shows the issue with a smaller dataset (smaller than those of previous posters):

https://www.ecoccs.com/dcast_size-limit.html

The following works as intended (look at the last 2 columns):

library(data.table)

chem_abstracts <- fread("https://www.ecoccs.com/ListInfo-2023-06-30-CA-Index.csv")

chem_abstracts[, c("Internal Tracking Number", "EPA ID #", "TSN #", "Alternate ID", "Synonym Effective Date", "Synonym End Date", "Related Links", "Synonym Comment", "Status") := NULL]

setnames(chem_abstracts, "CAS #", "CAS")

chem_abstractss <- chem_abstracts[1:24100, ]

rsc1 <- dcast(chem_abstractss, ... ~ `Structural Notation Type`, 
      value.var = "Structural Notation", fill = "")

rsc1

The following does not work as intended (look at the last 2 columns):

library(data.table)

chem_abstracts <- fread("https://www.ecoccs.com/ListInfo-2023-06-30-CA-Index.csv")

chem_abstracts[, c("Internal Tracking Number", "EPA ID #", "TSN #", "Alternate ID", "Synonym Effective Date", "Synonym End Date", "Related Links", "Synonym Comment", "Status") := NULL]

setnames(chem_abstracts, "CAS #", "CAS")

chem_abstracts <- chem_abstracts[1:24900, ]

rsc2 <- dcast(chem_abstracts, ... ~ `Structural Notation Type`, 
      value.var = "Structural Notation", fill = "")

rsc2

tdhock commented 11 months ago

Hi! I don't have plans to work on this myself, but if you have time to work on it and submit a PR, I could review it. (I have worked on some other reshape code, namely melt; I'm not an expert on dcast internals, but I could at least review.)

jangorecki commented 11 months ago

Looks like it is reaching the int32 limit. If the output is not itself long-vector sized, and it is just a temporary working variable that exceeds int32, then chunking the input (probably by common dimension values) should be a sufficient workaround for the moment.
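
(A minimal sketch of that chunking idea, assuming the dt1 / RowId / Feature example from the top of the thread; dcast_chunked and chunk_size are hypothetical names, and absent cells come back as NA rather than FALSE unless replaced afterwards:)

# Cast the data id-chunk by id-chunk, then bind the pieces; each chunk
# stays well under the int32 limit that the full cast trips over
dcast_chunked <- function(dt, chunk_size = 20000L) {
  ids    <- unique(dt$RowId)
  groups <- split(ids, ceiling(seq_along(ids) / chunk_size))
  parts  <- lapply(groups, function(g)
    dcast(dt[RowId %in% g], RowId ~ Feature, function(x) TRUE,
          fill = FALSE, value.var = "Feature"))
  # fill = TRUE aligns columns across chunks; features absent from a
  # chunk come back as NA columns there
  rbindlist(parts, use.names = TRUE, fill = TRUE)
}

dt1wide <- dcast_chunked(dt1)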