Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Error performing aggregation on very large data.table #4103

Closed. zachokeeffe closed this issue 4 years ago.

zachokeeffe commented 4 years ago

Hello,

I'm not sure how to construct a minimal reproducible example or find other reports of this, but I've encountered this issue several times when working with very large data.tables (hundreds of millions of rows). I'm trying to calculate the average and standard deviation of values by id, but I get the following error:

(fulldt<-fulldt[,list(r9m=mean(r9),r9s=sd(r9),r9lm=mean(r9l),r9ls=sd(r9l)),by='id'])
Error in gforce(thisEnv, jsub, o, f, len__, irows) :
  Internal error: Failed to allocate counts or TMP when assigning g in gforce
Calls: [ -> [.data.table -> gforce
Execution halted

Any suggestions?

jangorecki commented 4 years ago

Hello. It seems that you are running out of memory. You can either increase the memory on your machine or try to compute the aggregates in chunks.
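
To illustrate, a chunked version of the failing aggregation could look roughly like this (only a sketch, reusing fulldt and the column names id, r9 and r9l from the query above; the ids are split into batches, each batch is aggregated separately, and the partial results are bound together):

library(data.table)

ids    <- unique(fulldt$id)
chunks <- split(ids, cut(seq_along(ids), 10, labels = FALSE))   # 10 batches of ids

res <- rbindlist(lapply(chunks, function(b)
  fulldt[id %in% b,
         list(r9m = mean(r9), r9s = sd(r9), r9lm = mean(r9l), r9ls = sd(r9l)),
         by = 'id']
))

Because every id falls into exactly one batch, no second aggregation pass over the combined result is needed.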

When submitting an issue, please include the version of data.table you are using. In a case like this, also providing the available memory on your machine and the in-memory size of the data.table you are computing on would help.

MichaelChirico commented 4 years ago

I'm not sure we should close... maybe there's a memory explosion.

@zachokeeffe can you try to include a memory profile for this? Also, can you try to run it with GForce off? It will be slower but there may be a difference for memory.

options(datatable.optimize=0)
# reset to Inf later
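
For reference, one rough way to capture such a memory profile (just a sketch, reusing the column names from the original query; gc(reset = TRUE) clears the "max used" counters so the second gc() call reports the peak reached during the query):

options(datatable.optimize = 0)    # disable GForce for comparison
gc(reset = TRUE)                   # reset the "max used" counters
res <- fulldt[, list(r9m = mean(r9), r9s = sd(r9), r9lm = mean(r9l), r9ls = sd(r9l)), by = 'id']
gc()                               # the "max used" column now shows the peak during the query
options(datatable.optimize = Inf)  # restore the default afterwards
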
jangorecki commented 4 years ago

There might be a memory explosion, but the report as it stands does not let us do anything about it. If we get more details, we may want to re-open this issue.

zachokeeffe commented 4 years ago

I was using the previous release version of data.table, but experienced the same issue with the development version. If it's a memory issue, it isn't obvious that the supercomputer is detecting it; normally the output in such a case declares that the job ran out of memory. I ran a similar simulation on my machine (2 billion rows, with each id having about 200 observations, and two variables for which I'm calculating the mean and standard deviation) without problems, using a relatively low amount of RAM (50GB?):

library(data.table)
data.table 1.12.9 IN DEVELOPMENT built 2019-12-09 19:46:12 UTC; root  using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com
sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.12.9

loaded via a namespace (and not attached):
[1] compiler_3.5.2

tempIDs<-rep(1:10000000,each=200)
set.seed(1)
tempVals<-rnorm(length(tempIDs),0,3)
(dt<-data.table(id=tempIDs,val=tempVals))
                   id         val
         1:         1 -1.87936143
         2:         1  0.55092997
         3:         1 -2.50688584
         4:         1  4.78584241
         5:         1  0.98852332
1999999996:  10000000 -0.80316285
1999999997:  10000000  2.05666941
1999999998:  10000000  0.64250524
1999999999:  10000000 -0.63723868
2000000000:  10000000  0.03060346
set(dt,NULL,'val2',tempVals+5)
rm(tempIDs,tempVals);gc()
             used    (Mb)  gc trigger    (Mb)    max used    (Mb)
Ncells     341343    18.3      627754    33.6      627754    33.6
Vcells 5000717715 38152.5 12002954096 91575.3 10000718612 76299.5

(dt<-dt[,list(val_mu=mean(val),val_sd=sd(val),val2_mu=mean(val2),val2_sd=sd(val2)),by=id])
                id     val_mu   val_sd  val2_mu  val2_sd
        1:        1  0.1066189 2.787292 5.106619 2.787292
        2:        2  0.1219131 3.032115 5.121913 3.032115
        3:        3 -0.1254804 3.210103 4.874520 3.210103
        4:        4 -0.2989872 3.272904 4.701013 3.272904
        5:        5  0.0212134 3.207074 5.021213 3.207074
 9999996:  9999996  0.2467058 3.065416 5.246706 3.065416
 9999997:  9999997  0.4110975 3.127389 5.411098 3.127389
 9999998:  9999998  0.1382737 2.845036 5.138274 2.845036
 9999999:  9999999 -0.1086544 3.131983 4.891346 3.131983
10000000: 10000000  0.1496499 3.212241 5.149650 3.212241

This isn't the first time I've encountered this issue, and I'm perplexed about the conditions under which it throws the gforce error. Upping the RAM does not help.

zachokeeffe commented 4 years ago

I tried a different approach, creating the new columns with := and then calling unique, and got this error (with 180 GB of RAM):

(fulldt<-unique(fulldt[,list(id,r9m,r9s,r9lm,r9ls)]))
Error: cannot allocate vector of size 12.6 Gb
Execution halted
Warning message:
system call failed: Cannot allocate memory

Is there any way around this?

jangorecki commented 4 years ago

For this particular query, a more efficient way would be to drop the unneeded columns first and then call unique:

drop = setdiff(names(fulldt), c("id","r9m","r9s","r9lm","r9ls"))
fulldt[, c(drop) := NULL]
unique(fulldt)

Alternatively:

unique(fulldt, by=c("id","r9m","r9s","r9lm","r9ls"))

The error message Internal error: Failed to allocate counts or TMP when assigning g in gforce means more or less the same as Cannot allocate memory, just more verbose. It is clearly a memory issue. Did setting options(datatable.optimize=0) make any difference? Using options(datatable.verbose=TRUE) may also provide some useful information for debugging.
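
For example (a sketch, again reusing the column names from the original query), the verbose output shows how the grouping is evaluated and whether GForce is applied:

options(datatable.verbose = TRUE)   # print grouping / GForce diagnostics while the query runs
fulldt[, list(r9m = mean(r9), r9s = sd(r9), r9lm = mean(r9l), r9ls = sd(r9l)), by = 'id']
options(datatable.verbose = FALSE)  # switch the diagnostics off again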

could you