I would imagine median could be very memory hungry, which is why you usually see approx_quantile in "big data" SQL and median itself is dropped entirely. Do you get the same issue if you try to sort the data by col1? If not, it may be that gmedian is trying to do all three columns at once. As a workaround, do ft[, .N, keyby = .(grp, col1)] and get the median from the frequency table, since you said there are far fewer unique values.
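A minimal sketch of that frequency-table workaround, assuming ft, grp and col1 are the names used above and col1 has relatively few unique values (the toy data here is illustrative):
library(data.table)
ft <- data.table(grp = rep(1:2, each = 6),
                 col1 = c(5L, 1L, 3L, 3L, 2L, 5L, 7L, 7L, 8L, 9L, 9L, 9L))  # toy stand-in
freq <- ft[, .N, keyby = .(grp, col1)]  # frequency table, sorted by grp then col1
med <- freq[, {
  cum <- cumsum(N)                                  # cumulative counts within the group
  tot <- cum[length(cum)]                           # total rows in the group
  lo  <- col1[which(cum >= ceiling(tot / 2))[1]]    # lower middle value
  hi  <- col1[which(cum >= floor(tot / 2) + 1)[1]]  # upper middle value
  .(col1_median = (lo + hi) / 2)                    # mean of the two middles = the median
}, by = grp]
med  # grp 1 -> 3.0, grp 2 -> 8.5, matching median() on the raw values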
Will changing the median to stats::median help? It should prevent data.table from optimizing with gforce.
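For reference, a small sketch of how to check whether gforce kicks in (toy table and column names are illustrative; datatable.optimize and the verbose argument are standard data.table features, but the exact verbose wording may vary by version):
library(data.table)
DT <- data.table(grp = rep(1:2, each = 5), col1 = rnorm(10))    # toy stand-in
DT[, .(m = median(col1)), keyby = grp, verbose = TRUE]          # verbose output reports whether gforce (gmedian) is used
DT[, .(m = stats::median(col1)), keyby = grp, verbose = TRUE]   # the qualified call is not gforce-optimized
options(datatable.optimize = 1L)                                # alternatively, lower the optimization level to turn gforce off globally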
Disabling gforce or using stats::median does not trigger this error, but I'm still curious why it occurs. My server has 1TB of memory, which should be enough for this.
How about stats::median(c(col1, col2, col3))? (Just to check if tripling the memory footprint has an impact.)
stats::median(c(col1, col2, col3)) works quite normally and nothing special occurs.
Well, the error is thrown from the TMP allocation in src/gsumm.c (int *TMP = malloc(nrow*2*sizeof(int));). However, I also see that nrow is declared as an int. It looks like this will overflow for 1423657324 * 2...
@renkun-ken Would you mind running a test? Just cut the row count of your data to 1073741823L / 1073741824L respectively and try your original code. I expect the first case (1073741823L rows) to work but the second to fail. I believe that's the cause: the overflow leads to an enormous memory allocation...
Rcpp::cppFunction("size_t test(int x) {
  // x*2 is evaluated in int arithmetic, so it overflows before the widening multiply by sizeof(int)
  return x*2*sizeof(int);
}")
test(1073741823L)
#> [1] 8589934584
test(1073741824L)
#> [1] 1.844674e+19
test(1423657324L)
#> [1] 1.844674e+19
Created on 2020-03-10 by the reprex package (v0.3.0)
Yes, the 1073741823L case works perfectly while the 1073741824L case fails, as you expected.
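For comparison, a hedged sketch (not the actual patch) of the kind of change that avoids the overflow, namely widening to size_t before the multiply; the change in the PR may differ:
Rcpp::cppFunction("size_t test_fixed(int x) {
  // casting x first makes the whole product size_t arithmetic, so it cannot overflow int
  return (size_t)x*2*sizeof(int);
}")
test_fixed(1073741824L)  # 2147483648 * 4 = 8589934592 bytes (~8 GiB), as intended
test_fixed(1423657324L)  # 2847314648 * 4 = 11389258592 bytes (~10.6 GiB)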
@renkun-ken It would be great if you could verify that PR #4297 does fix this issue, when you have time, of course.
@shrektan Thanks! I'll verify it soon.
@shrektan Sadly it's too hard to fetch the repo:
remote: Enumerating objects: 1431, done.
remote: Counting objects: 100% (1431/1431), done.
remote: Compressing objects: 100% (181/181), done.
Timeout, server github.com not responding.
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
Tried many times but no luck.
That's the network issue I encountered before... so I packaged the source and sent it to your email (I assume you are able to access your personal email).
Thanks for your nice packaging! My git fetch worked magically as I was sleeping last night.
I retried and can confirm that the bug is fixed with the PR. Good work!
BTW, do you suspect there are other similar int overflow problems like this?
Here are the other 60 or so places with a similar pattern in the code (I haven't combed through them):
grep -Enr "[^.][0-9]+\s*[*]" src --include=*.c | grep -Ev "^src/[a-z]+[.]c:[0-9]+:\s*[/][/]"
src/forder.c:260: memset(thiscounts, 0, 256*sizeof(int));
src/forder.c:380: dmask = dround ? 1 << (8*dround-1) : 0;
src/forder.c:425: memset(stat, 0, 257*sizeof(uint64_t));
src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);
src/forder.c:1012: memcpy(my_starts, my_starts_copy, 256*sizeof(uint16_t)); // restore starting offsets
src/forder.c:1051: uint8_t *ugrps = malloc(nBatch*256*sizeof(uint8_t));
src/frank.c:93: dans[xorder[j]-1] = (2*xstart[i]+xlen[i]-1)/2.0;
src/frank.c:126: int offset = 2*xstart[i]+xlen[i]-2;
src/gsumm.c:115: int *TMP = malloc(nrow*2*sizeof(int));
src/gsumm.c:132: int *restrict my_tmp = TMP + b*2*batchSize;
src/gsumm.c:135: int *p = my_tmp + 2*my_counts[w]++;
src/gsumm.c:146: const int *restrict p = TMP + b*2*batchSize + start*2;
src/fsort.c:125: if (batchSize < 1024) batchSize = 1024; // simple attempt to work reasonably for short vector. 1024*8 = 2 4kb pages
src/fsort.c:178: (int)(nBatch*MSBsize*sizeof(R_xlen_t)/(1024*1024)),
src/fsort.c:179: (int)(nBatch*MSBsize*sizeof(R_xlen_t)/(4*1024*nBatch)),
src/assign.c:21: SETLENGTH(x,50+n*2*sizeof(void *)/4); // 1*n for the names, 1*n for the VECSXP itself (both are over allocated).
src/assign.c:575: char *s5 = (char*) malloc(strlen(tc2) + 5); //4 * '_' + \0
src/bmerge.c:496: ival.d-xval.d == rollabs /*#1007*/))
src/bmerge.c:510: xval.d-ival.d == rollabs /*#1007*/))
src/fread.c:194: char *ptr = buf + 501 * flip;
src/fread.c:282: const char *mostConsumed = start; // tests 1550* includes both 'na' and 'nan' in nastrings. Don't stop after 'na' if 'nan' can be consumed too.
src/fread.c:375: ans = (double) tp.tv_sec + 1e-9 * (double) tp.tv_nsec;
src/fread.c:379: ans = (double) tv.tv_sec + 1e-6 * (double) tv.tv_usec;
src/fread.c:434: mmp_copy = (char *)malloc((size_t)fileSize + 1/* extra \0 */);
src/fread.c:596: acc = 10*acc + digit;
src/fread.c:628: acc = 10*acc + digit;
src/fread.c:693: acc = 10*acc + digit;
src/fread.c:727: acc = 10*acc + digit;
src/fread.c:914: E = 10*E + digit;
src/fread.c:1256: int nbit = 8*sizeof(char *); // #nocov
src/fread.c:1655: if (jump0size*100*2 < sz) nJumps=100; // 100 jumps * 100 lines = 10,000 line sample
src/fread.c:1656: else if (jump0size*10*2 < sz) nJumps=10;
src/fread.c:1663: else DTPRINT(_("(%"PRIu64" bytes from row 1 to eof) / (2 * %"PRIu64" jump0size) == %"PRIu64"\n"),
src/fread.c:1664: (uint64_t)sz, (uint64_t)jump0size, (uint64_t)(sz/(2*jump0size)));
src/fread.c:1687: if (ch<lastRowEnd) ch=lastRowEnd; // Overlap when apx 1,200 lines (just over 11*100) with short lines at the beginning and longer lines near the end, #2157
src/fread.c:1823: allocnrow = clamp_szt((size_t)(bytesRead / fmax(meanLineLen - 2*sd, minLen)),
src/fread.c:1824: (size_t)(1.1*estnrow), 2*estnrow);
src/fread.c:1833: DTPRINT(_(" Initial alloc = %"PRIu64" rows (%"PRIu64" + %d%%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]\n"),
src/fread.c:1973: size_t chunkBytes = umax((size_t)(1000*meanLineLen), 1ULL/*MB*/ *1024*1024);
src/fread.c:2030: .buff8 = malloc(rowSize8 * myBuffRows + 8),
src/fread.c:2031: .buff4 = malloc(rowSize4 * myBuffRows + 4),
src/fread.c:2032: .buff1 = malloc(rowSize1 * myBuffRows + 1),
src/fread.c:2102: ctx.buff8 = realloc(ctx.buff8, rowSize8 * myBuffRows + 8);
src/fread.c:2103: ctx.buff4 = realloc(ctx.buff4, rowSize4 * myBuffRows + 4);
src/fread.c:2104: ctx.buff1 = realloc(ctx.buff1, rowSize1 * myBuffRows + 1);
src/fread.c:2453: DTPRINT(_("%8.3fs (%3.0f%%) Memory map %.3fGB file\n"), tMap-t0, 100.0*(tMap-t0)/tTot, 1.0*fileSize/(1024*1024*1024));
src/fread.c:2460: tAlloc-tColType, 100.0*(tAlloc-tColType)/tTot, (uint64_t)allocnrow, ncol, DTbytes/(1024.0*1024*1024), (uint64_t)DTi, 100.0*DTi/allocnrow);
src/fread.c:2464: tReread-tAlloc, 100.0*(tReread-tAlloc)/tTot, nJumps, nSwept, (double)chunkBytes/(1024*1024), (int)(DTi/nJumps), nth);
src/fifelse.c:167: REPROTECT(cons = eval(SEXPPTR_RO(args)[2*i], rho), Icons);
src/fifelse.c:168: REPROTECT(outs = eval(SEXPPTR_RO(args)[2*i+1], rho), Iouts);
src/fifelse.c:173: error("Argument #%d must be logical.", 2*i+1);
src/fwrite.c:376: ch += 7 + 2*!squashDateTime;
src/fwrite.c:389: ch += 8 + 2*!squashDateTime;
src/fwrite.c:614: size_t buffSize = (size_t)1024*1024*args.buffMB;
src/fwrite.c:645: size_t maxLineLen = eolLen + args.ncol*(2*(doQuote!=0) + 1/*sep*/);
src/fwrite.c:648: maxLineLen += 2*(doQuote!=0/*NA('auto') or true*/) + 1/*sep*/;
src/fwrite.c:782: if (maxLineLen*2>buffSize) { buffSize=2*maxLineLen; rowsPerBatch=2; }
src/fwrite.c:910: int used = 100*((double)(ch-myBuff))/buffSize; // percentage of original buffMB
Also BTW, for bugs like this that require very large data to reproduce, should we build test cases for them?
For this bug, a minimal reproducible example is
library(data.table)
n <- 1500000000  # must be at least 1073741824 rows so that nrow*2 overflows a 32-bit int
ngrp <- 4000
dt <- data.table(group = sample.int(ngrp, n, replace = TRUE), x = runif(n))
res <- dt[, .(
xmedian = median(x, na.rm = TRUE)
), keyby = group]
but dt is 18GB and creating it takes 5-10 minutes.
You can get a faster example by getting rid of the random functions... just use rep() and c(). I actually did try to test this.
The difficulty is that it requires a very large amount of memory...
I tried
dt <- data.table(group = rep(seq_len(ngrp), each = n / ngrp), x = numeric(n))
and
dt <- data.table(group = rep(seq_len(ngrp), each = n / ngrp), x = rep(rnorm(1000), each = n / 1000))
Both are much faster but do not reproduce the error. 👀
Instead of rep(..., each =), remove the each. When the groupings are sorted, the part of the code that uses TMP isn't used.
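Putting that advice together, a hedged sketch of a cheaper reproducer (a sketch only, untested; it still needs roughly 12GB, and 1073741824L is the smallest row count for which nrow*2 overflows int):
library(data.table)
n <- 1073741824   # smallest n for which nrow*2 overflows a 32-bit int
ngrp <- 4000
dt <- data.table(
  group = rep(seq_len(ngrp), length.out = n),  # groups cycle rather than sort, so the TMP path is exercised
  x = numeric(n)                               # constant column avoids the cost of runif()
)
res <- dt[, .(xmedian = median(x, na.rm = TRUE)), keyby = group]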
I have a similar problem. When running my R script on a Windows SQL Server instance (R version 3.5.2 and data.table 1.12.0), the following error occurs:
Error in forderv(ans, cols, sort = TRUE, retGrp = FALSE, order = if (decreasing) -order else order, : Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=8 Call: source ... [ -> [.data.table -> eval -> eval -> forder -> forderv
which is thrown by the following line of forder.c:
src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);
The script joins two data tables (X and Y) with an on=.(Id, date>=start_date, date<=end_date) statement and uses by=.EACHI for the operation. When I run the same script in my local RStudio, there is no error. Do you think setting "Id" and "date" resp. "start_date" and "end_date" as keys in X and Y would prevent the integer overflow? Alternatively, would changing by=.EACHI to keyby=.EACHI do the trick? Thank you in advance.
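For context, a minimal sketch of the kind of join being described (the tables, columns, and values here are illustrative, not taken from the actual script):
library(data.table)
X <- data.table(Id = c(1L, 1L, 2L),
                date = as.IDate(c("2020-01-05", "2020-02-15", "2020-01-10")))
Y <- data.table(Id = c(1L, 2L),
                start_date = as.IDate(c("2020-01-01", "2020-01-01")),
                end_date = as.IDate(c("2020-01-31", "2020-01-31")))
# non-equi (range) join; by = .EACHI evaluates j once per row of Y
X[Y, on = .(Id, date >= start_date, date <= end_date), .(matches = .N), by = .EACHI]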
Do you think setting "Id" and "date" resp. "start_date" and "end_date" as keys in X and Y would prevent the integer overflow? Alternatively, would changing by=.EACHI to keyby=.EACHI do the trick?
I don't have experience with MSSQL Server and R, but I don't think it will solve your problem. Is it expensive to give it a try?
I gave it a try and it did not work out. Do you know what might help here? Could it have something to do with the dependencies between data.table and bit64, since dependencies are sometimes distorted in MSSQL?
In my opinion, it should have nothing to do with bit64, because bit64 hasn't been updated for 3 years. I have some (limited) suggestions for you:
data.table::setDTthreads(1L)
Turning off multiple threads did not work. Sometimes the error changes to "invalid BXL stream", which can be attributed to not having enough memory. Would you know if the error
src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);
can be caused by a lack of RAM?
@scharlatan0139 yes, it does look exactly like an error caused by lack of RAM.
@shrektan, is the right way to install your fix remotes::install_github("Rdatatable/data.table#fix4295")? I tried this but got an invalid repo error.
@Debasis5 It should be
remotes::install_github("rdatatable/data.table#4297")
or
remotes::install_github("Rdatatable/data.table@fix4295")
I'm working with a data.table of 1423657324 rows and 14 columns. There's an integer group (grp) with 3790 unique values. When I do the following, the following error occurs:
This does not occur when a shorter version of the data (1/3 of the size) is aggregated in the same way.