I would imagine median could be very memory hungry, which is why you usually see approx_quantile in "big data" SQL and median itself is dropped entirely. Do you get the same issue if you try to sort the data by col1? If not, it may be that gmedian is trying to do all three columns at once. As a workaround, do ft[, .N, keyby = .(grp, col1)] and get the median from the frequency table, since you said there are far fewer unique values.
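A minimal sketch of that frequency-table workaround, assuming ft, grp and col1 are the names used above and col1 has relatively few unique values (the toy data here is illustrative):
library(data.table)
ft <- data.table(grp = rep(1:2, each = 6),
                 col1 = c(5L, 1L, 3L, 3L, 2L, 5L, 7L, 7L, 8L, 9L, 9L, 9L))  # toy stand-in
freq <- ft[, .N, keyby = .(grp, col1)]  # frequency table, sorted by grp then col1
med <- freq[, {
  cum <- cumsum(N)                                  # cumulative counts within the group
  tot <- cum[length(cum)]                           # total rows in the group
  lo  <- col1[which(cum >= ceiling(tot / 2))[1]]    # lower middle value
  hi  <- col1[which(cum >= floor(tot / 2) + 1)[1]]  # upper middle value
  .(col1_median = (lo + hi) / 2)                    # mean of the two middles = the median
}, by = grp]
med  # grp 1 -> 3.0, grp 2 -> 8.5, matching median() on the raw values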
Will changing the median to stats::median help? It should prevent data.table from optimizing with gforce.
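For reference, a small sketch of how to check whether gforce kicks in (toy table and column names are illustrative; datatable.optimize and the verbose argument are standard data.table features, but the exact verbose wording may vary by version):
library(data.table)
DT <- data.table(grp = rep(1:2, each = 5), col1 = rnorm(10))    # toy stand-in
DT[, .(m = median(col1)), keyby = grp, verbose = TRUE]          # verbose output reports whether gforce (gmedian) is used
DT[, .(m = stats::median(col1)), keyby = grp, verbose = TRUE]   # the qualified call is not gforce-optimized
options(datatable.optimize = 1L)                                # alternatively, lower the optimization level to turn gforce off globally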
Disabling gforce or using stats::median does not trigger this error, but I'm still curious why it occurs. My server has 1TB of memory, which should be enough for this.
How about stats::median(c(col1, col2, col3))? (Just to check if tripling the memory footprint has an impact.)
stats::median(c(col1, col2, col3)) works quite normally and nothing special occurs.
Well, the error is thrown from the TMP allocation in src/gsumm.c (int *TMP = malloc(nrow*2*sizeof(int));). However, I also see that nrow is declared as an int. It looks like this will overflow for 1423657324 * 2...
@renkun-ken Would you mind running a test? Just cut the row count of your data to 1073741823L / 1073741824L respectively and try your original code. I expect the first case (1073741823L rows) to work but the second to fail. I believe that's the cause: the overflow leads to an enormous memory allocation...
Rcpp::cppFunction("size_t test(int x) {
  // x*2 is evaluated in int arithmetic, so it overflows before the widening multiply by sizeof(int)
  return x*2*sizeof(int);
}")
test(1073741823L)
#> [1] 8589934584
test(1073741824L)
#> [1] 1.844674e+19
test(1423657324L)
#> [1] 1.844674e+19
Created on 2020-03-10 by the reprex package (v0.3.0)
Yes, the 1073741823L case works perfectly while the 1073741824L case fails, as you expected.
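For comparison, a hedged sketch (not the actual patch) of the kind of change that avoids the overflow, namely widening to size_t before the multiply; the change in the PR may differ:
Rcpp::cppFunction("size_t test_fixed(int x) {
  // casting x first makes the whole product size_t arithmetic, so it cannot overflow int
  return (size_t)x*2*sizeof(int);
}")
test_fixed(1073741824L)  # 2147483648 * 4 = 8589934592 bytes (~8 GiB), as intended
test_fixed(1423657324L)  # 2847314648 * 4 = 11389258592 bytes (~10.6 GiB)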
@renkun-ken It would be great if you could verify that PR #4297 does fix this issue, when you have time, of course.
@shrektan Thanks! I'll verify it soon.
@shrektan Sadly it's too hard to fetch the repo:
remote: Enumerating objects: 1431, done.
remote: Counting objects: 100% (1431/1431), done.
remote: Compressing objects: 100% (181/181), done.
Timeout, server github.com not responding.
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
Tried many times but no luck.
That's the network issue I encountered before... so I packaged the source and sent it to your email (I assume you are able to access your personal email).
Thanks for your nice packaging! My git fetch worked magically as I was sleeping last night.
I retried and can confirm that the bug is fixed with the PR. Good work!
BTW, do you suspect there are other similar int overflow problems like this?
Here are the other 60 or so places with a similar pattern in the code (I haven't combed through them):
grep -Enr "[^.][0-9]+\s*[*]" src --include=*.c | grep -Ev "^src/[a-z]+[.]c:[0-9]+:\s*[/][/]"
src/forder.c:260: memset(thiscounts, 0, 256*sizeof(int));
src/forder.c:380: dmask = dround ? 1 << (8*dround-1) : 0;
src/forder.c:425: memset(stat, 0, 257*sizeof(uint64_t));
src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);
src/forder.c:1012: memcpy(my_starts, my_starts_copy, 256*sizeof(uint16_t)); // restore starting offsets
src/forder.c:1051: uint8_t *ugrps = malloc(nBatch*256*sizeof(uint8_t));
src/frank.c:93: dans[xorder[j]-1] = (2*xstart[i]+xlen[i]-1)/2.0;
src/frank.c:126: int offset = 2*xstart[i]+xlen[i]-2;
src/gsumm.c:115: int *TMP = malloc(nrow*2*sizeof(int));
src/gsumm.c:132: int *restrict my_tmp = TMP + b*2*batchSize;
src/gsumm.c:135: int *p = my_tmp + 2*my_counts[w]++;
src/gsumm.c:146: const int *restrict p = TMP + b*2*batchSize + start*2;
src/fsort.c:125: if (batchSize < 1024) batchSize = 1024; // simple attempt to work reasonably for short vector. 1024*8 = 2 4kb pages
src/fsort.c:178: (int)(nBatch*MSBsize*sizeof(R_xlen_t)/(1024*1024)),
src/fsort.c:179: (int)(nBatch*MSBsize*sizeof(R_xlen_t)/(4*1024*nBatch)),
src/assign.c:21: SETLENGTH(x,50+n*2*sizeof(void *)/4); // 1*n for the names, 1*n for the VECSXP itself (both are over allocated).
src/assign.c:575: char *s5 = (char*) malloc(strlen(tc2) + 5); //4 * '_' + \0
src/bmerge.c:496: ival.d-xval.d == rollabs /*#1007*/))
src/bmerge.c:510: xval.d-ival.d == rollabs /*#1007*/))
src/fread.c:194: char *ptr = buf + 501 * flip;
src/fread.c:282: const char *mostConsumed = start; // tests 1550* includes both 'na' and 'nan' in nastrings. Don't stop after 'na' if 'nan' can be consumed too.
src/fread.c:375: ans = (double) tp.tv_sec + 1e-9 * (double) tp.tv_nsec;
src/fread.c:379: ans = (double) tv.tv_sec + 1e-6 * (double) tv.tv_usec;
src/fread.c:434: mmp_copy = (char *)malloc((size_t)fileSize + 1/* extra \0 */);
src/fread.c:596: acc = 10*acc + digit;
src/fread.c:628: acc = 10*acc + digit;
src/fread.c:693: acc = 10*acc + digit;
src/fread.c:727: acc = 10*acc + digit;
src/fread.c:914: E = 10*E + digit;
src/fread.c:1256: int nbit = 8*sizeof(char *); // #nocov
src/fread.c:1655: if (jump0size*100*2 < sz) nJumps=100; // 100 jumps * 100 lines = 10,000 line sample
src/fread.c:1656: else if (jump0size*10*2 < sz) nJumps=10;
src/fread.c:1663: else DTPRINT(_("(%"PRIu64" bytes from row 1 to eof) / (2 * %"PRIu64" jump0size) == %"PRIu64"\n"),
src/fread.c:1664: (uint64_t)sz, (uint64_t)jump0size, (uint64_t)(sz/(2*jump0size)));
src/fread.c:1687: if (ch<lastRowEnd) ch=lastRowEnd; // Overlap when apx 1,200 lines (just over 11*100) with short lines at the beginning and longer lines near the end, #2157
src/fread.c:1823: allocnrow = clamp_szt((size_t)(bytesRead / fmax(meanLineLen - 2*sd, minLen)),
src/fread.c:1824: (size_t)(1.1*estnrow), 2*estnrow);
src/fread.c:1833: DTPRINT(_(" Initial alloc = %"PRIu64" rows (%"PRIu64" + %d%%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]\n"),
src/fread.c:1973: size_t chunkBytes = umax((size_t)(1000*meanLineLen), 1ULL/*MB*/ *1024*1024);
src/fread.c:2030: .buff8 = malloc(rowSize8 * myBuffRows + 8),
src/fread.c:2031: .buff4 = malloc(rowSize4 * myBuffRows + 4),
src/fread.c:2032: .buff1 = malloc(rowSize1 * myBuffRows + 1),
src/fread.c:2102: ctx.buff8 = realloc(ctx.buff8, rowSize8 * myBuffRows + 8);
src/fread.c:2103: ctx.buff4 = realloc(ctx.buff4, rowSize4 * myBuffRows + 4);
src/fread.c:2104: ctx.buff1 = realloc(ctx.buff1, rowSize1 * myBuffRows + 1);
src/fread.c:2453: DTPRINT(_("%8.3fs (%3.0f%%) Memory map %.3fGB file\n"), tMap-t0, 100.0*(tMap-t0)/tTot, 1.0*fileSize/(1024*1024*1024));
src/fread.c:2460: tAlloc-tColType, 100.0*(tAlloc-tColType)/tTot, (uint64_t)allocnrow, ncol, DTbytes/(1024.0*1024*1024), (uint64_t)DTi, 100.0*DTi/allocnrow);
src/fread.c:2464: tReread-tAlloc, 100.0*(tReread-tAlloc)/tTot, nJumps, nSwept, (double)chunkBytes/(1024*1024), (int)(DTi/nJumps), nth);
src/fifelse.c:167: REPROTECT(cons = eval(SEXPPTR_RO(args)[2*i], rho), Icons);
src/fifelse.c:168: REPROTECT(outs = eval(SEXPPTR_RO(args)[2*i+1], rho), Iouts);
src/fifelse.c:173: error("Argument #%d must be logical.", 2*i+1);
src/fwrite.c:376: ch += 7 + 2*!squashDateTime;
src/fwrite.c:389: ch += 8 + 2*!squashDateTime;
src/fwrite.c:614: size_t buffSize = (size_t)1024*1024*args.buffMB;
src/fwrite.c:645: size_t maxLineLen = eolLen + args.ncol*(2*(doQuote!=0) + 1/*sep*/);
src/fwrite.c:648: maxLineLen += 2*(doQuote!=0/*NA('auto') or true*/) + 1/*sep*/;
src/fwrite.c:782: if (maxLineLen*2>buffSize) { buffSize=2*maxLineLen; rowsPerBatch=2; }
src/fwrite.c:910: int used = 100*((double)(ch-myBuff))/buffSize; // percentage of original buffMB
Also BTW, for bugs like this that require very large data to reproduce, should we build test cases for them?
For this bug, a minimal reproducible example is
library(data.table)
n <- 1500000000  # must be at least 1073741824 rows so that nrow*2 overflows a 32-bit int
ngrp <- 4000
dt <- data.table(group = sample.int(ngrp, n, replace = TRUE), x = runif(n))
res <- dt[, .(
xmedian = median(x, na.rm = TRUE)
), keyby = group]
but dt is 18GB and creating it takes 5-10 minutes.
You can get a faster example by getting rid of the random functions... just use rep() and c(). I actually did try to test this.
The difficulty is that it requires a very large amount of memory...
I tried
dt <- data.table(group = rep(seq_len(ngrp), each = n / ngrp), x = numeric(n))
and
dt <- data.table(group = rep(seq_len(ngrp), each = n / ngrp), x = rep(rnorm(1000), each = n / 1000))
Both are much faster but do not reproduce the error. 👀
Instead of rep(..., each =), remove the each. When the groupings are sorted, the part of the code that uses TMP isn't used.
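Putting that advice together, a hedged sketch of a cheaper reproducer (a sketch only, untested; it still needs roughly 12GB, and 1073741824L is the smallest row count for which nrow*2 overflows int):
library(data.table)
n <- 1073741824   # smallest n for which nrow*2 overflows a 32-bit int
ngrp <- 4000
dt <- data.table(
  group = rep(seq_len(ngrp), length.out = n),  # groups cycle rather than sort, so the TMP path is exercised
  x = numeric(n)                               # constant column avoids the cost of runif()
)
res <- dt[, .(xmedian = median(x, na.rm = TRUE)), keyby = group]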
I have a similar problem. When running my R script on a Windows SQL Server instance (R version 3.5.2 and data.table 1.12.0), the following error occurs:
Error in forderv(ans, cols, sort = TRUE, retGrp = FALSE, order = if (decreasing) -order else order, : Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=8 Call: source ... [ -> [.data.table -> eval -> eval -> forder -> forderv
which is thrown by the following line of forder.c:
src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);
The script joins two data tables (X and Y) with an on=.(Id, date>=start_date, date<=end_date) statement and uses by=.EACHI for the operation. When I run the same script in my local RStudio, there is no error. Do you think setting "Id" and "date" resp. "start_date" and "end_date" as keys in X and Y would prevent the integer overflow? Alternatively, would changing by=.EACHI to keyby=.EACHI do the trick? Thank you in advance.
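For context, a minimal sketch of the kind of join being described (the tables, columns, and values here are illustrative, not taken from the actual script):
library(data.table)
X <- data.table(Id = c(1L, 1L, 2L),
                date = as.IDate(c("2020-01-05", "2020-02-15", "2020-01-10")))
Y <- data.table(Id = c(1L, 2L),
                start_date = as.IDate(c("2020-01-01", "2020-01-01")),
                end_date = as.IDate(c("2020-01-31", "2020-01-31")))
# non-equi (range) join; by = .EACHI evaluates j once per row of Y
X[Y, on = .(Id, date >= start_date, date <= end_date), .(matches = .N), by = .EACHI]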
Do you think setting "Id" and "date" resp. "start_date" and "end_date" as keys in X and Y would prevent the integer overflow? Alternatively, would changing by=.EACHI to keyby=.EACHI do the trick?
I don't have experience with MSSQL Server and R, but I don't think it will solve your problem. Is it expensive to give it a try?
I gave it a try and it did not work out. Do you know what might help here? Could it have something to do with the dependencies between data.table and bit64, since dependencies are sometimes distorted in MSSQL?
In my opinion, it should have nothing to do with bit64, because bit64 hasn't been updated for 3 years. I have some (limited) suggestions for you:
data.table::setDTthreads(1L)
Turning off multiple threads did not work. Sometimes the error changes to "invalid BXL stream", which can be attributed to not having enough memory. Would you know if the error
src/forder.c:728: if (!TMP || !UGRP /*|| TMP%64 || UGRP%64*/) STOP(_("Failed to allocate TMP or UGRP or they weren't cache line aligned: nth=%d"), nth);
can be caused by a lack of RAM?
@scharlatan0139 yes, it does look exactly like an error caused by lack of RAM.
@shrektan, is the right way to install your fix remotes::install_github("Rdatatable/data.table#fix4295")? I tried this but got an invalid repo error.
@Debasis5 It should be
remotes::install_github("rdatatable/data.table#4297")
or
remotes::install_github("Rdatatable/data.table@fix4295")
I'm working with a data.table of 1423657324 rows and 14 columns. There's an integer group (grp) with 3790 unique values. When I do the following, the following error occurs:
This does not occur when a shorter version of the data (1/3 of the size) is aggregated in the same way.