edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

error in unique #57

Open marejp opened 5 years ago

marejp commented 5 years ago

Setup (Windows 10):

When running the unique sample from CRAN I get:

unique.ff

    unique(ffiris$Sepal.Length)
    Error in if (by < 1) stop("'by' must be > 0") :
      missing value where TRUE/FALSE needed
    In addition: Warning message:
    In chunk.default(from = 1L, to = 300L, by = c(double = 23058430092136940), :
      NAs introduced by coercion to integer range

ffbase version: 0.12.7
ff version: 2.2-14
bit version: 1.1-14
fastmatch version: 1.1-0

marejp commented 5 years ago

The above works fine on the following:

Windows 7:

edwindj commented 5 years ago

Thanks for reporting! Seems related to #56 . Will dig into it later this week.

edwindj commented 5 years ago

I cannot reproduce the bug on Rhub (which runs on Windows 2008 SP2), but don't despair...

Technically it is in the realm of ff (and not ffbase), but I do have a hunch what the problem might be, based on the error message and a glance at the ff code (which is not mine).

ff uses chunking to process large vectors and data.frames. The size of a chunk is determined by the option "ffbatchbytes". It seems that on your Windows 10 machine(s) the value of that option isn't set correctly. Maybe it is because you are using 32-bit R (so one option is to switch to 64-bit).
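As a minimal sketch of the failure mode (inferred from the warning text, not taken from the ff sources): the oversized `by` value lies far beyond R's integer range, so coercing it to integer yields NA, and the subsequent `if (by < 1)` check then fails exactly as in the report.

```r
# The reported 'by' value exceeds .Machine$integer.max by many orders of
# magnitude, so as.integer() returns NA ("NAs introduced by coercion to
# integer range"), and 'if (NA < 1)' errors with
# "missing value where TRUE/FALSE needed".
by <- 23058430092136940
by_int <- suppressWarnings(as.integer(by))
is.na(by_int)  # TRUE: the coercion produced NA

# Reproduces the same error as in the issue:
res <- try(if (by_int < 1) stop("'by' must be > 0"), silent = TRUE)
inherits(res, "try-error")  # TRUE
```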

ff sets this value automatically when library(ff) is called (see the following code, copied from ff:::.onLoad()):

    if (is.null(getOption("ffmaxbytes"))) {
        if (.Platform$OS.type == "windows") {
            if (getRversion() >= "2.6.0")
                options(ffmaxbytes = 0.5 * memory.limit() * (1024^2))
            else options(ffmaxbytes = 0.5 * memory.limit())
        } else {
            options(ffmaxbytes = 0.5 * 1024^3)
        }
    }

I suggest you set the ffmaxbytes option manually and try to run the examples again:

    # e.g. 500 MB
    options(ffmaxbytes = 500 * (1024^2))
marejp commented 5 years ago

Hi Edwin.

Thank you for the feedback. The solution isn't working yet, but we're a step closer.

This is the situation at the moment (all on Windows 10):

I'm selecting the 64-bit version of R in RStudio.

Regards.

jwijffels commented 5 years ago

I haven't got a Windows 10 machine myself, but the problem clearly comes from ff::chunk, namely from ff::chunk.ff_vector, which is defined as follows.

The relevant part is this: b <- BATCHBYTES %/% RECORDBYTES. On your machine this calculation apparently gives 23058430092136940, for reasons beyond my understanding (given that you report it works in Rgui but not in RStudio).

You could probably get around this by changing the option ffbatchbytes to something like options(ffbatchbytes = 84882227), which is the number I have on my oldskool Windows 7.
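A quick diagnostic sketch along those lines (run library(ff) first; the idea is to compare the option values between the working and broken setups, then pin ffbatchbytes to the value suggested above):

```r
# Inspect the chunking-related options on the affected machine; on the
# broken Windows 10 setup the reported symptoms suggest ffbatchbytes is
# an absurdly large number.
getOption("ffbatchbytes")
getOption("ffmaxbytes")

# Workaround: pin ffbatchbytes to a sane value (the one reported from a
# working Windows 7 machine).
options(ffbatchbytes = 84882227)
getOption("ffbatchbytes")  # now 84882227
```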

    function (x, RECORDBYTES = .rambytes[vmode(x)], BATCHBYTES = getOption("ffbatchbytes"),
        ...)
    {
        n <- length(x)
        if (n) {
            l <- list(...)
            if (is.null(l$from))
                l$from <- 1L
            if (is.null(l$to))
                l$to <- n
            if (is.null(l$by) && is.null(l$len)) {
                b <- BATCHBYTES%/%RECORDBYTES
                if (b == 0L) {
                    b <- 1L
                    warning("single record does not fit into BATCHBYTES")
                }
                l$by <- b
            }
            l$maxindex <- n
            ret <- do.call("chunk.default", l)
        }
        else {
            ret <- list()
        }
        ret
    }
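To make the chunk-size arithmetic above concrete, here is a sketch with plausible values (the 8 bytes per record for a double vector is an assumption about .rambytes, not taken from the issue):

```r
# b <- BATCHBYTES %/% RECORDBYTES from chunk.ff_vector, worked through
# with a sane ffbatchbytes value.
BATCHBYTES  <- 84882227   # the value reported from a working Windows 7 box
RECORDBYTES <- 8          # assumed bytes per record for a double vector
b <- BATCHBYTES %/% RECORDBYTES
b  # 10610278 records per chunk
```

With a healthy ffbatchbytes, b stays comfortably inside integer range; only a corrupted BATCHBYTES can push it to values like 23058430092136940.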