edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

with() can give wrong results #10

Open nalimilan opened 11 years ago

nalimilan commented 11 years ago

This looks very serious.

couples is an ffdf which is the result of merging adu with another ffdf, using all.x=TRUE. Thus, all the variables below are the same in couples and adu, and the number of rows is the same.

> identical(adu$C01AN[], couples$C01AN[])
[1] TRUE
> identical(table(adu$C01AN[]), table(couples$C01AN[]))
[1] TRUE
> identical(with(adu, table(C01AN)), with(couples, table(C01AN)))
[1] FALSE
> all.equal(with(adu, table(C01AN)), with(couples, table(C01AN)), check.attributes=FALSE)
[1] "Numeric: lengths (87, 85) differ"
> identical(with(adu, table(C01AN[])), with(couples, table(C01AN[])))
[1] FALSE
> all.equal(with(adu, table(C01AN[])), with(couples, table(C01AN[])), check.attributes=FALSE)
[1] "Numeric: lengths (87, 85) differ"
> all.equal(with(adu, table(as.ram(C01AN))), with(couples, table(as.ram(C01AN))), check.attributes=FALSE)
[1] "Numeric: lengths (87, 85) differ"
> all.equal(table(as.ram(adu$C01AN)), table(as.ram(couples$C01AN)), check.attributes=FALSE)
[1] TRUE
> with(adu, table(C01AN[]))

1893 1894 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 
   1    1    2    2    4    4   13   21   28   46   42   85   91  142  185  212  279  304  408  515  565  436  385  458  514  674 1221 1231 1176 1241 1318 1334 
1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 
1408 1448 1480 1516 1654 1620 1598 1591 1625 1604 1573 1572 1559 1556 1561 1417 1630 1706 1851 1792 2287 2470 2530 2570 2528 2468 2557 2457 2420 2421 2501 2477 
1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 
2444 2504 2518 2408 2398 2511 2473 2461 2493 2262 2272 2321 2345 2369 2268 2264 2163 1936 1888 2035 1993 2050 2066 
> with(couples, table(C01AN[]))

1893 1896 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 
   1    1    3    2    6   10   11   23   20   38   33   50   77   95  133  144  196  223  290  191  185  223  238  320  557  553  523  590  606  598  599  674 
1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 
 664  662  771  744  707  696  726  690  673  676  662  665  678  608  718  724  772  787  972 1015 1072 1100 1084 1029 1086  951  956 1000 1079 1037 1015 1034 
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 
1030  984  951 1029  999  969 1052  979  950  977  980  989  987 1001  972  871  864  900  875  877  853 

> table(adu$C01AN[])

1893 1894 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 
   1    1    2    3   15   15   30   51   79  129  140  228  307  388  523  640  816  984 1237 1495 1685 1326 1162 1303 1601 1959 3557 3700 3571 3772 3919 4172 
1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 
4412 4417 4470 4515 4888 4832 4999 4723 4916 4780 4696 4805 4813 4804 4613 4288 4797 5104 5156 5093 6753 7351 7399 7621 7542 7144 7482 7406 7365 7280 7368 7415 
1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 
7478 7549 7628 7613 7507 7492 7752 7482 7477 7066 7050 7064 7263 7297 6975 6826 6381 5867 5496 5820 5741 5793 5806 
> table(couples$C01AN[])

1893 1894 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 
   1    1    2    3   15   15   30   51   79  129  140  228  307  388  523  640  816  984 1237 1495 1685 1326 1162 1303 1601 1959 3557 3700 3571 3772 3919 4172 
1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 
4412 4417 4470 4515 4888 4832 4999 4723 4916 4780 4696 4805 4813 4804 4613 4288 4797 5104 5156 5093 6753 7351 7399 7621 7542 7144 7482 7406 7365 7280 7368 7415 
1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 
7478 7549 7628 7613 7507 7492 7752 7482 7477 7066 7050 7064 7263 7297 6975 6826 6381 5867 5496 5820 5741 5793 5806 

Note how value 1894 is missing from the output of with(couples, table(C01AN[])), while it is in that of table(couples$C01AN[]), and actually in the vector:

> which(couples$C01AN[] == 1894)
[1] 86099
> which(adu$C01AN[] == 1894)
[1] 86099

More interestingly:

> identical(with(adu, table(C01AN[1:54355])), with(couples, table(C01AN[1:54355])))
[1] TRUE
> identical(with(adu, table(C01AN[1:54356])), with(couples, table(C01AN[1:54356])))
[1] FALSE

> with(adu, C01AN[54356])
ff (open) integer length=380481 (380481)
     [1]      [2]      [3]      [4]      [5]      [6]      [7]      [8]          [380474] [380475] [380476] [380477] [380478] [380479] [380480] [380481] 
    1975        0        0        0        0        0        0        0        :     1976     1976     1976     1976     1976     1976     1976     1976 
> with(couples, C01AN[54356])
ff (open) integer length=380481 (380481)
     [1]      [2]      [3]      [4]      [5]      [6]      [7]      [8]          [380474] [380475] [380476] [380477] [380478] [380479] [380480] [380481] 
      NA        0        0        0        0        0        0        0        :       NA       NA       NA       NA       NA       NA       NA       NA 

So it looks like element 54356 is the first one that differs from with.ffdf()'s point of view. And indeed:

> chunk(couples)
[[1]]
range index (ri) from 1 to 54355 maxindex 380481 

[[2]]
range index (ri) from 54356 to 108710 maxindex 380481 
[...]

So something is broken in the chunking logic of with.ffdf(). Do you think there might be more bugs like that lurking around? That's pretty annoying when you run analyses using it (or other features of ffbase)...

jwijffels commented 11 years ago

Hi, I think with.ffdf does not handle objects of class table. It currently only handles expressions that return vectors, factors, Dates, POSIXct and data.frames, not tables or other objects. I think that might be why you are finding differences between your 2 ffdf objects when you use with.ffdf together with table.

This makes sense, I believe: if you used with.ffdf to get another class, say lm or glm, you would need a way to combine the results of the different chunks (like combining the results of lm or glm objects).

If your purpose is to test whether merge.ffdf did the right job, maybe you can construct other tests apart from using table inside with.ffdf. Maybe use table.ff directly, or table.ff(as.character(yourffvector))?

Correct me if I'm wrong here.
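A minimal sketch of that workaround (the data here is an invented toy stand-in for the real ffdf objects; it assumes ffbase's table.ff accepts an ff vector, as suggested above):

```r
library(ffbase)

# toy ffdf standing in for adu/couples; only the C01AN column is mimicked
df <- as.ffdf(data.frame(C01AN = sample(1900:1910, 100, replace = TRUE)))

# tabulate the ff vector directly, bypassing with.ffdf entirely
table.ff(df$C01AN)

# or via character, as suggested above
table.ff(as.character(df$C01AN))
```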

nalimilan commented 11 years ago

No, my purpose was, well... to create cross tables. ;-)

I don't think the problem is specific to tables: as you can see, with(couples, C01AN[54356]) returns an ff vector (why not just one value, I don't understand) whose first value is NA, which is not the case with with(adu, C01AN[54356]).

jwijffels commented 11 years ago

If you just want cross tables, table.ff(as.character(couples$C01AN), as.character(adu$C01AN)) will do.

Apart from that cross classification:

Basically with.ffdf was designed with adding another column to an ffdf in mind, with the same length as the ffdf. In with.ffdf, there is this code before the chunking starts:

length(res) <- nrow(data)

which reflects the idea that the expression should return something of the same length.
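To illustrate that design (a sketch with invented toy data, not code from the package): an expression yielding one value per row fits the chunk-and-append model, while table() does not:

```r
library(ffbase)
d <- as.ffdf(data.frame(a = 1:10, b = 10:1))

# fits the design: one result per row, so chunk results can be appended
# into an ff vector with length(res) == nrow(d)
d$ab <- with(d, a + b)

# does not fit: table() returns one cell per level, not one per row,
# which is exactly the mismatch reported in this issue
with(d, table(a))
```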

jwijffels commented 11 years ago

Maybe we should add checks which stop the execution of with.ffdf if the expression returns something other than vectors, factors, Dates, POSIXct and data.frames, or if the total length of the returned value does not equal the nrow or length of the original data.

nalimilan commented 11 years ago

Sure, I can use table.ff(), and I do most of the time. The with() syntax is just a habit I've gotten into to save typing the name of the data frame repeatedly. I don't think with.ffdf() should necessarily work in that case, but if it does, it must return correct results.

Your suggestion of checking the length and type of the return value makes sense. But maybe you could simply return the RAM object when it does not meet the conditions. This would allow both uses of with.ffdf(). Anyway, if it does not support all common use cases of with(), I think the docs need to be more explicit about that.

edwindj commented 11 years ago

with.ffdf indeed assumes that the expression returns something of the same length as the ffdf.

Other cases don't work as they would for a normal R object. I will adjust the documentation. I could add a check that the with statement returns an atomic vector or data.frame (which will be converted to an ff or ffdf object). What is your opinion on that?

edwindj commented 11 years ago

Fixed it in github.

Could you please check if it works as expected? It passes our test cases, but we may have forgotten some.

Best

nalimilan commented 11 years ago

Thanks, but it does not seem to fix the problem, since a 1D table is atomic. So I get the same result, without any error:

identical(with(adu, table(C01AN)), with(couples, table(C01AN)))
[1] FALSE

I think you should check that the length of the result is equal to the length of the ffdf, as you said. Also, I don't think the docs should say that the result will not be correct: the function should simply trigger an error. No need to scare people by stating that results might be wrong... and this is safer.

But what about returning the R object without converting it to a ff vector if it does not have the correct length? I don't see how this could hurt...

edwindj commented 11 years ago

Changed it to stop when that happens. The problem is that the code is processed chunk-wise and in general it is not possible to know how to combine the results of each chunk. The code assumes it can append them (but whether that makes sense depends entirely on the expression that was given to with.ffdf).

However there is a (I admit, very unlikely) case where this will fail: if the length of the result equals the length of each chunk, but not the total length of the ffdf. E.g. if your table example happens to return a table of the same length as each chunk, the result will still be wrong. Do you have any suggestions on how to detect that one?

nalimilan commented 11 years ago

OK, the new code seems to catch the table case.

I don't really understand how each chunked result could have the same length as each chunk, but the final result not have the same length as the ffdf. I guess it depends on how you combine chunked results. But then why can't you compare the size of the final result to that of the ffdf? I'm missing something...

edwindj commented 11 years ago

Well it is unlikely, but the following can happen:

iris.ffdf <- as.ffdf(iris)
with(iris.ffdf, table(Species))  # fails because size table = 3

with(iris.ffdf, table(Species), by=3) # succeeds because the chunk size equals the table size, but the result is wrong compared to:
table(iris.ffdf$Species)

nalimilan commented 11 years ago

OK, I see how this works now. But I still don't get why with.ffdf() does not return a vector concatenating the tables obtained for all chunks of the ffdf. This would at least make obvious that something did not go as expected.

nalimilan commented 11 years ago

After using ffbase in more situations, and getting bitten once again, I really think with.ffdf() as it currently behaves should be renamed to something else. In base R, with() imposes no restriction on the dimensions of the return value, and nothing in the docs suggests that the return value should be similar to the input data frame. Thus, users' expectations about with.ffdf() are likely to lead to misunderstandings just like mine. with.ffdf() should just evaluate its arguments in an environment containing the variables of the ffdf object, and return the result whatever its dimensions.

I'm not even sure the current behavior is really useful: if you want to output a ff vector of the same length as the ffdf, you are likely to be modifying the ffdf, and thus you can use within.ffdf(), which is better suited for this case.
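For comparison, a sketch of the within.ffdf() pattern referred to here (invented toy data; it assumes within.ffdf mirrors base::within's modify-and-return semantics):

```r
library(ffbase)
d <- as.ffdf(data.frame(a = 1:10, b = 10:1))

# within() on an ffdf evaluates chunk by chunk and returns a modified
# copy, which suits the add-a-column use case directly
d2 <- within(d, ab <- a + b)
d2$ab[]   # the new column, materialized in RAM for inspection
```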

edwindj commented 11 years ago

I understand and agree. The thing is that the with.ffdf() you wish for is very difficult to implement efficiently: it is very inefficient for large ffdf's (since chunking is not possible). I will add a deprecation notice to with.ffdf since some users currently depend on it.

nalimilan commented 11 years ago

Yeah, unfortunately it may not be efficient. But for most use cases within.ffdf() is.

One case that would have been nice to handle is to make table.ff(df$a, df$b) as efficient as with.ffdf(df, table(a, b)). But for that you would need two versions of with(): one which converts ffdf columns to R vectors, and another which keeps them as ff vectors. Maybe with.ffdf() and ffwith()? Maybe not worth the pain?

jwijffels commented 11 years ago

I personally use with.ffdf a lot and I find it extremely useful, especially when working on groups of columns of an ffdf in one operation in a chunked manner. This is its most useful use case, as in with(myffdf[c("mycolumnawhichineed", "mycolumnbwhichineed", "mycolumncwhichineed")], somedifficultfunctionwhichreturnsavectororanffdf(mycolumnawhichineed, mycolumnbwhichineed, mycolumncwhichineed)). Currently with.ffdf works if the expression returns values of the same length or nrow as the input data. with.ffdf is basically slow if you do not indicate the columns you need in the operation, as it will then load all the data of the ffdf into RAM, while you only need the specific columns.


groeten/kind regards, Jan

Jan Wijffels Statistical Data Miner www.bnosac.be | +32 486 611708

nalimilan commented 11 years ago

But what are you doing with the returned ff vector? Don't you add it to the ffdf?

I agree that with.ffdf() will be slow if it is changed to work with any return value, but I think we can avoid converting all columns to RAM objects by using delayedAssign(). With this simple solution, I think only columns that are actually accessed will be converted to RAM.

jwijffels commented 11 years ago

Most of the time, they are added to the ffdf, but sometimes I just need the ff vector.

I don't know in which setting you are working, but in my settings I already have RAM issues when I load one ff vector of the ffdf completely into RAM. Will this be solved by delayedAssign()?


nalimilan commented 11 years ago

In my typical use cases I have from 100K to 5M rows so a few columns can be loaded in RAM without any problem. But keeping everything in RAM all the time makes R die sooner or later due to memory fragmentation, and it means my computer has almost no free RAM left. I understand in many other use cases loading a single column in RAM is absolutely not possible. delayedAssign() will not solve this at all.

If you sometimes need to get a ff vector of the same length as the ffdf's columns, but not assign it to a column of the ffdf, then indeed the current behavior of with.ffdf() is still needed. Maybe we can give it another name, as I suggested above.

jwijffels commented 11 years ago

I'm not so much in favor of renaming with.ffdf or changing its code as you describe, as it will give RAM issues for cases where one column does not fit into RAM.

edwindj commented 11 years ago

I created a separate branch, "with", in which I implemented "standard" with behavior. Could you both please check whether that branch works as expected?

All the tests in the code seem to work, except for the place with character operations (paste); I fixed that by replacing with with ffdfwith, which by the way does efficient chunking...

If branch "with" works as expected, I will merge it into the master branch. However this will be a breaking change, so I hope our users won't be too affected by that.

nalimilan commented 11 years ago

Thanks, this seems to work. So this solution does not imply converting columns to RAM objects at all, which is interesting when using functions like table() or sum() that have .ff versions. There are thus no worries about columns not fitting into RAM.

Maybe while we're at it an optional as.ram argument could still be useful for some cases where .ff functions are not available? I'm not sure. Anyway, here is an implementation using delayedAssign():

with.ffdf <- function (data, expr, ..., as.ram=FALSE) {
    if(isTRUE(as.ram)) {
        env <- new.env(parent = parent.frame())

        # local() gives each promise its own copy of `var`; otherwise all
        # promises would see the last value of the loop variable when forced.
        # ff::as.ram is called explicitly because `as.ram` is also the name
        # of the logical argument above.
        for(var in names(data))
            local({
                v <- var
                delayedAssign(v, ff::as.ram(data[[v]]), assign.env = env)
            })

        eval(substitute(expr), env)
    }
    else {
        eval(substitute(expr), physical(data), enclos = parent.frame())
    }
}

Finally, ?ffdfwith does not say that the length/number of rows of the result must be the same as the number of rows of the ffdf. It would also be good to explain that the processing is chunked and that the results are concatenated at the end (in some cases you may return a vector of the same length as the ffdf which nevertheless cannot be obtained by concatenating chunked results).
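A concrete instance of that caveat (a sketch with invented toy data; it assumes ffdfwith accepts a `by` chunk-size argument like the with(..., by=3) call earlier in this thread): cumsum() returns a vector of the full length, yet each chunk's running sum restarts, so the concatenation of chunk results is not cumsum() of the whole column:

```r
library(ffbase)
d <- as.ffdf(data.frame(a = rep(1, 10)))

# chunked evaluation: each chunk of 5 yields 1..5, so the appended
# result repeats 1..5 even though its length matches nrow(d)
ffdfwith(d, cumsum(a), by = 5)

# the intended answer, computed over the whole column in RAM
cumsum(d$a[])
```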

edwindj commented 11 years ago

@jwijffels : Did you check your scripts against the "with" branch? I would like to update CRAN, so if the "with" branch can be merged into the master that would be fine.

jwijffels commented 11 years ago

Edwin, I haven't found time yet. But I'm pretty sure it will break a looot of my scripts which I have running for clients. with is scattered all over the place in my scripts. And some basic use cases which I encounter a lot will just fail. E.g.

library(ffbase)
x <- expand.ffgrid(x = ff(factor(LETTERS)),
                   y = ff(seq.Date(Sys.Date(), Sys.Date() + 365, by = "day")))
with(x, y < Sys.Date() + 185)


nalimilan commented 11 years ago

Maybe we should wait long enough for you to be confident that you have fixed your code? It should just require replacing with() by ffdfwith(), but of course better to do some testing... Or you could start with a deprecation warning in the next release to help detect the cases that need to be changed, and only modify the behavior later.

jwijffels commented 11 years ago

I understand the change and the argument about consistency with what the documentation says about base::with, but the whole idea of ffbase is to make life easier, more like base R, so that the user can switch as easily as possible from base code to ffbase code. I think changing the current with to ffdfwith goes against that idea.

Let me be more clear. The basic example in an R textbook says: don't attach a data.frame, use with. Now the most basic example for which you use with is with(mydf, myFUN(cola, colb, colc)). It works perfectly in the current setting, but not in the new setting, in which we oblige the user to chunkify that myFUN first before it will work. To me, this makes things more complex.


nalimilan commented 11 years ago

I think we do not really have the same use cases... :-)

My use case is that rather than using attach(mydf); table(cola, colb), I use with(mydf, table(cola, colb)). The current behavior of with.ffdf() breaks this possibility. That's not the end of the world, but that's another thing to learn when switching to ffbase.

You seem to assume in your example that myFUN() returns a vector of the same length as the ffdf. That's a different use case. Apparently we cannot really suit both needs with the same function: the whole point is, which behavior should be the default (with.ffdf()), and which should require an explicit function (ffdfwith())? Or would it be better to switch the behavior using an argument to with.ffdf()?

jwijffels commented 11 years ago

Probably there are even other use cases of with besides with(mydf, myFUN(cola, colb, colc)) and with(mydf, table(cola, colb)). Looking at the examples in base::with, there are boxplot and glm, and I can imagine a lot of other, more complex functionality.

None of the latter, however, will work with the newly suggested version of with, nor with the old one; they only work if the data is completely in RAM. And correct me if I'm wrong, but I don't believe we can assume that these functions will work on ff vectors which are not in RAM. So why go down that path, when it will only work for basic things like table or summary, and only if they are implemented in ffbase, at the cost of not allowing with(mydf, myFUN(cola, colb, colc))?


edwindj commented 11 years ago

My 2 cents: the proposed new with is closest to the semantics of base R.

The Right Thing To Do would be to merge the with branch and advise users to use ffdfwith. However, that would cause many ffbase scripts to stop working (correctly), so I'm hesitant to do so. (maybe in ffbase2 :-) )

jwijffels commented 11 years ago

like roxygen2 or reshape2 or dplyr2 ;)

how about adding a global option like package ff does? options(ffbasewith = ...) and, depending on the option, use the with.ffdf or ffdfwith behaviour. That way the user can specify the behaviour in the global options and does not need to change his scripts?
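A sketch of what such an option switch could look like (the option name follows the suggestion above; the chunked helper is hypothetical, not an existing ffbase function):

```r
with.ffdf <- function(data, expr, ...) {
  # `ffbasewith` is the proposed (not yet existing) global option;
  # defaulting to the historical behaviour keeps existing scripts working
  if (identical(getOption("ffbasewith", "classic"), "classic")) {
    # hypothetical helper: the old chunk-evaluate-append implementation
    with_ffdf_chunked(data, substitute(expr), ...)
  } else {
    # base::with-like behaviour: evaluate once against the ff columns
    eval(substitute(expr), physical(data), enclos = parent.frame())
  }
}
```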


nalimilan commented 11 years ago

I think it would be nice to start with a deprecation phase (see e.g. [1]), with a new option allowing users to change the behavior but defaulting to the old one. Then, in a few weeks/months, change the default value of the option.

1: http://www.bioconductor.org/developers/how-to/deprecation/