Open nalimilan opened 11 years ago
Hi, I think with.ffdf does not handle objects of class table. It only handles expressions that return vectors, factors, Dates, POSIXct and data.frames, but currently not tables or other objects. I think that might be the reason you are finding differences between your 2 ffdf objects, as you use with.ffdf together with table.
This makes sense, I believe, because if you used with.ffdf to get another class, say lm or glm, you would need a way to combine the results of the different chunks (like combining the results of lm or glm objects).
If your purpose is to test whether merge.ffdf did its job correctly, maybe you can construct other tests apart from using table inside with.ffdf. Maybe use table.ff directly, or table.ff(as.character(yourffvector))?
Correct me if I'm wrong here.
No, my purpose was, well... to create cross tables. ;-)
I don't think the problem is specific to tables, as you can see that with(couples, C01AN[54356]) returns an ff vector (why not just one value, I don't understand) with the first value being NA, which is not the case with with(adu, C01AN[54356]).
If you just want cross tables, table.ff(as.character(couples$C01AN), as.character(adu$C01AN)) will do.
Apart from that cross classification:
Basically, with.ffdf was designed with adding another column to an ffdf in mind, a column with the same length as the ffdf. In with.ffdf, there is this code before the chunking starts:
length(res) <- nrow(data)
which basically reflects the idea that it should return something of the same length.
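As a plain base-R aside (not ffbase code), this is all that assignment does: it pre-sizes the result vector, padding any new slots with NA, before the chunk results are filled in:

```r
# What `length(res) <- n` does: grow (or truncate) a vector in place,
# padding new slots with NA.
res <- c(1, 2, 3)
length(res) <- 5
res  # 1 2 3 NA NA
```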
Maybe we should add checks which stop the execution of with.ffdf if the expression returns something other than vectors, factors, Dates, POSIXct and data.frames, or if the total length of the returned object does not equal the nrow or length of the original data.
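Such a guard could look roughly like this (a sketch in plain R; check_with_result is a made-up name, not ffbase code):

```r
# Hypothetical guard (sketch): validate the type and total length of what
# the expression returned before chunks are glued together.
check_with_result <- function(res, n) {
  ok_type <- is.atomic(res) || is.factor(res) || is.data.frame(res) ||
    inherits(res, "Date") || inherits(res, "POSIXct")
  len <- if (is.data.frame(res)) nrow(res) else length(res)
  if (!ok_type || len != n)
    stop("with.ffdf: expression must return a vector/factor/Date/POSIXct/",
         "data.frame of length/nrow ", n, ", got ", class(res)[1],
         " of length ", len)
  invisible(res)
}
```

Note that a 1D table is atomic, so the type test alone would let it through; only the length comparison catches it.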
Sure, I can use table.ff(), and I do most of the time. The with() syntax is just a habit I have taken to save typing the name of the data frame repeatedly. I don't think with.ffdf() should necessarily work in that case, but if it does it must return correct results.
Your suggestion of checking the length and type of the return value makes sense. But maybe you could simply return the RAM object when it does not fulfill the conditions. This would allow both uses of with.ffdf(). Anyway, if it does not support all common use cases of with(), I think the docs need to be more explicit about that.
with.ffdf indeed assumes that the expression returns something that can be stored as an ff (or ffdf) object; other cases don't work as they would with a normal R object. I will adjust the documentation. I could add a check on whether the with statement returns an atomic vector or data.frame (which will be converted to an ff or ffdf object). What's your opinion on that?
Fixed it in github.
Could you please check if it works as expected? It passes our test cases, but we may have forgotten some.
Best
Thanks, but it does not seem to fix the problem, since a 1D table is atomic. So I get the same result, without any error:
identical(with(adu, table(C01AN)), with(couples, table(C01AN)))
[1] FALSE
I think you should check that the length of the result is equal to the length of the ffdf, as you said. Also, I don't think the docs should say that results might not be correct: the function should simply raise an error. No need to scare people by stating that results might be wrong... and an error is safer.
But what about returning the R object without converting it to an ff vector when it does not have the correct length? I don't see how this could hurt...
Changed it to stop when that happens. The problem is that the code is processed chunk-wise, and in general it is not possible to know how to combine the results of each chunk. The code assumes it can append them (whether that makes sense depends entirely on the expression that was given to with.ffdf).
However, there is a (very unlikely, I admit) case where this will fail: when the length of the result equals the length of each chunk, but not the total length of the ffdf. E.g. if your table example happens to return a table of the same length as each chunk, then the result will still be wrong. Do you have any suggestions on how to detect that one?
OK, the new code seems to catch the table case.
I don't really understand how each chunked result could have the same length as its chunk, but the final result not have the same length as the ffdf. I guess it depends on how you combine the chunked results. But then why can't you compare the size of the final result to that of the ffdf? I'm missing something...
Well it is unlikely, but the following can happen:
iris.ffdf <- as.ffdf(iris)
with(iris.ffdf, table(Species)) # fails because the table has length 3, not nrow(iris.ffdf)
with(iris.ffdf, table(Species), by = 3) # succeeds because the chunk size equals the table size, but the result is wrong...
table(iris.ffdf$Species)
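The same failure mode can be reproduced in plain R, without ff at all, by chunking iris$Species in chunks of size 3 (the chunking helper here is purely illustrative): each per-chunk table also happens to have length 3, so appending the chunk results gives the right total length but the wrong content.

```r
# Plain-R sketch of the pitfall: chunk size 3, and each chunk's table also
# has length 3 (one count per Species level), so appending the per-chunk
# tables passes any length check while producing the wrong result.
species <- iris$Species
chunks <- split(species, gl(50, 3))    # 50 chunks of 3 rows each
res <- unlist(lapply(chunks, table))   # "append" the per-chunk results
length(res) == length(species)         # TRUE: a length check cannot catch this
identical(as.vector(res), as.vector(table(species)))  # FALSE: wrong result
```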
OK, I see how this works now. But I still don't get why with.ffdf() does not return a vector concatenating the tables obtained for all chunks of the ffdf. This would at least make obvious that something did not go as expected.
After using ffbase in more situations, and getting bitten once again, I really think with.ffdf() as it currently behaves should be renamed to something else. In base R, with() imposes no restriction on the dimensions of the return value, and nothing in the docs suggests that the return value should be similar to the input data frame. Thus, users' expectations about with.ffdf() are likely to lead to misunderstandings just like mine. with.ffdf() should just evaluate its arguments in an environment containing the variables of the ffdf object, and return the result whatever its dimensions.
I'm not even sure the current behavior is really useful: if you want to output a ff vector of the same length as the ffdf, you are likely to be modifying the ffdf, and thus you can use within.ffdf(), which is better suited for this case.
I understand and agree. The thing is that the with.ffdf() you wish for is very difficult to implement efficiently: it is very inefficient for large ffdf's (since chunking is not possible). I will mark with.ffdf as deprecated, since some users currently depend on it.
Yeah, unfortunately it may not be efficient. But for most use cases within.ffdf() is.
One case that would have been nice to handle is making table.ff(df$a, df$b) as efficient as with.ffdf(df, table(a, b)). But for that you would need two versions of with(): one which converts ffdf columns to R vectors, and another which keeps them as ff vectors. Maybe with.ffdf() and ffwith()? Maybe not worth the pain?
I personally use with.ffdf a lot and find it extremely useful, especially when working on groups of columns of an ffdf in one operation, in a chunked manner. That is its most useful use case, as in with(myffdf[c("mycolumnawhichineed", "mycolumnbwhichineed", "mycolumncwhichineed")], somedifficultfunctionwhichreturnsavectororanffdf(mycolumnawhichineed, mycolumnbwhichineed, mycolumncwhichineed)). Currently with.ffdf works if the expression returns values of the same length or nrow as the input data. with.ffdf is basically slow if you do not indicate the columns you need in the operation, as it will then load all the data of the ffdf in RAM, while you only need the specific columns.
groeten/kind regards, Jan
Jan Wijffels Statistical Data Miner www.bnosac.be | +32 486 611708
But what are you doing with the returned ff vector? Don't you add it to the ffdf?
I agree that with.ffdf() will be slow if it is changed to work with any return value, but I think we can avoid converting all columns to RAM objects by using delayedAssign(). With this simple solution, only the columns that are actually accessed will be converted to RAM.
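For what it's worth, here is a minimal plain-R sketch of that lazy-loading idea (no ff involved; all names are made up): each "column" is materialized only when the expression actually touches it.

```r
# Sketch: promises created by delayedAssign() load a column only on first
# access; untouched columns are never materialized.
loaded <- character(0)                 # records which columns got loaded
make_env <- function(cols) {
  env <- new.env()
  for (nm in names(cols)) local({
    v <- nm                            # capture the current loop value
    delayedAssign(v, {
      loaded <<- c(loaded, v)          # stand-in for an expensive load
      cols[[v]]
    }, assign.env = env)
  })
  env
}
e <- make_env(list(a = 1:5, b = 6:10))
sum(get("a", envir = e))               # forces only `a`
loaded                                 # "a": `b` was never loaded
```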
Most of the time they are added to the ffdf, but sometimes I just need the ff vector. I don't know in which setting you are working, but in mine I already have RAM issues when I load one ff vector of the ffdf completely into RAM. Will this be solved by delayedAssign()?
In my typical use cases I have from 100K to 5M rows, so a few columns can be loaded in RAM without any problem. But keeping everything in RAM all the time makes R die sooner or later due to memory fragmentation, and it means my computer has almost no free RAM left. I understand that in many other use cases loading even a single column into RAM is absolutely not possible. delayedAssign() will not solve this at all.
If you sometimes need to get an ff vector of the same length as the ffdf's columns, but not assign it to a column of the ffdf, then indeed the current behavior of with.ffdf() is still needed. Maybe we can give it another name, as I suggested above.
I'm not so much in favor of renaming with.ffdf or changing its code as you describe, as it will give RAM issues for cases where one column does not fit into RAM.
I created a separate branch, "with", in which I implemented "standard" with behavior.
Could you both please check whether that branch works as expected?
All the tests in the code seem to work, except for one place with character operations (paste); I fixed that by replacing with() with ffdfwith(), which by the way does efficient chunking...
If branch "with" works as expected, I will merge it into the master branch. However this will be a breaking change, so I hope our users won't be too affected by that.
Thanks, this seems to work. So this solution does not imply converting columns to RAM objects at all, which is interesting when using functions like table() or sum() that have .ff versions. There are thus no worries about columns not fitting into RAM.
Maybe, while we're at it, an optional as.ram argument could still be useful for cases where .ff functions are not available? I'm not sure. Anyway, here is an implementation using delayedAssign():
with.ffdf <- function(data, expr, ..., as.ram = FALSE) {
  if (as.ram) {
    env <- new.env()
    for (var in names(data)) {
      # local() captures the current value of `var`; without it, every
      # promise would see the last value of the loop variable when forced
      local({
        v <- var
        delayedAssign(v, as.ram(data[[v]]), assign.env = env)
      })
    }
    eval(substitute(expr), env, enclos = parent.frame())
  } else {
    eval(substitute(expr), physical(data), enclos = parent.frame())
  }
}
Finally, ?ffdfwith does not say that the length/number of rows of the result must be the same as the number of rows of the ffdf. It would also be good to explain that the processing is chunked and that the results are concatenated at the end (in some cases you may return a vector of the same length as the ffdf, but one which cannot be obtained by concatenating chunked results).
@jwijffels : Did you check your scripts against the "with" branch? I would like to update CRAN, so if the "with" branch can be merged into the master that would be fine.
Edwin, I haven't found time yet. But I'm pretty sure it will break a lot of my scripts which I have running for clients: with is scattered all over the place in them. And some basic use cases which I encounter a lot will just fail. E.g.
library(ffbase)
x <- expand.ffgrid(x = ff(factor(LETTERS)), y = ff(seq.Date(Sys.Date(), Sys.Date() + 365, by = "day")))
with(x, y < Sys.Date() + 185)
Maybe we should wait long enough for you to be confident that you have fixed your code? It should just require replacing with() with ffdfwith(), but of course better to do some testing... Or you could start with a deprecation warning in the next release to help detect the cases that need to be changed, and only modify the behavior later.
I understand the change and the argument about consistency with what the documentation says about base::with, but the whole idea of ffbase is to make life easier, more like base R, such that the user can switch as easily as possible from base code to ffbase code. I think changing the current with to ffdfwith goes against that idea.
Let me be more clear. The basic example in an R textbook says: don't attach a data.frame, use with. Now the most basic example for which you use with is with(mydf, myFUN(cola, colb, colc)). This works perfectly in the current setting, but not in the new one, in which we oblige the user to chunkify myFUN first before it will work. To me, this makes things more complex.
I think we do not really have the same use cases... :-)
My use case is that rather than using attach(mydf); table(cola, colb), I use with(mydf, table(cola, colb)). The current behavior of with.ffdf() breaks this possibility. That's not the end of the world, but it's another thing to learn when switching to ffbase.
You seem to assume in your example that myFUN() returns a vector of the same length as the ffdf. That's a different use case. Apparently we cannot really suit both needs with the same function: the whole point is, which behavior should be the default (with.ffdf()), and which should be accessed through an explicit function (ffdfwith())? Or would it be better to switch the behavior using an argument to with.ffdf()?
Probably there are even other use cases of with besides with(mydf, myFUN(cola, colb, colc)) and with(mydf, table(cola, colb)). Looking at the examples in base::with, there is boxplot, glm, and I can imagine a lot of other, more complex functionality.
None of these, however, will work with the newly suggested version of with, nor with the old one; they only work if the data is completely in RAM. And correct me if I'm wrong, but I don't believe we can assume these functions will work on ff vectors which are not in RAM. So why go down that path, when it will only work for basic things like table or summary, and only if they are implemented in ffbase, at the cost of no longer allowing with(mydf, myFUN(cola, colb, colc))?
My 2 cents: the proposed new with is closest to the semantics of base R.
The Right Thing To Do would be to merge the with branch and advise users to use ffdfwith. However, that would cause many ffbase scripts to stop working (correctly), so I'm hesitant to do so. (Maybe in ffbase2 :-) )
like roxygen2 or reshape2 or dplyr2 ;)
How about adding a global option like package ff does? options(ffbasewith = ...), and depending on the option, do the appropriate with.ffdf or ffdfwith. That way the user can specify the behaviour in the global options and does not need to change his scripts.
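A rough sketch of that option-based dispatch (the option name ffbase.with and both backend functions are invented for illustration):

```r
# Sketch: choose the behaviour from a global option, defaulting to the
# current (legacy) semantics so existing scripts keep working.
with_legacy   <- function(data, expr) "legacy chunked semantics"
with_standard <- function(data, expr) "base-R with() semantics"
with_dispatch <- function(data, expr) {
  switch(getOption("ffbase.with", "legacy"),
         legacy   = with_legacy(data, expr),
         standard = with_standard(data, expr),
         stop("invalid value for option 'ffbase.with'"))
}
with_dispatch(NULL, NULL)              # defaults to the legacy behaviour
options(ffbase.with = "standard")
with_dispatch(NULL, NULL)              # now uses the new behaviour
```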
I think it would be nice to start with a deprecation phase (see e.g. [1]), with a new option allowing the behavior to be changed, but defaulting to the old one. Then, in a few weeks/months, change the default value of the option.
1: http://www.bioconductor.org/developers/how-to/deprecation/
This looks very serious.
couples is an ffdf which is the result of merging adu with another ffdf, using all.x=TRUE. Thus, all the variables below are the same in couples and adu, and the number of rows is the same.
Note how value 1894 is missing from the output of with(couples, table(C01AN[])), while it is present in that of table(couples$C01AN[]), and actually in the vector itself.
More interesting:
So it looks like value 54356 is the first one that is different from with.ffdf()'s point of view. And indeed:
So something is broken in the chunking logic of with.ffdf(). Do you think there might be more bugs like that lurking around? That's pretty annoying when you run analyses using it (or other features of ffbase)...