edwindj / ffbase

Basic (statistical) functionality for R package ff
github.com/edwindj/ffbase/wiki

"sum" is 10x slower than expected #52

Open archenemies opened 8 years ago

archenemies commented 8 years ago

Since 'ff' uses mmap, I would have thought that the implementation of sum in 'ffbase' would be pretty simple - just iterate over the mmap'ed address range while updating an accumulator.

However, benchmarking sum of a long vector seems to indicate that something more complicated is happening:

> x1=rep(0.0,1e8)
> x2=as.ff(x1)
> system.time(sum(x1))
   user  system elapsed 
  0.077   0.003   0.079 
> system.time(sum(x2))     # ran this 3 times
   user  system elapsed 
  0.527   0.333   0.859 
  0.487   0.880   1.369 
  0.440   0.507   0.946 

I noticed the large value for 'system' time, so I attached strace to the R process, and ran sum(x2) again. I found that mmap is being called 12208 times! For sum(x1), mmap was not called at all...

Note that the 0.079 seconds used by sum(x1) is almost exactly the time required to cat the 'ff' file which backs x2. Thus the file is located entirely in the page cache, and this 0.079 seconds simply represents the time required to read 763MB of RAM data into the CPU caches.
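As a quick sanity check on those numbers (plain R, my own arithmetic, not part of the original benchmark): 1e8 doubles at 8 bytes each is about 763 MiB, and reading that in 0.079 s works out to roughly 9.4 GiB/s, which is a plausible memory-bandwidth figure for data already in the page cache.

```r
# Back-of-the-envelope check of the figures quoted above.
bytes <- 1e8 * 8                 # 1e8 doubles, 8 bytes each
mib   <- bytes / 2^20            # ~762.9 MiB, i.e. the "763MB" above
gibps <- (bytes / 2^30) / 0.079  # ~9.4 GiB/s effective read rate
```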

I then downloaded the source of ffbase, but I could not find where sum.ff is defined, so I was not able to investigate further today. By the way, this question of efficiency arose out of an email exchange with 'ff' author Jens Oehlschlägel.

Thanks for your library, and thank you in advance for assistance with this issue.

edwindj commented 8 years ago

Thanks for this issue :-). You are completely right. The current ffbase implementation of sum (and the like) is an "all R" implementation: each ff vector is loaded chunk by chunk and the processing takes place on these chunks, so each data chunk is copied into an R vector (hence the current performance). This was done to get behavior identical to normal R vectors (things like NA handling etc.). The implementation is in https://github.com/edwindj/ffbase/blob/master/pkg/R/Summary_ff.R
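For readers who want to see the shape of that strategy without installing ff, here is a minimal sketch in plain R (my own illustration, not the actual Summary_ff.R code). The chunk extraction on each iteration is where the copying, and hence the overhead, happens:

```r
# Chunk-wise sum over a vector, mimicking the all-R strategy:
# each chunk is materialized as an ordinary R vector and summed,
# and the partial sums are combined. With an ff vector, x[s:e]
# is the copy from the mmap'ed file into R memory.
chunked_sum <- function(x, chunk_size = 1e5L, na.rm = FALSE) {
  total <- 0
  n <- length(x)
  for (s in seq(1L, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1L, n)
    chunk <- x[s:e]                      # the per-chunk copy
    total <- total + sum(chunk, na.rm = na.rm)
  }
  total
}
```

Because each chunk goes through sum(), NA propagation and na.rm behave exactly as they do for normal R vectors, which is the compatibility the all-R approach buys.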

As you rightly assume, this can be made faster by implementing a C/C++ function with an accumulator, which I should do, but have not had the time to. (Will do in a couple of weeks...)

Best regards,

Edwin

archenemies commented 8 years ago

Thank you for your reply, Edwin.

Before you implement the faster version of 'sum', I also wanted to broach the possibility of encapsulating an 'mmaped' ff vector as a native R vector. Apparently the data in an R vector is stored contiguously in RAM, but is prefixed by a short header which contains the length of the vector as well as some other information:

https://cran.r-project.org/doc/manuals/r-release/R-ints.html#The-_0027data_0027

The extra header data could pose a problem with 'mmap' - either the 'ff' format would have to be changed to include the header, or you'd have to separately mmap an extra page with the header at the end of it, before the 'ff' file data, to simulate an R vector.

The other problem is figuring out when R does things to the vector, such as freeing it or resizing it, which 'ff' would need to know about. At the very least, you could set read-only permissions on the memory region(s) containing the vector data, and trap SIGSEGV - but this might not be necessary.

Aside from these issues it seems like it should not be too difficult to produce something which appears to be a native R vector but which is actually backed by a flat file.

Although the implementation might require learning about R internals, it might end up being much cleaner in the end - by allowing you to avoid re-implementing every function like 'sum' which is already defined on native R vectors. It would also make 'ffbase' usable by third-party packages containing C code that operates on native R vectors.

Even if you are not able to simulate a mutable R vector, note that 'sum' and many other functions do not modify their input. So if someone were to implement a function 'ff_vector_to_immutable_r_vector', then you could implement 'sum' and friends by wrapping their arguments with such a function, and calling the native R versions...
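To make that wrapper idea concrete, here is a sketch. The helper name comes from the paragraph above but does not exist anywhere; the stand-in body below simply copies the data, so only the wrapper shape is real. A genuine implementation would expose the mmap'ed pages as a read-only R vector with no copy:

```r
# Hypothetical helper: a real version would return the ff file's
# mmap'ed pages wrapped as a read-only R vector, zero-copy. This
# stand-in just materializes a copy for illustration.
ff_vector_to_immutable_r_vector <- function(x) x[]

# 'sum' and friends could then delegate to the native R versions,
# with no chunk loop and no per-chunk copies:
sum_via_wrapper <- function(x, ...) {
  v <- ff_vector_to_immutable_r_vector(x)
  sum(v, ...)
}
```

The same pattern would cover any non-mutating primitive (mean, min, max, which, ...), since none of them need write access to the input.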

edwindj commented 8 years ago

Thanks for exploring these options. I think an mmap'ed version of R vectors is a great idea; however, the problems you describe might be very tricky. I'm certainly open to helping you (I'm a bit time-constrained this month).

joehl commented 8 years ago

R's internal vectors are managed by R's garbage collector. I have no clue how one could wrap mmap'ed memory into R's vectors without ending up in disaster. That's why we actually copy from mmap'ed memory into R's memory before using it (such as for summing). Maybe you want to check the R internals for a possibility to deactivate garbage collection on certain SEXPs? If that's possible, one could indeed implement something much faster. However, that would be an issue affecting ff rather than ffbase.

archenemies commented 8 years ago

I guess you would want to create "fake" R vectors with "refs" set to 2, so that they don't get garbage collected. I learned about "refs" here:

http://adv-r.had.co.nz/memory.html#modification

If you ensured that such "fake" vectors were only created and used in wrapper functions, then you could handle deleting them at the end of the wrapper (this would just mean unmapping some memory). That's not as nice as having an actual vector object to work with, but apparently you can't attach a finalizer to a plain vector:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/reg.finalizer.html

Another approach would be to have the ff_vector clean up the fake vector, and just remind users that they need to keep a reference to the ff_vector as long as the fake vector is being used. I think the second approach would be nicer because it would be more efficient - you only have to associate a single fake R vector with each ff_vector, rather than constructing a new one every time a primitive function is called. For instance, store the fake vector as a "vec" attribute in the ff_vector object, and remind users to always refer to it in that way:

> f <- ff(0:1, length=36e6, filename="./v3")
> sum(f@vec)    # users can do this
[1] 18000000
> v <- f@vec    # this is discouraged
...

I could agree that it sounds somewhat messy. Maybe there are some pitfalls to this approach that I'm not thinking of. I'm not very knowledgeable about R internals, or even the internals of ff or ffbase! Anyway, I hope that my suggestion provided some potentially useful ideas.

joehl commented 8 years ago

Frederik,

What about creating a prototype package with the minimal machinery (creation, a usage example, destruction with a finalizer) and seeing what R-devel says? Ultimately, the CRAN maintainers would need to accept it. I actually doubt that they would accept such hackery, but maybe you can make it safe enough to convince them. Once that is clarified, we could see whether future versions of ff (or a similar package) could go such a route.

Kind regards

Jens Oehlschlägel

archenemies commented 8 years ago

Thanks for the suggestion, Jens. I'll think about it - not sure if I'll have time, but it would be a cool project, probably right at the limit of my skills.

Meanwhile, I guess you can implement an ffbase version of every primitive function in R. You'll probably win. :)