RevolutionAnalytics / RHadoop

https://github.com/RevolutionAnalytics/RHadoop/wiki

Efficiency review #170

Closed · piccolbo closed this issue 11 years ago

piccolbo commented 11 years ago

An umbrella issue for speed optimizations of various kinds

piccolbo commented 11 years ago

Commits so far:

- In-memory combiner (the idea is sketched below): 14cda71e22ceb01d2487c758ff8827dc2d4ef1da
- New timings for benchmarks: ecea6167499b59d187bba0950f867146e120e1cf
- Eliminated output capture: e179eee315fa3ac8b44e53b778e4a4721fc8b250, 592f2a01cf3813c09c6a5a7739c149ed97b62259
- Read a minimum of 1MB of data: 1ba3ca193643c143251cae725b26c13b7625d84a
- Rewrite of the reduce loop: 697e1932c92eee8073e7f5665264dafc71b19119
- Two optimizations for important special cases: d45d383ea9f4eed111c9d79103722c0687534326, 7ef9c0a388b44b1718b3b9fea923e43609327130
- Decoupled the map and reduce vectorization settings: 5c6e1a11bfb17f6e05ed8d15a109daa52944170f
- New test case from the Cloudera blog: f693c9d07ed5ee4edd1dbfa08d276942165eee27

And a host of related changes
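
Roughly, the in-memory combiner works like this (a sketch, not the actual rmr code; `emit()` is a stand-in for writing to the map output, and the combine is assumed additive, whereas the real combiner applies the user-supplied combine function). Partial aggregates accumulate in an environment and are flushed in one pass, so far fewer records cross the shuffle:

```r
make.combiner = function(emit, flush.size = 10000) {
  acc = new.env(hash = TRUE)
  flush = function() {
    for (k in ls(acc)) {
      kv = get(k, envir = acc, inherits = FALSE)
      emit(kv$key, kv$value)
    }
    rm(list = ls(acc), envir = acc)
  }
  add = function(key, value) {
    k = paste(deparse(key), collapse = "")      # crude key digest
    prev =
      if (exists(k, envir = acc, inherits = FALSE))
        get(k, envir = acc, inherits = FALSE)$value
      else 0
    assign(k, list(key = key, value = prev + value), envir = acc)
    if (length(ls(acc)) >= flush.size) flush()  # bound memory use
  }
  list(add = add, flush = flush)
}

# Usage: sum word counts locally before they hit the shuffle.
emitted = list()
comb = make.combiner(function(k, v) emitted[[k]] <<- v)
for (w in c("a", "b", "a", "a", "b")) comb$add(w, 1)
comb$flush()   # emitted now holds a = 3, b = 2
```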

piccolbo commented 11 years ago

Can't list all the commits, but additional work went toward defining a vectorized version of reduce and a deep refactor of the C deserialization code, eliminating all unnecessary calls out to the R interpreter and providing a better foundation for extensions (lots of duplicate code gone). The idea is to progressively replace R serialization while maintaining compatibility. On my laptop we are within 6x of Java, but in a more realistic setting (EC2, 5 nodes) we are within 1.2x of Java, based on the collocations.R example taken from the Cloudera blog, a difficult task because of the large number of small reduce groups. Achieving this performance required writing 10 lines of C++ to speed up the execution of a large number of small sums, `sapply(list.of.vecs, sum)`.
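
For reference, an equivalent of that helper can be written with Rcpp (a sketch under my own assumptions; the actual code lives in the package's C layer and doesn't have to look like this). The point is that `sapply(list.of.vecs, sum)` pays interpreter and dispatch overhead once per group, while one compiled loop pays it once in total:

```r
library(Rcpp)

# One compiled pass over the whole list of vectors, returning the
# per-vector sums, instead of an R-level call per element.
cppFunction('
NumericVector sum_each(List xs) {
  int n = xs.size();
  NumericVector out(n);
  for (int i = 0; i < n; i++)
    out[i] = sum(as<NumericVector>(xs[i]));
  return out;
}')

# Many tiny groups, the pattern the collocations example stresses.
list.of.vecs = replicate(100000, runif(3), simplify = FALSE)
stopifnot(all.equal(sum_each(list.of.vecs), sapply(list.of.vecs, sum)))
```

With hundreds of thousands of three-element vectors, the compiled version saves one R function call per group, which is where time goes when reduce groups are this small.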