Closed ledell closed 6 years ago
Thanks for the feedback. We have updated the presentation deck (slide #61) to acknowledge that loading data using h2o.importFile() from the h2o package is likely more efficient than the approach shown in the slide for h2o.
Hi, I saw some of the benchmarks blogged about here, taken from a recent Strata presentation slide deck.
There are major flaws in your benchmarking of H2O:
The point of using H2O's Sparkling Water (and rsparkling if you are using R) is to interact with data already in the Spark cluster. When you have data on disk, you should be using the h2o.importFile() function (to do a parallel read from disk into the H2O cluster) and the h2o package for modeling. There is no need to use rsparkling at all.
Loading from disk into Spark, then from Spark into H2O, is an unnecessary step, and doing so misrepresents the computational efficiency of H2O as compared to the other tools in this benchmark. In the interest of honest & accurate benchmarking practices, it would be great if you could revise the benchmark to reflect this. If you have any questions on how to do this, please let me know.
All you need to do is load the data from disk using h2o.importFile() and then execute these rows of the benchmark. You can also compute performance directly in H2O using h2o.performance() rather than generating predicted values using h2o.predict(). There is nothing wrong with generating the predictions and calculating performance metrics manually; it's just faster if you use H2O's h2o.performance() function. To most efficiently write the predictions back to disk, you should be using the h2o.exportFile() function.
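For reference, here is a minimal sketch of that workflow using only the h2o package. It is not the benchmark's code. The synthetic data, the column names (x1, x2, label), the temporary file paths, and the choice of h2o.gbm() with default settings are all assumptions made so the example runs on its own. Substitute the benchmark's files and model.

```r
library(h2o)

# Start (or connect to) a local H2O cluster.
h2o.init()

# Stand-in data written to temporary CSV files (assumption: the real
# benchmark would point at its own files on disk instead).
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$label <- ifelse(df$x1 + rnorm(200) > 0, "yes", "no")
train_path <- tempfile(fileext = ".csv")
test_path  <- tempfile(fileext = ".csv")
write.csv(df[1:150, ],   train_path, row.names = FALSE)
write.csv(df[151:200, ], test_path,  row.names = FALSE)

# Parallel read from disk straight into the H2O cluster; no Spark involved.
train <- h2o.importFile(train_path)
test  <- h2o.importFile(test_path)

# Hypothetical binary response column; treat it as categorical.
y <- "label"
x <- setdiff(names(train), y)
train[, y] <- as.factor(train[, y])
test[, y]  <- as.factor(test[, y])

# Illustrative model only; the benchmark's algorithm and settings may differ.
model <- h2o.gbm(x = x, y = y, training_frame = train)

# Metrics computed inside the cluster, without pulling predictions into R.
perf <- h2o.performance(model, newdata = test)
auc  <- h2o.auc(perf)
print(auc)

# If the predictions themselves are needed on disk, export them from H2O.
preds    <- h2o.predict(model, newdata = test)
out_path <- tempfile(fileext = ".csv")
h2o.exportFile(preds, path = out_path, force = TRUE)
```

In a timing comparison, the h2o.importFile() calls would be the data-loading step to measure, in place of the load into Spark followed by the conversion to H2O.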