Azure-Samples / Azure-MachineLearning-DataScience

Creative Commons Attribution 4.0 International

H2O performance comparison has major flaws #42

Closed ledell closed 6 years ago

ledell commented 7 years ago

Hi, I saw some of the benchmarks blogged about here from a recent Strata presentation slidedeck.

There are major flaws in your benchmarking of H2O:

The point of using H2O's Sparkling Water (and rsparkling, if you are using R) is to interact with data already in a Spark cluster. When the data is on disk, you should instead use the h2o.importFile() function (which does a parallel read from disk into the H2O cluster) and the h2o package for modeling. There is no need to use rsparkling at all.

Loading from disk into Spark, and then from Spark into H2O, is an unnecessary step, and doing so misrepresents the computational efficiency of H2O relative to the other tools in this benchmark. In the interest of honest & accurate benchmarking practices, it would be great if you could revise the benchmark to reflect this. If you have any questions on how to do this, please let me know.

All you need to do is load the data from disk using h2o.importFile() and then execute those rows of the benchmark. You can also compute performance directly in H2O using h2o.performance() rather than generating predicted values with h2o.predict(); there is nothing wrong with generating the predictions and calculating performance metrics manually, it is just faster to use H2O's h2o.performance() function. To write the predictions back to disk most efficiently, use the h2o.exportFile() function.
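For illustration, here is a minimal R sketch of the workflow described above, using only the h2o package with no Spark in the loop. The file paths, the "label" column name, and the choice of h2o.gbm() are placeholders, not details from the benchmark:

```r
library(h2o)

# Start (or connect to) an H2O cluster
h2o.init()

# Parallel read from disk straight into the H2O cluster -- no Spark involved
# (paths and column names below are hypothetical)
train <- h2o.importFile("path/to/train.csv")
test  <- h2o.importFile("path/to/test.csv")

# Fit any H2O model; h2o.gbm() is just an example
fit <- h2o.gbm(x = setdiff(names(train), "label"),
               y = "label",
               training_frame = train)

# Compute metrics directly in H2O instead of predicting and scoring manually
perf <- h2o.performance(fit, newdata = test)
print(perf)

# If predictions are needed on disk, export them in parallel from the cluster
pred <- h2o.predict(fit, test)
h2o.exportFile(pred, path = "path/to/predictions.csv")
```

Note that h2o.performance() avoids materializing a prediction frame at all when only metrics are needed, which is why it is the faster path for timing the scoring step.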

mezmicrosoft commented 7 years ago

Thanks for the feedback. We have updated the presentation deck (slide #61) to acknowledge that loading data using h2o.importFile() from the h2o package is likely more efficient than the approach provided in the slide for H2O.