fonhorst / LightAutoML_Spark


H2O AutoML #9

Closed fonhorst closed 2 years ago

vipmaax commented 2 years ago

GitHub stars: 5.6k
Last release: v3.34.0.3-1 (2021-10-08)
Languages: Java, Python, R
Supported data types vary per algorithm:
• All H2O-3 algos accept numerical and categorical data.
• Word2Vec accepts text data.

H2O supports the following supervised algorithms:
• AutoML: Automatic Machine Learning
• Cox Proportional Hazards (CoxPH)
• Deep Learning (Neural Networks)
• Distributed Random Forest (DRF)
• Generalized Linear Model (GLM)
• Maximum R Square Improvements (MAXR)
• Generalized Additive Models (GAM)
• ANOVA GLM
• Gradient Boosting Machine (GBM)
• Naïve Bayes Classifier
• RuleFit
• Stacked Ensembles
• Support Vector Machine (SVM)
• XGBoost

H2O supports the following unsupervised algorithms:
• Aggregator
• Generalized Low Rank Models (GLRM)
• Isolation Forest
• Extended Isolation Forest
• K-Means Clustering
• Principal Component Analysis (PCA)

Sparkling Water integrates H2O's fast scalable machine learning engine with Spark. It provides:

• Utilities to publish Spark data structures (RDDs, DataFrames, Datasets) as H2O's frames and vice versa.
• DSL to use Spark data structures as input for H2O's algorithms.
• Basic building blocks to create ML applications utilizing Spark and H2O APIs.
• Python interface enabling use of Sparkling Water directly from PySpark.
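
A minimal sketch of the Spark-to-H2O frame conversion from PySpark, assuming a Sparkling Water 3.x installation; the `H2OContext`, `asH2OFrame`, and `asSparkFrame` names follow the PySparkling documentation, but exact signatures can differ between versions:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext  # provided by the h2o_pysparkling_* package

spark = SparkSession.builder.appName("sw-conversion-demo").getOrCreate()
hc = H2OContext.getOrCreate()  # starts (or attaches to) the H2O cluster

# Any Spark DataFrame can be published as an H2O frame ...
spark_df = spark.createDataFrame([(1.0, "a"), (2.0, "b")], ["x", "y"])
h2o_frame = hc.asH2OFrame(spark_df)

# ... and H2O frames can be converted back to Spark DataFrames.
spark_df_back = hc.asSparkFrame(h2o_frame)
spark_df_back.show()
```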

https://docs.h2o.ai/sparkling-water/3.1/latest-stable/doc/index.html

Sparkling Water backends:
• Internal backend. The easiest solution to deploy; however, when Spark or YARN kills an executor, the whole H2O cluster goes down, since H2O does not support high availability.
• External backend. The H2O cluster runs separately from the rest of the Spark application. This separation gives more stability, because the application is no longer affected by Spark executors being killed, which in the internal mode brings the H2O cluster down as well. There are two deployment strategies for the external cluster: manual and automatic. In manual mode, the H2O cluster has to be started before connecting to it; in automatic mode, the cluster is started automatically based on the configuration (currently only in YARN environments). In Hadoop environments, the cluster is created by a simple process called the H2O driver.
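
For illustration, the backend is chosen through `spark.ext.h2o.*` configuration properties. A hedged sketch of configuring an automatically started external backend from PySpark; the property names follow the Sparkling Water docs, while the cluster size and app name are placeholder assumptions, and auto start typically also needs an H2O driver configured:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext

# Backend selection is driven by spark.ext.h2o.* properties
# (values below are illustrative, not a recommended production setup).
spark = (
    SparkSession.builder
    .appName("sw-external-backend-demo")
    .config("spark.ext.h2o.backend.cluster.mode", "external")  # default is "internal"
    .config("spark.ext.h2o.external.start.mode", "auto")       # "auto" (YARN only) or "manual"
    .config("spark.ext.h2o.external.cluster.size", "3")        # number of external H2O nodes
    .getOrCreate()
)

# With the external/auto configuration above, getOrCreate() asks the H2O driver
# to launch the external H2O cluster and then connects the Spark app to it.
hc = H2OContext.getOrCreate()
```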

Sparkling Water can also be executed inside a Kubernetes cluster; Kubernetes is supported since Spark version 2.4.

Sparkling Water provides Scala and Python APIs for H2O AutoML.
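
A minimal sketch of running H2O AutoML through the PySparkling Spark ML-style API, assuming Sparkling Water 3.x; the `H2OAutoML` estimator and the `labelCol`/`maxModels`/`seed` parameters come from the `pysparkling.ml` package, while the toy DataFrame and column names are purely illustrative:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext
from pysparkling.ml import H2OAutoML

spark = SparkSession.builder.appName("sw-automl-demo").getOrCreate()
hc = H2OContext.getOrCreate()

# Illustrative training data; in practice this would be a real Spark DataFrame.
train_df = spark.createDataFrame(
    [(1.0, 2.0, "no"), (2.0, 1.0, "yes"), (3.0, 4.0, "no"), (4.0, 3.0, "yes")],
    ["f1", "f2", "label"],
)

# H2OAutoML behaves like a Spark ML estimator: fit() returns a model that can
# then be applied to Spark DataFrames with transform().
automl = H2OAutoML(labelCol="label", maxModels=10, seed=1)
model = automl.fit(train_df)
predictions = model.transform(train_df)
predictions.show()
```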

vipmaax commented 2 years ago
(attached screenshot, 2021-10-21 17:10)

vipmaax commented 2 years ago

A.2 Scalability & Parallelism To scale training to datasets that cannot fit inside the memory (RAM) of a single machine, you simply add compute nodes to your H2O cluster. The training set will be distributed across multiple machines, row-wise (a full row is contained in a single node, so different rows will be stored on different nodes in the cluster). The Java implementations of the distributed machine learning algorithms inside H2O are highly optimized and the training speeds are further accelerated by parallelized training. To reduce communication overhead between nodes, data is compressed and uncompressed on the fly. Within the algorithm implementations, there are many optimizations made to speed up the algorithms. Examples of such optimizations include pre-computing histograms (used extensively in tree-based methods), so they are available on-demand when needed. H2O GBM and Random Forest utilize group-splits for categorical columns, meaning that one-hot encoding (large memory cost) or label encoding (loss of categorical nature) is not neces- sary. Cutting edge optimization techniques such as the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), an algorithm that solves convex optimization problems by breaking them into smaller pieces, are used extensively, providing both speed and scalability. Optimization techniques are also selected dynamically based on data size and shape for further speed-up. For example, the H2O GLM uses a iteratively reweighted least squares method (IRLSM) with a Gram Matrix approach, which is efficient for tall and narrow datasets and when running lambda search via a sparse solution. For wider and dense datasets (thousands of predictors and up), the limited-memory Broyden-Fletcher-Goldfarb- Shanno(L-BFGS)solverscalesbetter,sointhatcase,itwillbeusedautomatically.23 H2O Deep Learning24 includes many optimizations for speeding up training such as the HOG- WILD! (Recht et al., 2011) approach to parelleizing stochastic gradient descent. When the associated optimization problem is sparse, meaning most gradient updates only mod- ify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. Here we have listed a few notable examples, but there are many other optimizations included in H2O algorithms, designed to promote speed and scalability. Random search is an embarrassingly parallel task, which offers additional opportunity for speed-up. The current stable version of H2O AutoML (H2O 3.30.0.3) parallelizes training within a single model and partially for cross-validation across all cores of the H2O cluster, however the random search is executed serially, training a single model at any given time. H2O grid/random searches have a parallel argument which allows the user to specify how many models will be trained at once on the H2O cluster. Automatic parallelization of model training, in which we dynamically decide how many models to train in parallel, is a work in progress and is planned for a future release of H2O AutoML. The goal is to maximize the number of models that can be trained at once, given training set size and compute resources, without overloading the system.

https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf
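
To make the grid-search parallelism point concrete, here is a minimal sketch using the plain h2o-py grid search API; the `parallelism` option, the public iris dataset URL, the hyper-parameter values, and `parallelism=4` are illustrative assumptions, and the exact placement of the option may differ between h2o-py versions:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

# Illustrative dataset: any H2OFrame with a categorical response works here.
train = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_wheader.csv"
)
x = train.columns[:-1]
y = "class"

# parallelism=4 asks the cluster to build up to four grid models concurrently;
# parallelism=1 (the default) reproduces the serial behaviour described above.
grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(ntrees=50, seed=1),
    hyper_params={"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]},
    parallelism=4,
)
grid.train(x=x, y=y, training_frame=train)
print(grid.get_grid(sort_by="logloss", decreasing=False))
```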

vipmaax commented 2 years ago

3.2 Scalability & Speed Due to the efficient implementations of the algorithms and the distributed nature of the H2O platform, H2O AutoML can scale to large datasets (e.g. 100M+ rows) as shown in Section E.1. Training of individual models is parallelized across CPU cores on a single machine, or across a cluster of networked machines in a multinode setting. XGBoost models, which are included in the AutoML algorithm by default, also support GPU acceleration for further speed-up in training. Since a significant portion of training is usually dedicated to XGBoost models, H2O AutoML benefits from GPU acceleration. Appendix A includes a more detailed discussion about the architecture and scalability of the H2O platform. One of the benefits of building an AutoML system on top of a fast, scalable machine learning library, is that you can utilize speed and parallelism to train more models in the same amount of time as compared to AutoML libraries that are built with slower or less scal- able underlying algorithms. As demonstrated in the OpenML AutoML benchmark results in Figure 4, this allows us to use simple, straight-forward techniques like random search and stacking to achieve excellent performance in the same amount of time as algorithms which use more complex tuning techniques such as Bayesian optimization or genetic algorithms.