fonhorst / LightAutoML_Spark

Apache License 2.0
7 stars, 1 fork

Databricks AutoML toolkit #4

Closed. fonhorst closed this issue 2 years ago.

vipmaax commented 2 years ago

Toolkit for Apache Spark ML providing feature clean-up, feature importance calculation, information-gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

https://github.com/databrickslabs/automl-toolkit

1) GitHub stars: 161

2) last release: 0.8.1 (May 26, 2021); first release: 0.7.0 (Feb 23, 2020)

3) Language: Scala 2.12, Python (PySpark)

Currently supported models:

"XGBoost" - XGBoost Classifier or XGBoost Regressor

"RandomForest" - Random Forest Classifier or Random Forest Regressor

"GBT" - Gradient Boosted Trees Classifier or Gradient Boosted Trees Regressor

"Trees" - Decision Tree Classifier or Decision Tree Regressor

"LinearRegression" - Linear Regressor

"LogisticRegression" - Logistic Regression (supports both binomial and multinomial)

"MLPC" - Multi-Layer Perceptron Classifier

"SVM" - Linear Support Vector Machines

"LightGBM" - LightGBM (currently suspended, pending library improvements to LightGBM)

https://github.com/databrickslabs/automl-toolkit/tree/master/src/main/scala/com/databricks/labs/automl/model
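For reference, the model-family keys above correspond to concrete estimator classes: the core families come from Spark ML, while XGBoost and LightGBM come from external Spark wrappers (xgboost4j-spark and SynapseML/MMLSpark). A minimal sketch of such a lookup table in plain Python; the exact class names for the external wrappers, and the `resolve_model` helper itself, are illustrative assumptions, not the toolkit's actual API:

```python
# Hypothetical lookup: model-family key -> (classifier class, regressor class).
# Core names mirror org.apache.spark.ml; XGBoost/LightGBM names follow their
# Spark wrapper libraries and are assumptions here, not verified signatures.
MODEL_FAMILIES = {
    "XGBoost": ("XGBoostClassifier", "XGBoostRegressor"),
    "RandomForest": ("RandomForestClassifier", "RandomForestRegressor"),
    "GBT": ("GBTClassifier", "GBTRegressor"),
    "Trees": ("DecisionTreeClassifier", "DecisionTreeRegressor"),
    "LinearRegression": (None, "LinearRegression"),
    "LogisticRegression": ("LogisticRegression", None),
    "MLPC": ("MultilayerPerceptronClassifier", None),
    "SVM": ("LinearSVC", None),
    "LightGBM": ("LightGBMClassifier", "LightGBMRegressor"),  # suspended upstream
}

def resolve_model(family: str, task: str) -> str:
    """Return the estimator class name for a family/task pair, or raise."""
    clf, reg = MODEL_FAMILIES[family]
    name = clf if task == "classification" else reg
    if name is None:
        raise ValueError(f"{family} does not support {task}")
    return name
```

This makes the coverage gaps in the list explicit: for example, "SVM" resolves only for classification, and "LinearRegression" only for regression.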

se-teryoshkin commented 2 years ago
  1. GitHub: Stars - 161, Forks - 35, Contributors - 7

  2. Releases: Last Release - 0.8.1 (28th May 2021), First Release - (7th March 2020)

  3. Language: Scala, Python (pyspark)

  4. Data Types: Structured

  5. Execution backends: Spark

  6. Distributed framework: Spark

  7. Distributed algorithms libraries: SparkML

  8. AutoML pipeline elements:

       1. Pipeline elements covered in distributed mode: all

       2. Is it possible to mix distributed and non-distributed backends? No

       3. How is the pipeline converted to computing applications? Distributed application

       4. Approach to ML model selection computing: multiple distributed models ?

       5. Data processing steps pruning: yes

       6. Resource managers: Spark's resource managers

       7. Hyperparameter tuner: Spark ML

       8. Task types: Classification, Regression

  9. Tests: local mode, distributed mode.

  10. Comments: --
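The "multiple distributed models" selection approach noted above can be sketched as a loop that launches one (distributed) training run per candidate family and keeps the best-scoring one. This is a plain-Python illustration of the pattern, not the toolkit's code; `train_and_score` is a hypothetical stand-in for a Spark training job:

```python
from typing import Callable, Dict, Tuple

def select_best_model(
    families: Dict[str, dict],
    train_and_score: Callable[[str, dict], float],
) -> Tuple[str, float]:
    """Train each candidate model family and return (best_family, best_score).

    train_and_score(family, config) is assumed to run one distributed
    training job and return a validation metric where higher is better.
    """
    best_family, best_score = "", float("-inf")
    for family, config in families.items():
        score = train_and_score(family, config)  # one distributed run per family
        if score > best_score:
            best_family, best_score = family, score
    return best_family, best_score
```

In this scheme each model family is itself trained distributedly on Spark, which matches the thread's observation that all pipeline elements run in distributed mode and non-distributed backends cannot be mixed in.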