fonhorst / LightAutoML_Spark

Apache License 2.0
7 stars, 1 fork

Databricks AutoML toolkit #4

Closed. fonhorst closed this issue 2 years ago.

vipmaax commented 2 years ago

Toolkit for Apache Spark ML providing feature clean-up, feature importance calculation, information-gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

https://github.com/databrickslabs/automl-toolkit

1) GitHub stars: 161

2) last release: 0.8.1 (May 26, 2021); first release: 0.7.0 (Feb 23, 2020)

3) Language: Scala 2.12, Python (PySpark)

Currently supported models:

"XGBoost" - XGBoost Classifier or XGBoost Regressor

"RandomForest" - Random Forest Classifier or Random Forest Regressor

"GBT" - Gradient Boosted Trees Classifier or Gradient Boosted Trees Regressor

"Trees" - Decision Tree Classifier or Decision Tree Regressor

"LinearRegression" - Linear Regressor

"LogisticRegression" - Logistic Regression (supports both binomial and multinomial)

"MLPC" - Multi-Layer Perceptron Classifier

"SVM" - Linear Support Vector Machines

"LightGBM" - LightGBM (currently suspended, pending library improvements to LightGBM)

https://github.com/databrickslabs/automl-toolkit/tree/master/src/main/scala/com/databricks/labs/automl/model
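For reference, the model-family keys above correspond to concrete estimator classes: the core families come from Spark ML, while XGBoost and LightGBM come from external Spark wrappers (xgboost4j-spark and SynapseML/MMLSpark). A minimal sketch of such a lookup table in plain Python; the exact class names for the external wrappers, and the `resolve_model` helper itself, are illustrative assumptions, not the toolkit's actual API:

```python
# Hypothetical lookup: model-family key -> (classifier class, regressor class).
# Core names mirror org.apache.spark.ml; XGBoost/LightGBM names follow their
# Spark wrapper libraries and are assumptions here, not verified signatures.
MODEL_FAMILIES = {
    "XGBoost": ("XGBoostClassifier", "XGBoostRegressor"),
    "RandomForest": ("RandomForestClassifier", "RandomForestRegressor"),
    "GBT": ("GBTClassifier", "GBTRegressor"),
    "Trees": ("DecisionTreeClassifier", "DecisionTreeRegressor"),
    "LinearRegression": (None, "LinearRegression"),
    "LogisticRegression": ("LogisticRegression", None),
    "MLPC": ("MultilayerPerceptronClassifier", None),
    "SVM": ("LinearSVC", None),
    "LightGBM": ("LightGBMClassifier", "LightGBMRegressor"),  # suspended upstream
}

def resolve_model(family: str, task: str) -> str:
    """Return the estimator class name for a family/task pair, or raise."""
    clf, reg = MODEL_FAMILIES[family]
    name = clf if task == "classification" else reg
    if name is None:
        raise ValueError(f"{family} does not support {task}")
    return name
```

This makes the coverage gaps in the list explicit: for example, "SVM" resolves only for classification, and "LinearRegression" only for regression.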

se-teryoshkin commented 2 years ago
  1. GitHub: Stars - 161, Forks - 35, Contributors - 7

  2. Releases: Last Release - 0.8.1 (28th May 2021), First Release - (7th March 2020)

  3. Language: Scala, Python (pyspark)

  4. Data Types: Structured

  5. Execution backends: Spark

  6. Distributed framework: Spark

  7. Distributed algorithms libraries: SparkML

  8. AutoML pipeline elements:

       1. Pipeline elements covered in distributed mode: all

       2. Is it possible to mix distributed and non-distributed backends? No

       3. How is the pipeline converted to computing applications? Distributed application

       4. Approach to ML model selection computing: multiple distributed models ?

       5. Data processing steps pruning: yes

       6. Resource managers: Spark's resource managers

       7. Hyperparameter tuner: Spark ML

       8. Task types: Classification, Regression

  9. Tests: local mode, distributed mode.

  10. Comments: --
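The "multiple distributed models" selection approach noted above can be sketched as a loop that launches one (distributed) training run per candidate family and keeps the best-scoring one. This is a plain-Python illustration of the pattern, not the toolkit's code; `train_and_score` is a hypothetical stand-in for a Spark training job:

```python
from typing import Callable, Dict, Tuple

def select_best_model(
    families: Dict[str, dict],
    train_and_score: Callable[[str, dict], float],
) -> Tuple[str, float]:
    """Train each candidate model family and return (best_family, best_score).

    train_and_score(family, config) is assumed to run one distributed
    training job and return a validation metric where higher is better.
    """
    best_family, best_score = "", float("-inf")
    for family, config in families.items():
        score = train_and_score(family, config)  # one distributed run per family
        if score > best_score:
            best_family, best_score = family, score
    return best_family, best_score
```

In this scheme each model family is itself trained distributedly on Spark, which matches the thread's observation that all pipeline elements run in distributed mode and non-distributed backends cannot be mixed in.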