fonhorst / LightAutoML_Spark

Apache License 2.0

Auto-Sklearn #13

Closed: se-teryoshkin closed this issue 2 years ago

se-teryoshkin commented 3 years ago

https://github.com/automl/auto-sklearn

  1. GitHub: Stars - 5.8k, Forks - 1.1k, Contributors - 68

  2. Releases: Last Release - 0.14 (19th September 2021), First Release - (17th October 2017)

  3. Language: Python 3.7 - 3.9

  4. Data Types: Structured data. The estimator expects input such as X_train, y_train. On the other hand, unstructured data (such as images) can be preprocessed and converted into structured data/vectors first.

  5. Execution backends: joblib, Dask

  6. Distributed framework: Dask

  7. Distributed algorithms libraries: None. It uses sklearn under the hood.

  8. AutoML pipeline elements:

  1. Elements of pipeline are covered in distributed mode: N/A

  2. Is it possible to mix distributed and non-distributed backends?: Within one pipeline - no.

  3. How is the pipeline converted to computing applications?: The pipeline may run distributed or locally. I'm not sure whether the entire pipeline is distributed. Probably some steps are executed only by the client, in local mode.

  4. Approach to ML models selection computing: Multiple distributed models (https://ml.dask.org/hyper-parameter-search.html)

  5. Data processing steps pruning: Repeated work can be avoided using Dask features (https://ml.dask.org/hyper-parameter-search.html#avoid-repeated-work)

  6. Resource managers: Auto-sklearn is able to use the resource managers supported by Dask: YARN, Kubernetes, Slurm, LSF, SGE and others (https://docs.dask.org/en/latest/how-to/deploy-dask/hpc.html). A cluster can also be deployed on cloud platforms such as AWS, GCP, Azure.

  7. Hyperparameter tuner: SMAC (Bayesian optimization), which auto-sklearn uses internally; see also Dask hyperparameter search (https://ml.dask.org/hyper-parameter-search.html), sklearn

  8. Task types: Classification, Regression

  9. Tests: local mode, distributed mode.

  10. Comments: https://www.automl.org/
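
The points above about data types and the Dask backend can be illustrated with a rough sketch. It assumes grayscale images stored as nested lists of pixel values and an already running Dask scheduler. The helper names `images_to_features` and `fit_with_dask`, the time limits, and the scheduler address are made up for illustration; `dask_client` is a constructor parameter of `AutoSklearnClassifier`, but check its behavior against the installed auto-sklearn version.

```python
from typing import List, Sequence

# One grayscale image as rows of pixel values (an assumption for this sketch).
Image = Sequence[Sequence[float]]


def images_to_features(images: Sequence[Image]) -> List[List[float]]:
    """Flatten each 2-D image into one fixed-length feature row."""
    rows = [[float(v) for line in img for v in line] for img in images]
    # A structured X_train needs the same number of columns in every row.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all images must have the same number of pixels")
    return rows


def fit_with_dask(X_train, y_train, scheduler_address: str):
    """Fit auto-sklearn on a Dask cluster reachable at scheduler_address.

    Imports are local so that images_to_features stays usable
    without auto-sklearn or Dask installed.
    """
    import numpy as np
    from dask.distributed import Client
    from autosklearn.classification import AutoSklearnClassifier

    client = Client(scheduler_address)  # connect to the running scheduler
    automl = AutoSklearnClassifier(
        time_left_for_this_task=120,  # total budget in seconds (arbitrary)
        per_run_time_limit=30,        # budget per candidate model (arbitrary)
        dask_client=client,           # run model evaluations on Dask workers
    )
    automl.fit(np.asarray(X_train), np.asarray(y_train))
    return automl
```

With a scheduler listening on, for example, `tcp://127.0.0.1:8786`, a call might look like `fit_with_dask(images_to_features(images), labels, "tcp://127.0.0.1:8786")`. Flattening raw pixels is only the simplest possible conversion; an embedding from a pretrained model would be another way to get vectors.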