automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

[Bug] Under utilizing CPU usage #1363

Open JustinDoIt opened 2 years ago

JustinDoIt commented 2 years ago

I trained my model on a 36-core CPU and set n_jobs=-1, and it worked:

automl = autosklearn.classification.AutoSklearnClassifier(
    n_jobs=-1,
)

However, judging from htop, auto-sklearn only occupies one or two cores most of the time. Is there any way to improve CPU utilization?

[Screenshot: CPU_info, showing htop CPU usage]

eddiebergman commented 2 years ago

Hi @JustinDoIt,

We generally seem to have issues with large numbers of available CPUs, and we are unable to test these problems very well due to our infrastructure.

A common issue is documented in #1236; it is odd that that bug does not happen in your case, so thank you for reporting this.

We use SMAC as our optimizer and ConfigSpace as our search space builder. We believe the bottleneck is in one of these two places, but we need to spend some dedicated time to figure out why new processes are not started.

In the meantime, it would help to get a better idea of when the bottleneck is reached. For example, auto-sklearn can effectively use around 6 cores on my machine with no problem.

Best, Eddie

JustinDoIt commented 2 years ago

Hi @eddiebergman

Actually, I also encounter the error mentioned in #1236 (details follow):

  1. With n_jobs=8 the issue still remains.
  2. The size of the dataset is 1076.
  3. I have tested it quite a few times; sometimes it works normally, and sometimes it runs into memory utilization problems (same code, slightly different params).

In a specific case, all cores worked normally for the first 30 iterations or so, but after that only one core was working. At iteration 263, the memory consumption of two processes suddenly became very large (I didn't notice whether it was also large before; I missed it). Subsequently, the error mentioned in #1236 was encountered and the program crashed.

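For context, auto-sklearn enforces a per-job memory cap via the memory_limit argument (in MB, default 3072); passing memory_limit=None, as in the script shared further down, disables that cap. A minimal sketch, with illustrative values only, of keeping a finite limit in place:

import autosklearn.classification

# Illustrative values; a finite memory_limit lets auto-sklearn abort individual
# runs that exceed the cap instead of letting their memory grow unbounded.
automl = autosklearn.classification.AutoSklearnClassifier(
    n_jobs=8,
    memory_limit=3072,       # MB per job (the library default)
    per_run_time_limit=300,  # seconds per candidate model
)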

e3vela commented 2 years ago

@JustinDoIt just out of curiosity, have you also used additional arguments when instantiating the class? I used to have the same issue but got it solved when I realized that I was messing around with the metric and scoring_functions arguments. I can give more details if you want.
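
For reference, those two arguments are passed straight to the estimator; a minimal sketch with illustrative values (not necessarily what e3vela used):

import autosklearn.classification
from autosklearn.metrics import balanced_accuracy, f1, precision, recall

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,
    n_jobs=-1,
    metric=balanced_accuracy,                    # the single metric the optimizer targets
    scoring_functions=[f1, precision, recall],   # extra metrics recorded per run in cv_results_
)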

JustinDoIt commented 2 years ago

@e3vela Here is my code; thanks for the comments.

# -*- coding: utf-8 -*-

import autosklearn.classification
import autosklearn

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score, confusion_matrix, matthews_corrcoef,cohen_kappa_score,mean_absolute_error,mean_squared_error,r2_score
from sklearn.inspection import plot_partial_dependence, permutation_importance
import matplotlib.pyplot as plt

from autosklearn.metrics import balanced_accuracy, precision, recall, f1

import os
import ast

import logging
import time
from time import strftime, gmtime
import random
import sys
from datetime import datetime

from joblib import dump, load

def automl_feat_comb(feat_comb, exp_name, runtime):
    now_time = str(datetime.now()).replace(' ', '-').replace('.', '-')
    log = create_logger(
        name=exp_name,
        silent=False,
        to_disk=True,
        log_file=f'{exp_name}_{now_time}.txt',
    )
    logging_config = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'custom': {
                # More format options are available in the official
                # `documentation <https://docs.python.org/3/howto/logging-cookbook.html>`_
                'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
            }
        },

        # Any INFO level msg will be printed to the console
        'handlers': {
            'console': {
                'level': 'INFO',
                'formatter': 'custom',
                'class': 'logging.StreamHandler',
                'stream': 'ext://sys.stdout',
            },
        },

        'loggers': {
            '': {  # root logger
                'level': 'DEBUG',
            },
            'Client-EnsembleBuilder': {
                'level': 'DEBUG',
                'handlers': ['console'],
            },
        },
    }

    # Data
    df = pd.read_csv('xxxxx.csv') # sorry 
    X = df.loc[:, feat_comb]
    y = df.loc[:, 'Status']
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Model
    per_runtime = min(runtime // 10, 1800)
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=runtime,
        per_run_time_limit=per_runtime,
        initial_configurations_via_metalearning=25,
        ensemble_size=50,
        ensemble_nbest=50,
        max_models_on_disc=None,
        memory_limit=None,
        tmp_folder='./tmp_folder' + str(random.randrange(1,1000)),
        delete_tmp_folder_after_terminate=True,
        n_jobs=-1,
        seed=42,
        logging_config=logging_config,
    )
    # Train
    automl.fit(x_train, y_train)

    # save 
    dump(automl, 'september.joblib')

    # Evaluation
    predictions_x_train = automl.predict(x_train)
    predictions = automl.predict(x_test)

    # Logging
    log.info("[INFO] start training....")

def main():
    feat_combs = {
        'ga10000': ['XXX', ..., 'XXXX'],
    }
    for exp_name, feat_comb in feat_combs.items():
        runtime = 600
        automl_feat_comb(feat_comb, exp_name=f"{exp_name}_feat_{len(feat_comb)}", runtime=runtime)

if __name__ == '__main__':
    main()

e3vela commented 2 years ago

I haven't tried the code myself but I don't see anything wrong in your implementation. I might run it later when I have the time.

mfeurer commented 2 years ago

This is most likely due to https://github.com/automl/SMAC3/issues/774, which basically says that getting new configurations (i.e. deciding which model with which hyperparameters to try next) is not executed in parallel. When running in parallel, and evaluating configurations is faster than the suggestion mechanism, you'll observe the pattern reported here, namely that auto-sklearn uses only a single core. Up to iteration 30, auto-sklearn suggests configurations via meta-learning (in a single batch), which explains why parallelism works in the beginning. Unfortunately, there is not really anything that can be done about this. In such cases you might be better off using random search, as it can make full use of the parallel setting.
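
For reference, random search can be plugged in through the get_smac_object_callback argument by returning SMAC's ROAR facade instead of the default optimizer. The sketch below follows the pattern of the random-search example in the auto-sklearn documentation; the exact set of keyword arguments passed to the callback differs between auto-sklearn versions, so treat it as an outline rather than copy-paste code:

import autosklearn.classification
from smac.facade.roar_facade import ROAR
from smac.scenario.scenario import Scenario

def get_roar_object_callback(scenario_dict, seed, ta, ta_kwargs,
                             n_jobs=1, dask_client=None, **kwargs):
    # ROAR = Random Online Aggressive Racing, i.e. random search.
    # **kwargs absorbs arguments (e.g. metalearning_configurations) that
    # some auto-sklearn versions additionally pass to this callback.
    scenario = Scenario(scenario_dict)
    return ROAR(
        scenario=scenario,
        rng=seed,
        tae_runner=ta,
        tae_runner_kwargs=ta_kwargs,
        run_id=seed,
        dask_client=dask_client,
        n_jobs=n_jobs,
    )

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,
    n_jobs=-1,
    initial_configurations_via_metalearning=0,  # skip the meta-learning batch for pure random search
    get_smac_object_callback=get_roar_object_callback,
)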

JustinDoIt commented 2 years ago

OK, I only know a little about meta-learning, but it sounds like after meta-learning has found a nice configuration, it doesn't need to continue in parallel, does it?

In my tests and in my case, parallelism does not affect accuracy, so (combined with my understanding above) I think this is not a real bug that needs attention. Therefore, I will close this issue in 3 days if there is no objection.

By the way, I actually do encounter the bug mentioned in #1236 (but not every time). I think there may be some relationship between the two issues (meta-learning?).

Finally, thanks to the auto-sklearn team for your contribution. Auto-sklearn really is an awesome and great nuclear weapon. :D

mfeurer commented 1 year ago

Re-opening this to track that we still have an issue with parallelism when the dataset is small.