Auto-sklearn never stops training the model

whoisltd commented 9 months ago

Describe the bug

I have a pod in k8s with 56 CPUs. When I call fit() for classification or regression, the task never finishes, even though the time limit is set with time_left_for_this_task=60. On my local machine with 8 CPUs, everything works fine with that setting. However, if I increase the limit on the local machine to time_left_for_this_task=1500, training does not stop after 1500 seconds either, just like on k8s. I don't know what causes this; it may be related to the machine configuration or something else. If an error occurs, I would expect some message to be returned.

Expected behavior

The model stops training once time_left_for_this_task has elapsed.

Actual behavior, stacktrace or logfile

The last two lines of AutoML(...).log are:

[DEBUG] [2023-09-18 17:15:44,242:Client-pynisher] Redirecting output of the function to files. Access them via the stdout and stderr attributes of the wrapped function.
[DEBUG] [2023-09-18 17:15:44,243:Client-pynisher] call function

Environment and installation:

Please give details about your installation:

OS Ubuntu 20.04
virtual environment
Python 3.8
Auto-sklearn 0.15.0

whoisltd commented 9 months ago

Is there any update on this problem? Also, what is the minimum configuration needed to run auto-sklearn?

00sapo commented 1 month ago

Hello, I have used auto-sklearn in several projects now and never faced this issue... until today. I think the problem is that auto-sklearn doesn't really stop ongoing training for certain algorithms; it just doesn't start a new one once the time limit has passed. My guess is that certain algorithms ignore some kill signals. I'm also on Linux.
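To make the guess about ignored signals concrete, here is a minimal, standard-library-only sketch. It does not use auto-sklearn and says nothing certain about its internals: `STUBBORN_CHILD`, `stop_process` and `demo` are hypothetical names, and the child script is only a stand-in for a training job that ignores SIGTERM. It assumes a POSIX system such as Linux.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a training job that ignores SIGTERM.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_process(proc, grace_seconds=1.0):
    """Send SIGTERM first; fall back to SIGKILL after a grace period.

    Returns the name of the signal that actually ended the process.
    """
    proc.terminate()  # SIGTERM: the child may catch or ignore it
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught or ignored
        proc.wait()
        return "SIGKILL"


def demo():
    """Start the stubborn child, stop it, and report what was needed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the child has installed its handler
    used = stop_process(proc)
    proc.stdout.close()
    return used, proc.returncode
```

Calling `demo()` shows that the polite signal is not enough for such a child, which would be consistent with a fit that keeps running past its time limit if the code doing the training behaves the same way.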

00sapo commented 1 month ago

I used this function as a workaround. Instead of SIGSTOP, it uses SIGKILL, so any running process is killed; the fit reports an error but continues. It needs psutil, though.


```python
import time


def _monitor_children_processes(min_time_limit, max_time_limit):
    """
    Monitor the children processes of this process and kill them if they take
    too long. This spawns a new process which does nothing until `min_time_limit`
    is reached, then it starts waiting for the children processes of this process
    (the parent, not the monitor). If the children processes are still running
    after `max_time_limit`, it kills them with -9.
    """
    import psutil
    from multiprocessing import Process

    def monitor_children_processes(parent):
        # pid of the monitor process itself (this function runs inside it)
        pid = psutil.Process().pid
        start_time = time.time()
        while True:
            if time.time() - start_time < min_time_limit:
                time.sleep(60)
                continue
            children = parent.children()
            # the monitor is itself a child of `parent`, hence "> 1"
            if len(children) > 1:
                for child in children:
                    # avoid killing this same process
                    if child.pid != pid:
                        try:
                            remaining_time = max_time_limit - (time.time() - start_time)
                            if remaining_time < 0:
                                # kill with -9
                                child.kill()
                            else:
                                child.wait(timeout=remaining_time)
                        except psutil.TimeoutExpired:
                            # kill with -9
                            child.kill()
                        except psutil.NoSuchProcess:
                            pass
            else:
                break

    # run the monitor in a new process; the nested target is not picklable,
    # so this relies on the "fork" start method
    monitor = Process(target=monitor_children_processes, args=(psutil.Process(),))
    return monitor


monitor = _monitor_children_processes(3500, 3600)
monitor.start()  # starts the monitor process
model.fit(X, y)  # starts the fit
monitor.join(3600)  # waits for the monitor to finish, but it should end even without this call
```
