whoisltd opened this issue 9 months ago (status: Open)
Is there any update on this problem? And what is the minimum configuration needed to run auto-sklearn?
Hello, I have used auto-sklearn in several projects now, but never faced this issue... until today. I think the problem is that auto-sklearn doesn't really stop ongoing training for certain algorithms; it just doesn't start a new one once the time limit has passed. My guess is that certain algorithms ignore some kill signals. I'm also on Linux.
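To make the "ignores some kill signals" guess concrete, here is a toy sketch using only the standard library. It is not auto-sklearn code and does not show what auto-sklearn sends internally; the child program and the name `demo_sigterm_vs_sigkill` are made up for illustration. It only shows that, on POSIX systems, a process can ignore SIGTERM but cannot ignore SIGKILL.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a worker: it ignores SIGTERM, then sleeps.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def demo_sigterm_vs_sigkill():
    """Return (survived_sigterm, returncode_after_sigkill). POSIX only."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE
    )
    # Wait until the child has installed its SIGTERM handler.
    child.stdout.readline()

    child.send_signal(signal.SIGTERM)
    try:
        child.wait(timeout=1)
        survived_sigterm = False
    except subprocess.TimeoutExpired:
        # The child ignored SIGTERM and is still running.
        survived_sigterm = True

    # Popen.kill() sends SIGKILL on POSIX; it cannot be caught or ignored.
    child.kill()
    child.wait(timeout=5)
    child.stdout.close()
    return survived_sigterm, child.returncode


if __name__ == "__main__":
    print(demo_sigterm_vs_sigkill())
```

On POSIX, a process terminated by a signal reports a negative return code equal to the signal number, so the second value identifies SIGKILL as the cause of death.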
I used the function below as a workaround. Instead of using SIGSTOP, it uses SIGKILL, so any running process is killed and the fit errors, but it continues. It needs `psutil`, though.
```python
import time
from multiprocessing import Process

import psutil


def _monitor_children_processes(min_time_limit, max_time_limit):
    """
    Monitor the children processes of this process and kill them if they take
    too long. This spawns a new process which does nothing until `min_time_limit`
    is reached, then it starts waiting for the children processes of this process
    (the parent, not the monitor). If the children processes are still running
    after `max_time_limit`, it kills them with -9.
    """

    def monitor_children_processes(parent):
        pid = psutil.Process().pid  # pid of the monitor process itself
        start_time = time.time()
        while True:
            if time.time() - start_time < min_time_limit:
                time.sleep(60)
                continue
            children = parent.children()
            # the monitor is itself a child of the parent, hence "> 1"
            if len(children) > 1:
                for child in children:
                    # avoid killing this same process
                    if child.pid != pid:
                        try:
                            remaining_time = max_time_limit - (time.time() - start_time)
                            if remaining_time < 0:
                                # kill with -9
                                child.kill()
                            else:
                                child.wait(timeout=remaining_time)
                        except psutil.TimeoutExpired:
                            # kill with -9
                            child.kill()
                        except psutil.NoSuchProcess:
                            pass
            else:
                break

    # run the monitor in a new process (relies on the "fork" start method,
    # the default on Linux, because the target is a nested function)
    monitor = Process(target=monitor_children_processes, args=(psutil.Process(),))
    return monitor


# model, X and y are defined elsewhere
monitor = _monitor_children_processes(3500, 3600)
monitor.start()     # starts the monitor process
model.fit(X, y)     # starts the fit
monitor.join(3600)  # waits for the monitor to finish, but it should end even without this command
```
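For readers who cannot install `psutil`, a different approach is possible with the standard library alone. The sketch below is an alternative technique, not the workaround above: it assumes the training can be launched as a separate command (for example a script that fits the model and saves it to disk), runs that command in its own process group, and sends SIGKILL to the whole group when a hard wall-clock limit is exceeded. The function name `run_with_hard_limit` is hypothetical, and the approach is POSIX only.

```python
import os
import signal
import subprocess
import sys


def run_with_hard_limit(cmd, time_limit):
    """Run `cmd` in its own process group and SIGKILL the group on timeout.

    Returns a tuple (timed_out, returncode).
    """
    # start_new_session=True makes the child the leader of a new session and
    # process group, so workers it spawns land in the same group by default.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.wait(timeout=time_limit)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        # The group id equals the pid of the group leader we just started.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return True, proc.returncode


if __name__ == "__main__":
    # Stand-in for a training script that never returns on its own.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_hard_limit(stuck, time_limit=0.5))
```

The trade-off is that the fitted model lives in another process, so the launched script has to persist whatever should survive. Workers that move themselves into a new session or process group would also escape the group kill.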
Describe the bug
I have a pod in k8s with 56 CPUs. When I run `fit()` on a classification or regression model, the task never finishes, even though the training time is set to `time_left_for_this_task=60`. When I run the same thing on my local machine with 8 CPUs, everything works fine. However, if I increase the time on the local machine to `time_left_for_this_task=1500`, it does not stop training after 1500 seconds either, just like the model on k8s. I don't know what causes this; maybe it is the machine configuration or something else. If there is an error, I would hope for some message to be returned.
Expected behavior
The model stops training once `time_left_for_this_task` has elapsed.
Actual behavior, stacktrace or logfile
The last two lines of AutoML(...).log show:
Environment and installation:
Please give details about your installation: