TPOTRegressor reverts back to single threads after running for some time with n_jobs=-1 #1273

Closed edubu2 closed 1 year ago

edubu2 commented 1 year ago

I started a TPOTRegressor run yesterday on a large dataset (8M rows x 40 features) on a large ML server (Linux RHE) with 16 CPUs (2 threads per core) and 256 GiB of memory (no GPU, no PyTorch NNs). When I started it last night, it ran consistently at 3200% CPU (100% per thread, as intended). However, when I checked on it this morning, total CPU utilization had dropped to 100%, occasionally jumping to 200% but no higher. This has been going on for at least 4 hours. Nothing else is running on the machine. Perhaps it's evaluating a model that can't use multiprocessing, but I feel TPOT should be using the idle resources to evaluate the next pipeline.

Context of the issue

PID   USER     PR NI VIRT  RES   SHR    S %CPU  %MEM TIME+    COMMAND
17946 ec2-user 20 0  42.3g 14.0g 111504 S 100.3 5.6  25959:06 python
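To put the `%CPU` column in context: `top` reports it per process as a percentage of one logical CPU, so 3200% corresponds to all 32 hardware threads being busy and 100.3% to roughly one. A minimal sketch that parses the row above and makes that conversion explicit (the helper names are illustrative, and the count of 32 logical CPUs is taken from the "16 CPUs, 2 threads per core" description):

```python
# Row and header copied from the top output in the post.
TOP_HEADER = "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND"
TOP_ROW = "17946 ec2-user 20 0 42.3g 14.0g 111504 S 100.3 5.6 25959:06 python"

# Assumption from the post: 16 cores x 2 threads per core.
LOGICAL_CPUS = 16 * 2


def parse_top_row(header: str, row: str) -> dict:
    """Pair each column name with its value; COMMAND keeps any trailing text."""
    names = header.split()
    values = row.split(maxsplit=len(names) - 1)
    return dict(zip(names, values))


def busy_logical_cpus(cpu_percent: float) -> float:
    """Convert top's per-process %CPU into an equivalent number of busy CPUs."""
    return cpu_percent / 100.0


fields = parse_top_row(TOP_HEADER, TOP_ROW)
busy = busy_logical_cpus(float(fields["%CPU"]))
print(f"PID {fields['PID']}: about {busy:.1f} of {LOGICAL_CPUS} logical CPUs busy")
```

On a live machine, `os.cpu_count()` would give the logical CPU count directly instead of the hard-coded value.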

As it's only 2% complete after 12 hours, it's not a viable option for my pipeline tuning and model selection. Downsampling is not ideal for my use case, but I am already using it to reduce the data size by 40% to increase speed. For comparison, I'm able to preprocess the same data (without downsampling) and fit a LightGBM model on my local machine (8 cores/16GB RAM, OSX) in about 5 minutes.

I've one-hot encoded my categorical features and imputed values for all NaN records.

Another thing to point out (likely not useful, but maybe): the progress bar displayed 0% for at least 6 hours after starting, while CPU was at 3200%. When I checked this morning, it was at 2% complete with 100-200% CPU utilization.

Process to reproduce the issue

The code below is part of my main() function, which is run from the command line with nohup run.py &.

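from sklearn.model_selection import TimeSeriesSplit
from tpot import TPOTRegressor

tscv = TimeSeriesSplit(n_splits=3)
print("Created CV (tscv).")
tpot = TPOTRegressor(
    # other arguments were not shown in the original post
    log_file='tpot_log.log',
)
print("Created tpot object.")

tpot.fit(X_train, y_train)
print("FINAL SCORE:", tpot.score(X_test, y_test))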
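For runs on data this large, TPOTRegressor exposes a few settings that bound how long any single pipeline can take. The sketch below collects some of them in a plain dictionary; the values are hypothetical placeholders, not the poster's configuration, and `bounded_tpot_kwargs` and `build_regressor` are illustrative helper names. Whether these settings would have changed the behaviour described here is an assumption, not something the thread confirms.

```python
def bounded_tpot_kwargs(n_jobs=-1, max_eval_time_mins=10, subsample=0.1):
    """Return keyword arguments that cap per-pipeline cost (illustrative values)."""
    return {
        "n_jobs": n_jobs,                          # -1 uses all available cores
        "max_eval_time_mins": max_eval_time_mins,  # time limit for one pipeline
        "subsample": subsample,                    # fraction of training rows used in the search
        "config_dict": "TPOT light",               # built-in config with simpler, faster operators
        "verbosity": 2,                            # show the progress bar
        "log_file": "tpot_log.log",                # same log file name as in the post
    }


def build_regressor(**overrides):
    """Create a TPOTRegressor from the settings above; requires tpot to be installed."""
    from tpot import TPOTRegressor  # imported lazily so the settings can be inspected without tpot

    return TPOTRegressor(**{**bounded_tpot_kwargs(), **overrides})


print(bounded_tpot_kwargs())
```

A call such as `build_regressor(cv=tscv).fit(X_train, y_train)` would then replace the constructor call above, with `tscv`, `X_train` and `y_train` as defined in the poster's script.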


After 3-4 hours, it's now back up to 3200%. This is likely not an issue with multiprocessing, but I'm really curious about what processing could take so long.

edubu2 commented 1 year ago

Log output thus far (it has been stuck on pipeline 55 since I checked this morning).

$ cat tpot_log.log
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 85.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 80.
_pre_test decorator: _random_mutation_operator: num_test=0 manhattan was provided as affinity. Ward can only work with euclidean distances..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 54.
Optimization Progress:   2%|▏         | 54/2550 [8:57:48<314:58:55, 454.30s/pipeline]
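
The remaining-time estimate on the progress line can be reproduced by hand: pipelines left multiplied by the reported seconds per pipeline. A small sketch using only the figures from that line (the function names are illustrative):

```python
def remaining_seconds(done: int, total: int, secs_per_item: float) -> float:
    """Estimate the remaining time as items left times the current rate."""
    return (total - done) * secs_per_item


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS, truncating fractional seconds."""
    whole = int(seconds)
    hours, rem = divmod(whole, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Figures from the progress line: 54/2550 pipelines at 454.30 s/pipeline.
eta = remaining_seconds(done=54, total=2550, secs_per_item=454.30)
print(format_hms(eta))            # 314:58:52
print(f"{eta / 86400:.1f} days")  # 13.1 days
```

The result is within a few seconds of the 314:58:55 shown by the bar, presumably because the displayed rate is rounded. At that rate the search would need about 13 more days to finish.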
edubu2 commented 1 year ago

Seems to be working. Disregard.