EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOTRegressor reverts to a single thread after running for some time with n_jobs=-1 #1273

Closed. edubu2 closed this issue 1 year ago.

edubu2 commented 1 year ago

I started with TPOTRegressor on a large dataset of 8M rows x 40 features yesterday, on a large ML server (Linux RHEL) with 16 CPUs (2 threads per core) and 256 GiB of memory (no GPU, no PyTorch NNs). Last night, when I started it, it was running consistently at 3200% CPU (one per thread, as intended). However, when I returned to check on it this morning, total CPU utilization had dropped back to 100%, sometimes jumping to 200% but no higher, and it has stayed that way for at least 4 hours. Nothing else is running on the machine. Perhaps it's trying a model in which multiprocessing isn't possible, but I feel TPOT should be using the available resources for the next pipeline.

Context of the issue

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17946 ec2-user  20   0   42.3g  14.0g 111504 S 100.3  5.6  25959:06 python

As it's only 2% complete after 12 hours, this isn't a viable option for my pipeline tuning and model selection. Downsampling is not ideal for my use case, but I'm still using it to reduce the dataset size by 40% and speed things up. For comparison, I'm able to preprocess and fit a LightGBM model on my local machine (8 cores/16 GB RAM, macOS), using the same data with no downsampling, in about 5 minutes.
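
For concreteness, a minimal sketch of what that local LightGBM baseline might look like; the hyperparameters below are illustrative assumptions, not taken from the original post:

import lightgbm as lgb

# Hypothetical baseline on the same X_train/y_train used for TPOT below;
# n_estimators and random_state are illustrative, not from the original post.
model = lgb.LGBMRegressor(n_estimators=500, n_jobs=-1, random_state=12)
model.fit(X_train, y_train)
print("LightGBM R^2:", model.score(X_test, y_test))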

I've one-hot encoded my categorical features and imputed values for all NaN records.
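
For reference, a minimal sketch of that preprocessing step with pandas and scikit-learn; the file path and column names here are hypothetical, since the actual features aren't shown in this issue:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_parquet("train.parquet")  # hypothetical input file

# One-hot encode the categorical features (hypothetical column names).
df = pd.get_dummies(df, columns=["store_id", "region"])

# Impute remaining NaN values with the column median.
num_cols = df.select_dtypes("number").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])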

Another thing to point out (likely not useful, but maybe): the progress bar displayed 0% for at least 6 hours after starting, while CPU was at 3200%. When I checked this morning, it showed 2% complete with 100-200% CPU utilization.

Process to reproduce the issue

The code below is part of my main() function, run from the command line with nohup run.py &.

from sklearn.model_selection import TimeSeriesSplit
from tpot import TPOTRegressor

# Time-ordered CV splits (no shuffling across the time axis).
tscv = TimeSeriesSplit(n_splits=3)
print("Created CV (tscv).")

tpot = TPOTRegressor(
    use_dask=True,
    subsample=0.6,                       # train on 60% of the rows
    generations=50,
    population_size=50,
    scoring='neg_mean_absolute_error',
    cv=tscv,
    random_state=12,
    n_jobs=-1,                           # use all available cores
    verbosity=3,
    log_file='tpot_log.log',
)
print("Created tpot object.")

tpot.fit(X_train, y_train)
print("FINAL SCORE:", tpot.score(X_test, y_test))

Update

After 3-4 hours, it's now back up to 3200%. So it's likely not an issue with multiprocessing, but I'm really curious what processing could take so long.

edubu2 commented 1 year ago

Conda list output:

# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_kmp_llvm conda-forge
_py-xgboost-mutex 2.0 cpu_0 conda-forge
abseil-cpp 20211102.0 h93e1e8c_2 conda-forge
arrow-cpp 8.0.0 py310h3098874_0
aws-c-common 0.4.57 he6710b0_1
aws-c-event-stream 0.1.6 h2531618_5
aws-checksums 0.1.9 he6710b0_0
aws-sdk-cpp 1.8.185 hce553d0_0
blas 1.0 mkl
bokeh 2.4.3 pyhd8ed1ab_3 conda-forge
boost-cpp 1.78.0 he72f1d9_0 conda-forge
boto3 1.24.28 py310h06a4308_0 anaconda
botocore 1.27.28 py310h06a4308_0 anaconda
brotli 1.0.9 h166bdaf_7 conda-forge
brotli-bin 1.0.9 h166bdaf_7 conda-forge
brotlipy 0.7.0 py310h5764c6d_1004 conda-forge
bzip2 1.0.8 h7b6447c_0
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2022.9.24 ha878542_0 conda-forge
certifi 2022.9.24 pyhd8ed1ab_0 conda-forge
cffi 1.15.1 py310h74dc2b5_0
charset-normalizer 2.1.1 pyhd8ed1ab_0 conda-forge
click 8.1.3 py310hff52083_0 conda-forge
cloudpickle 2.2.0 pyhd8ed1ab_0 conda-forge
colorama 0.4.5 pyhd8ed1ab_0 conda-forge
cryptography 38.0.2 py310h597c629_1 conda-forge
cudatoolkit 11.3.1 h2bc3f7f_2
cytoolz 0.12.0 py310h5764c6d_0 conda-forge
dask 2022.10.0 pyhd8ed1ab_2 conda-forge
dask-core 2022.10.0 pyhd8ed1ab_1 conda-forge
dask-glm 0.2.0 py_1 conda-forge
dask-ml 2022.5.27 pyhd8ed1ab_0 conda-forge
deap 1.3.3 py310h769672d_0 conda-forge
distributed 2022.10.0 pyhd8ed1ab_2 conda-forge
fftw 3.3.9 h27cfd23_1
freetype 2.10.4 h0708190_1 conda-forge
fsspec 2022.10.0 pyhd8ed1ab_0 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
giflib 5.2.1 h36c2ea0_2 conda-forge
glog 0.6.0 h6f12383_0 conda-forge
greenlet 1.1.1 py310h295c915_0 anaconda
grpc-cpp 1.46.1 h33aed49_0
heapdict 1.0.1 py_0 conda-forge
icu 70.1 h27087fc_0 conda-forge
idna 3.4 pyhd8ed1ab_0 conda-forge
intel-openmp 2021.4.0 h06a4308_3561
jinja2 3.1.2 pyhd8ed1ab_1 conda-forge
jmespath 0.10.0 pyhd3eb1b0_0 anaconda
joblib 1.2.0 pyhd8ed1ab_0 conda-forge
jpeg 9e h166bdaf_2 conda-forge
krb5 1.19.2 hac12032_0 anaconda
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h9c3ff4c_0 conda-forge
libabseil 20211102.0 cxx17_h48a1fff_2 conda-forge
libbrotlicommon 1.0.9 h166bdaf_7 conda-forge
libbrotlidec 1.0.9 h166bdaf_7 conda-forge
libbrotlienc 1.0.9 h166bdaf_7 conda-forge
libcurl 7.84.0 h91b91d3_0
libdeflate 1.8 h7f8727e_5
libedit 3.1.20210910 h7f8727e_0 anaconda
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 h9b69904_4 conda-forge
libffi 3.3 he6710b0_2
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libllvm11 11.1.0 hf817b99_2 conda-forge
libnghttp2 1.46.0 hce63b2e_0
libpng 1.6.37 hbc83047_0
libpq 12.9 h16c4e8d_3 anaconda
libprotobuf 3.20.1 h4ff587b_0
libssh2 1.10.0 ha56f1ee_2 conda-forge
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libthrift 0.15.0 he6d91bd_0 conda-forge
libtiff 4.4.0 hecacb30_0
libuuid 1.0.3 h7f8727e_2
libwebp 1.2.4 h522a892_0 conda-forge
libwebp-base 1.2.4 h166bdaf_0 conda-forge
libxgboost 1.6.2 cpu_ha3b9936_1 conda-forge
llvm-openmp 14.0.6 h9e868ea_0
llvmlite 0.39.1 py310he621ea3_0
locket 1.0.0 pyhd8ed1ab_0 conda-forge
lz4 4.0.0 py310h5d5e884_2 conda-forge
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
markupsafe 2.1.1 py310h5764c6d_1 conda-forge
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py310h7f8727e_0
mkl_fft 1.3.1 py310hd6ae3a3_0
mkl_random 1.2.2 py310h00e6091_0
msgpack-python 1.0.4 py310hbf28c38_0 conda-forge
multipledispatch 0.6.0 py_0 conda-forge
ncurses 6.3 h5eee18b_3
numba 0.56.3 py310ha5257ce_0 conda-forge
numpy 1.22.3 py310hfa59a62_0
numpy-base 1.22.3 py310h9585f30_0
openssl 1.1.1q h166bdaf_1 conda-forge
orc 1.7.4 h07ed6aa_0
packaging 21.3 pyhd8ed1ab_0 conda-forge
pandas 1.5.1 py310h769672d_0 conda-forge
partd 1.3.0 pyhd8ed1ab_0 conda-forge
pillow 9.2.0 py310hace64e9_1
pip 22.2.2 py310h06a4308_0
psutil 5.9.3 py310h5764c6d_0 conda-forge
psycopg2 2.8.6 py310h8f2d780_1 anaconda
py-xgboost 1.6.2 cpu_py310hd1aba9c_1 conda-forge
pyarrow 8.0.0 py310h468efa6_0
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pyopenssl 22.1.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.9 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.10.6 haa1d7c7_1
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.10 2_cp310 conda-forge
pytorch 1.12.1 py3.10_cuda11.3_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2022.5 pyhd8ed1ab_0 conda-forge
pyyaml 6.0 py310h5764c6d_4 conda-forge
re2 2022.04.01 h27087fc_0 conda-forge
readline 8.1.2 h7f8727e_1
requests 2.28.1 pyhd8ed1ab_1 conda-forge
s3transfer 0.6.0 py310h06a4308_0 anaconda
scikit-learn 1.1.2 py310h6a678d5_0
scipy 1.9.1 py310hd5efca6_0
setuptools 63.4.1 py310h06a4308_0
six 1.16.0 pyhd3eb1b0_1
snappy 1.1.9 hbd366e4_1 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
sqlalchemy 1.4.39 py310h5eee18b_0 anaconda
sqlite 3.39.3 h5082296_0
stopit 1.1.2 py_0 conda-forge
tblib 1.7.0 pyhd8ed1ab_0 conda-forge
threadpoolctl 3.1.0 pyh8a188c0_0 conda-forge
tk 8.6.12 h1ccaba5_0
toolz 0.12.0 pyhd8ed1ab_0 conda-forge
tornado 6.1 py310h5764c6d_3 conda-forge
tpot 0.11.7 pyhd8ed1ab_1 conda-forge
tqdm 4.64.1 pyhd8ed1ab_0 conda-forge
typing_extensions 4.4.0 pyha770c72_0 conda-forge
tzdata 2022e h04d1e81_0
update_checker 0.18.0 pyh9f0ad1d_0 conda-forge
urllib3 1.26.11 pyhd8ed1ab_0 conda-forge
utf8proc 2.6.1 h27cfd23_0
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.6 h5eee18b_0
yaml 0.2.5 h7f98852_2 conda-forge
zict 2.2.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 h5eee18b_0
zstd 1.5.2 ha4553b6_0

edubu2 commented 1 year ago

Log output thus far (it has been stuck on pipeline 55 since I checked this morning).

$ cat tpot_log.log
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 85.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 80.
_pre_test decorator: _random_mutation_operator: num_test=0 manhattan was provided as affinity. Ward can only work with euclidean distances..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 54.
Optimization Progress:   2%|▏         | 54/2550 [8:57:48<314:58:55, 454.30s/pipeline]
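
For context on the progress bar's denominator: TPOT evaluates an initial population plus one new population per generation, so with population_size=50 and generations=50 the total is 50 + 50 × 50 = 2550 pipeline evaluations, which matches the 54/2550 shown above. At the observed ~454 s per pipeline, the ~315-hour ETA follows directly.
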
edubu2 commented 1 year ago

Seems to be working now. Disregard.