EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

AutoML Benchmark: why is TPOT so bad? #1100

Open Alex-Lekov opened 3 years ago

Alex-Lekov commented 3 years ago

I benchmarked several AutoML libraries, and TPOT showed very poor results, even worse than plain CatBoost with default parameters! https://github.com/Alex-Lekov/AutoML-Benchmark/ I run the benchmark in Docker, so you can easily reproduce it.

Here is the code from the benchmark:

from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

# TIME_LIMIT and RANDOM_SEED are defined elsewhere in the benchmark.
automl = TPOTClassifier(max_time_mins=(TIME_LIMIT // 60),
                        scoring='roc_auc',
                        verbosity=1,
                        random_state=RANDOM_SEED)

automl.fit(X_train, y_train)

# The winning pipeline's final estimator may not support predict_proba,
# so fall back to hard class predictions.
try:
    predictions = automl.predict_proba(X_test)
except RuntimeError:
    predictions = automl.predict(X_test)

# TPOT returns predictions in a different shape depending on the final estimator :(
try:
    y_test_predict_proba = predictions[:, 1]
except IndexError:
    y_test_predict_proba = predictions

y_test_predict = automl.predict(X_test)

print('AUC: ', roc_auc_score(y_test, y_test_predict_proba))
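
(As a side note, here is a sketch of a more uniform way to pull out the positive-class scores, assuming TPOT's documented fitted_pipeline_ attribute and a scikit-learn-style classes_ on the exported pipeline; it is not part of the original benchmark code, and it assumes the positive label is 1.)

# Sketch: obtain a 1-D array of positive-class scores for roc_auc_score.
# Assumes `automl` and `X_test` are defined as above.
try:
    proba = automl.predict_proba(X_test)
    # Column index of the positive label (assumed here to be 1).
    pos_col = list(automl.fitted_pipeline_.classes_).index(1)
    y_test_predict_proba = proba[:, pos_col]
except RuntimeError:
    # The winning pipeline's final estimator has no predict_proba;
    # fall back to hard class predictions (a weaker AUC estimate).
    y_test_predict_proba = automl.predict(X_test)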

Is the code correct? (I did not tune any advanced parameters, since an AutoML tool should, in theory, pick everything up by itself; that is the point of AutoML.)

If we specify scoring='roc_auc', is it guaranteed to actually optimize for AUC?

Please tell me what I am doing wrong. Am I using the library incorrectly, or is this a genuine result and the library really performs this poorly?

weixuanfu commented 3 years ago

I had a quick look at your benchmark. The time limit is 1 hour, but some of the datasets have over 40k instances, so TPOT may not get past the initial generation (which consists of randomly generated pipelines) and therefore never optimizes pipelines via genetic programming. I suggest increasing the time limit to 1 day for each large dataset if n_jobs=1, or using parallel training with Dask.
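
A minimal sketch of that suggestion, assuming TPOT's documented n_jobs and use_dask options together with a dask.distributed client; the time budget and worker counts below are illustrative, not values from this thread:

from tpot import TPOTClassifier
from dask.distributed import Client

# Start a local Dask scheduler; worker counts here are illustrative only.
client = Client(n_workers=4, threads_per_worker=1)

automl = TPOTClassifier(
    max_time_mins=24 * 60,   # roughly 1 day instead of 1 hour
    scoring='roc_auc',
    n_jobs=-1,               # evaluate candidate pipelines in parallel
    use_dask=True,           # hand pipeline evaluation to the Dask client
    verbosity=1,
    random_state=RANDOM_SEED,
)
automl.fit(X_train, y_train)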