autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License

Talos fails when 'early_stopper' or 'EarlyStopping' is added #313

Closed bjtho08 closed 5 years ago

bjtho08 commented 5 years ago

I've attempted to run a talos-wired Keras model in which I test different dropout rates and decay rates. When I add an early stopping callback, the talos Scan fails after the entire scan is complete. I'm running talos 0.5.0 and tensorflow-gpu 1.10.0 on Python 3.6. Below are the relevant code and the error traceback.

from talos.model.early_stopper import early_stopper
img_rows, img_cols, img_channels = (None, None, 3)

# architecture params

nb_filters_0 = 64
sigma_noise = 0.01
drop = 0
batchnorm = True
nb_filters_0 = 64
sigma_noise = 0.01
pretrain=2
shape = (img_rows, img_cols, img_channels)
batch_size = 16
nb_epoch = 200
verbose = 0

# Loss function

opt_name = "adam"  # choices:adadelta; sgd, rmsprop, adagrad, adam
loss_func = categorical_crossentropy  # mse, mae, binary_crossentropy, jaccard2_loss, categorical_crossentropy, tversky_loss
if opt_name == "sgd":
    opt = SGD(lr=0.1)
elif opt_name == "rmsprop":
    opt = RMSprop()
elif opt_name == "adagrad":
    opt = Adagrad()
elif opt_name == "adadelta":
    opt = Adadelta()
elif opt_name == "adam":
    opt = Adam(lr=1e-4, decay=0.1)
elif opt_name == "amsgrad":
    opt = Adam(lr=1e-4, amsgrad=True)
elif opt_name == "adamax":
    opt = Adamax()
elif opt_name == "nadam":
    opt = Nadam()
else:
    raise NameError("Wrong optimizer name")

# deep learning model

def talos_model(x, y, val_x, val_y, params):
    model = u_net((None, None, 3),
                  params['nb_filters_0'],
                  sigma_noise=params['sigma_noise'],
                  depth=4,
                  dropout=params['dropout'],
                  output_channels=params['num_cls'],
                  batchnorm=params['batchnorm'],
                  pretrain=params['pretrain'],
                 )
    model.compile(loss=params['loss_func'], optimizer=Adam(lr=1e-4, decay=params['decay']), metrics=["categorical_accuracy"])

    train_tiles = [
        osp.splitext(osp.basename(i))[0]
        for i in glob(osp.join(params['data_path'], params['train_path'], "*.tif"))
    ]
    val_tiles = [
        osp.splitext(osp.basename(i))[0]
        for i in glob(osp.join(params['data_path'], params['val_path'], "*.tif"))
    ]
    train_generator = DataGenerator(
        osp.join(params['data_path'], params['train_path']),
        params['colorvec'],
        params['train_m'],
        params['train_s'],
        train_tiles,
        batch_size=params['batch_size'],
        dim=(208, 208),
        n_channels=3,
        n_classes=params['num_cls'],
        shuffle=True,
        augmenter=True,
    )
    val_generator = DataGenerator(
        osp.join(params['data_path'], params['val_path']),
        params['colorvec'],
        params['train_m'],
        params['train_s'],
        val_tiles,
        batch_size=params['batch_size'],
        dim=(208, 208),
        n_channels=3,
        n_classes=params['num_cls'],
        shuffle=True,
        augmenter=True,
    )

    if not os.path.exists(osp.join(params['weight_path'], params['today_str'])):
        os.mkdir(osp.join(params['weight_path'], params['today_str']))

    # Build the checkpoint path from params, matching the directory created above
    modelpath = osp.join(
        params['weight_path'],
        params['today_str'],
        "talos_bn_U-net_model-{}epochs_batchsize_{}.loss_func_{}-weights.pickle".format(
            str(params['nb_epoch']),
            str(params['batch_size']),
            "categorical_crossentropy",
        ),
    )
    history = model.fit_generator(generator=train_generator,
                                  epochs=params['nb_epoch'],
                                  validation_data=val_generator,
                                  use_multiprocessing=True,
                                  workers=30,
                                  verbose=params['verbose'],
                                  callbacks=[
                                      TensorBoard(
                                          log_dir='./logs/decay_{}-drop_{}/'.format(
                                              params['decay'], params['dropout']
                                          ),
                                          histogram_freq=0,
                                          batch_size=params['batch_size'],
                                          write_graph=True,
                                          write_grads=False,
                                          write_images=True,
                                          embeddings_freq=0,
                                          update_freq=160),
                                      #EarlyStopping(monitor="loss", patience=10, verbose=1),
                                      early_stopper(params['nb_epoch'], mode=[0.0001, 5]),
                                      ReduceLROnPlateau(
                                          monitor="loss",
                                          factor=0.1,
                                          patience=3,
                                          min_lr=5e-7,
                                          verbose=1,
                                      ),
                                      ModelCheckpointLight(
                                          modelpath,
                                          verbose=1,
                                          monitor="loss",
                                          save_best_only=True,
                                      ),
                                  ]
                                 )
    return history, model

# talos fit params

today_str = str(datetime.date.today())
p = {'dropout': [0.0, 0.05, 0.1, 0.2],
     'decay': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
     'loss_func' : [categorical_crossentropy],
     'nb_filters_0' : [64],
     'sigma_noise' : [0.01],
     'num_cls' : [num_cls], # num_cls = 7
     'batchnorm' : [True], # use batch normalization
     'pretrain' : [2],
     'nb_epoch' : [200],
     'verbose' : [1],
     'batch_size' : [16],
     'data_path' : [data_path], # path string
     'train_path' : [train_path], # path string
     'val_path' : [val_path], # path string
     'colorvec' : [colorvec], # list of list of ints
     'train_m' : [train_m], # list of floats for normalization
     'train_s' : [train_s], # list of floats for normalization
     'weight_path' : ["./weights/"],
     'today_str' : [today_str],
    }

weight_path = "./weights/"
dummy_x = np.empty((1, batch_size, 208, 208))
dummy_y = np.empty((1, batch_size))

t = ta.Scan(x=dummy_x,
            y=dummy_y,
            model=talos_model,
            #functional_model=True,
            #grid_downsample=0.1, 
            params=p)

Traceback

ValueError                                Traceback (most recent call last)
<ipython-input-7-3b0ec144b1ad> in <module>()
    164             #functional_model=True,
    165             #grid_downsample=0.1,
--> 166             params=p)
    167 
    168 print(model.summary(line_length=124))

~/.pyenv/versions/3.6.0/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, dataset_name, experiment_no, experiment_name, x_val, y_val, val_split, shuffle, round_limit, time_limit, grid_downsample, random_method, seed, search_method, permutation_filter, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, reduce_loss, last_epoch_value, clear_tf_session, disable_progress_bar, print_params, debug)
    183         # input parameters section ends
    184 
--> 185         self._null = self.runtime()
    186 
    187     def runtime(self):

~/.pyenv/versions/3.6.0/lib/python3.6/site-packages/talos/scan/Scan.py in runtime(self)
    188 
    189         self = scan_prepare(self)
--> 190         self = scan_run(self)

~/.pyenv/versions/3.6.0/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
     30     self.peak_epochs_df = peak_epochs_todf(self)
     31 
---> 32     self = scan_finish(self)

~/.pyenv/versions/3.6.0/lib/python3.6/site-packages/talos/scan/scan_finish.py in scan_finish(self)
     28     # clean the results into a dataframe
     29     self.data = self.result[self.result.columns[0]].str.split(',', expand=True)
---> 30     self.data.columns = self.result.columns[0].split(',')
     31 
     32     # remove redundant columns

~/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   4387         try:
   4388             object.__getattribute__(self, name)
-> 4389             return object.__setattr__(self, name, value)
   4390         except AttributeError:
   4391             pass

pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

~/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    644 
    645     def _set_axis(self, axis, labels):
--> 646         self._data.set_axis(axis, labels)
    647         self._clear_item_cache()
    648 

~/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   3321             raise ValueError(
   3322                 'Length mismatch: Expected axis has {old} elements, new '
-> 3323                 'values have {new} elements'.format(old=old_len, new=new_len))
   3324 
   3325         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 29 elements, new values have 25 elements
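The last two talos frames in the traceback split a single comma-joined results column and then assign column labels taken from the comma-joined header. The sketch below reproduces that mechanism in isolation with pandas. The frame contents and names (`header`, `result`) are hypothetical stand-ins, and the extra comma-separated value is only one way such a mismatch can arise; it is not a confirmed diagnosis of this issue.

```python
import pandas as pd

# Hypothetical stand-in for the single-column results frame: the header
# names 3 fields, but one row carries an extra comma-separated value.
header = "loss,acc,dropout"
result = pd.DataFrame({header: ["0.51,0.80,0.1", "0.47,0.82,0.2,extra"]})

# The same two steps as scan_finish.py lines 29-30 in the traceback.
data = result[result.columns[0]].str.split(",", expand=True)
labels = result.columns[0].split(",")

try:
    # 4 columns after the split, but only 3 labels to assign.
    data.columns = labels
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

Comparing `data.shape[1]` with `len(labels)` before the assignment shows the same discrepancy that the error message reports.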

mikkokotila commented 5 years ago

Thanks a lot for the nicely laid out report! :)

Have you checked #304? I think the same suggestion applies to your case, i.e. try v.0.6. Could you try it and update the ticket accordingly? Thanks, and have a nice day ahead as well.

bjtho08 commented 5 years ago

@mikkokotila I am testing it right now with 0.6.0, but there is still about 14 hours left before it is done (in hindsight, I should have reduced the parameter space before starting it again).
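For a sense of scale, here is a rough sketch of how the size of the parameter space can be counted. It assumes an exhaustive grid with no downsampling (`grid_downsample` is commented out in the original post) and copies only the two multi-valued entries of `p`. The reduced value lists are hypothetical choices for illustration.

```python
from itertools import product

# Only the two multi-valued entries of the parameter dictionary from the
# original post; every other key there holds a single value and so does
# not multiply the number of runs.
grid = {
    "dropout": [0.0, 0.05, 0.1, 0.2],
    "decay": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}

# A full grid search visits every combination of the listed values.
combinations = list(product(*grid.values()))
n_runs = len(combinations)
print(n_runs)

# Hypothetical reduction: two values per parameter gives 2 * 2 runs.
reduced = {"dropout": [0.0, 0.1], "decay": [0.0, 0.1]}
n_reduced = len(list(product(*reduced.values())))
print(n_reduced)
```

Under that assumption, the original dictionary gives 4 × 6 = 24 training runs of up to 200 epochs each, and the reduced one gives 4.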

mikkokotila commented 5 years ago

Thanks for the update. And no worries, I think you should be OK with v.0.6.

bjtho08 commented 5 years ago

It worked perfectly! However, I noticed that the live() callback has been removed, but according to autonomio.github.io it should be there? Also, there are a lot of inconsistencies between the parameters listed in the docstrings and the parameters that are actually accepted, across all branches. Maybe a clean-up is warranted?

mikkokotila commented 5 years ago

Thanks a lot for the update. Yes, a cleanup will be done before this makes it to the production cycle (replacing daily-dev branch initially). Thanks for pointing that out.

> However, I noticed that the live() callback has been removed, but according to autonomio.github.io it should be there?

Let's see how this will be handled in v.0.6. For now you can just import it from the kerasplotlib package and it will work.

Closing here. Feel free to open a new issue if anything comes up.