autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License

ValueError: Length mismatch: Expected axis has 21 elements, new values have 11 elements #335

Closed bjtho08 closed 5 years ago

bjtho08 commented 5 years ago

Bug description

Talos fails after completing all rounds of the parameter search: when the results are collected into a DataFrame, the number of header labels does not match the number of values in each result row.

Output of shape for x and y

X.shape = (16, 208, 208, 3)
Y.shape = (16, 208, 208, 10)

Talos params dictionary

p = {
    "dropout": [0],
    "decay": [0.0],
    "lr": [1e-4],
    "sigma_noise": [0],
    #"pretrain": [2, 0],
    #"class_weights": [False, True],
    "loss_func": [tversky_loss]
}

The Keras model wired for Talos

def talos_presets(weight_path, cls_wgts, static_params, train_generator, val_generator):
    """Initialize a talos model object for hyper-parameter search

    :param weight_path: Path to the base weight folder
    :type weight_path: str
    :param cls_wgts: A mapping from class index to the weight applied
        to that class, or None
    :type cls_wgts: None, or Dict of floats
    :param static_params: Dictionary of fixed parameters for the model
    :type static_params: Dict
    :param train_generator: Generator function for training data
    :type train_generator: Class
    :param val_generator: Generator function for validation data
    :type val_generator: Class
    """
    def talos_model(x, y, val_x, val_y, params):
        """Talos model setup

        :param x: Dummy input needed for talos framework
        :type x: Array-like
        :param y: Dummy input needed for talos framework
        :type y: Array-like
        :param val_x: Dummy input needed for talos framework
        :type val_x: Array-like
        :param val_y: Dummy input needed for talos framework
        :type val_y: Array-like
        :param params: Hyperparameters supplied by talos
        :type params: Dict
        """
        # Dummy inputs
        _ = x, y, val_x, val_y
        params.update(static_params)
        if params["loss_func"] == "cat_CE":
            loss_func = categorical_crossentropy
        elif params["loss_func"] == "cat_FL":
            cat_focal_loss = categorical_focal_loss()
            loss_func = cat_focal_loss
        elif hasattr(params["loss_func"], '__call__'):
            loss_func = params["loss_func"]
        else:
            raise ValueError("Unknown loss function: {}".format(params["loss_func"]))
        # mse, mae, binary_crossentropy, jaccard2_loss, categorical_crossentropy,
        # tversky_loss, categorical_focal_loss
        if params["class_weights"] is False:
            class_weights = [1 if k != 12 else 0 for k in cls_wgts.keys()]
        else:
            class_weights = ([v for v in cls_wgts.values()],)
        try:
            loss_name = params["loss_func"].__name__
        except AttributeError:
            loss_name = params["loss_func"].__str__()
        model_base_path = osp.join(
            weight_path,
            params["today_str"],
            "{}-{}epochs-bs_{}".format(
                loss_name,
                str(params["nb_epoch"]),
                str(params["batch_size"]),
            ))

        if not os.path.exists(model_base_path):
            os.makedirs(model_base_path, exist_ok=True)

        modelpath = osp.join(
            model_base_path,
            "talos_U-net_model-"
            + "decay_{}-drop_{}-weights_{}-pretrain_{}-sigma_{}.h5".format(
                params["decay"],
                params["dropout"],
                params["class_weights"],
                params["pretrain"],
                params["sigma_noise"],
            ),
        )
        log_path = (
            "./logs/"
            + "{}/lossfunc_{}/decay_{}-drop_{}-weights_{}-pretrain_{}-sigma_{}/".format(
                params["today_str"],
                loss_name,
                params["decay"],
                params["dropout"],
                params["class_weights"],
                params["pretrain"],
                params["sigma_noise"],
            )
        )

        if params["pretrain"] != 0:
            print(
                "starting with frozen layers\nclass weights: {}".format(class_weights)
            )
            model = u_net(
                params["shape"],
                int(params["nb_filters_0"]),
                sigma_noise=params["sigma_noise"],
                depth=4,
                dropout=params["dropout"],
                output_channels=params["num_cls"],
                batchnorm=params["batchnorm"],
                pretrain=params["pretrain"],
            )
            model.compile(
                loss=loss_func,
                optimizer=Adam(lr=params["lr"], decay=params["decay"]),
                metrics=["acc"],
            )

            history = model.fit_generator(
                generator=train_generator,
                epochs=10,
                validation_data=val_generator,
                use_multiprocessing=True,
                workers=30,
                class_weight=class_weights,
                verbose=params["verbose"],
            )

            pretrain_layers = [
                "block{}_conv{}".format(block, layer)
                for block in range(1, params["pretrain"] + 1)
                for layer in range(1, 3)
            ]
            for n in pretrain_layers:
                model.get_layer(name=n).trainable = True
            print("layers unfrozen\n")

            model.compile(
                loss=loss_func,
                optimizer=Adam(lr=params["lr"], decay=params["decay"]),
                metrics=["acc"],
            )

            history = model.fit_generator(
                generator=train_generator,
                epochs=params["nb_epoch"],
                validation_data=val_generator,
                use_multiprocessing=True,
                workers=30,
                class_weight=class_weights,
                verbose=params["verbose"],
                callbacks=[
                    TQDMNotebookCallback(
                        metric_format="{name}: {value:0.4f}",
                        leave_inner=True,
                        leave_outer=True,
                    ),
                    TensorBoard(
                        log_dir=log_path,
                        histogram_freq=0,
                        batch_size=params["batch_size"],
                        write_graph=True,
                        write_grads=False,
                        write_images=True,
                        embeddings_freq=0,
                        update_freq="epoch",
                    ),
                    EarlyStopping(
                        monitor="loss",
                        min_delta=0.0001,
                        patience=10,
                        verbose=0,
                        mode="auto",
                    ),
                    ReduceLROnPlateau(
                        monitor="loss", factor=0.1, patience=3, min_lr=1e-7, verbose=1
                    ),
                    PatchedModelCheckpoint(
                        modelpath, verbose=0, monitor="loss", save_best_only=True
                    ),
                ],
            )
        else:
            print("No layers frozen at start\nclass weights: {}".format(class_weights))
            model = u_net(
                params["shape"],
                int(params["nb_filters_0"]),
                sigma_noise=params["sigma_noise"],
                depth=4,
                dropout=params["dropout"],
                output_channels=params["num_cls"],
                batchnorm=params["batchnorm"],
                pretrain=params["pretrain"],
            )

            model.compile(
                loss=loss_func,
                optimizer=Adam(lr=params["lr"], decay=params["decay"]),
                metrics=["acc"],
            )

            history = model.fit_generator(
                generator=train_generator,
                epochs=params["nb_epoch"],
                validation_data=val_generator,
                use_multiprocessing=True,
                workers=30,
                class_weight=class_weights,
                verbose=params["verbose"],
                callbacks=[
                    TQDMNotebookCallback(
                        metric_format="{name}: {value:0.4f}",
                        leave_inner=True,
                        leave_outer=True,
                    ),
                    TensorBoard(
                        log_dir=log_path,
                        histogram_freq=0,
                        batch_size=params["batch_size"],
                        write_graph=True,
                        write_grads=False,
                        write_images=True,
                        embeddings_freq=0,
                        update_freq="epoch",
                    ),
                    EarlyStopping(
                        monitor="loss",
                        min_delta=0.0001,
                        patience=10,
                        verbose=0,
                        mode="auto",
                    ),
                    ReduceLROnPlateau(
                        monitor="loss", factor=0.1, patience=3, min_lr=1e-7, verbose=1
                    ),
                    PatchedModelCheckpoint(
                        modelpath, verbose=0, monitor="loss", save_best_only=True
                    ),
                ],
            )
        return history, model
    return talos_model

Traceback

-------------------------------------------------------------------------
100%|██████████| 1/1 [1:43:42<00:00, 6222.81s/it]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-70-acd959206821> in <module>()
     83     # functional_model=True,
     84     # grid_downsample=0.1,
---> 85     params=p,
     86 )
     87 

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, experiment_name, x_val, y_val, val_split, random_method, performance_target, fraction_limit, round_limit, time_limit, boolean_limit, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, minimize_loss, seed, clear_session, disable_progress_bar, print_params, debug)
    170         # input parameters section ends
    171 
--> 172         self.runtime()
    173 
    174     def runtime(self):

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/talos/scan/Scan.py in runtime(self)
    175 
    176         from .scan_run import scan_run
--> 177         self = scan_run(self)

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
     32     # finish
     33     from ..logging.logging_finish import logging_finish
---> 34     self = logging_finish(self)
     35 
     36     from .scan_finish import scan_finish

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/talos/logging/logging_finish.py in logging_finish(self)
      4 
      5     # save the results
----> 6     self = result_todf(self)
      7 
      8     return self

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/talos/logging/results.py in result_todf(self)
     46     cols = self.result[0]
     47     self.result = pd.DataFrame(self.result[1:])
---> 48     self.result.columns = cols
     49 
     50     return self

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   4383         try:
   4384             object.__getattribute__(self, name)
-> 4385             return object.__setattr__(self, name, value)
   4386         except AttributeError:
   4387             pass

pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    643 
    644     def _set_axis(self, axis, labels):
--> 645         self._data.set_axis(axis, labels)
    646         self._clear_item_cache()
    647 

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   3321             raise ValueError(
   3322                 'Length mismatch: Expected axis has {old} elements, new '
-> 3323                 'values have {new} elements'.format(old=old_len, new=new_len))
   3324 
   3325         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 21 elements, new values have 11 elements
mikkokotila commented 5 years ago

Can you try the v0.6 version:

pip install git+http://github.com/autonomio/talos@daily-dev

The logging aspect is entirely rebuilt in it, and these kinds of issues should no longer appear.

bjtho08 commented 5 years ago

@mikkokotila I am already running 0.6, which I installed after the latest merge from params-api. Did you make significant changes to the logging in the last month or so?

mikkokotila commented 5 years ago

I did not. Thanks for the clarification. I'll try to look into this tomorrow.

mikkokotila commented 5 years ago

In the meantime, can you share your experiment log, or at least the first rows of it?

bjtho08 commented 5 years ago

Sorry, I am not sure what you are referring to. Do you mean the output during scanning?

mikkokotila commented 5 years ago

On the machine where you are working, there is a .csv file being updated on each permutation throughout the experiment; it is identical to what is stored in the scan_object at the end. That one.

bjtho08 commented 5 years ago

Here it is (I renamed it to .txt because GitHub complained about the .csv format): 062519213324.csv.txt

bjtho08 commented 5 years ago

Looking at the log, something odd is going on in the first and last columns. Based on the contents of the bottom row, the last column looks like it should have been 11 individual columns, but I don't see why. Could it be related to how I pass static and dynamic parameters into the model? I wrapped my model in a nested function to work around some technical difficulties with generators, and at the beginning of the inner function I update the params dictionary with all the static parameters passed to the outer function. That lets me freely switch parameters between being part of the search space and being static, without the static ones taking up unnecessary space in the log file or cluttering the correlation matrix.

bjtho08 commented 5 years ago

I've taken a closer look at the log file, and it is indeed a concatenation of the static and grid params, but there is also something odd about the header row. One of the grid param keys is missing from the header, which means that even if I removed the concatenation step, it would still have resulted in an error. How is the header row generated? I would have assumed it reads params.keys(), but then the missing header ("class_weights") should not be possible. I hope you have enough information to find a solution. I am thinking of creating an internal dict in the inner function and updating it with both the params and static_params dicts, to preserve the original params dict.

mikkokotila commented 5 years ago

Just to be safe, can you do:

import talos
talos.__version__

Also, can you run a minimal example for a sanity check:

import talos as ta
from keras.models import Sequential
from keras.layers import Dense

def minimal():

    x, y = ta.templates.datasets.iris()

    p = {'activation':['relu', 'elu'],
         'optimizer': ['Nadam', 'Adam'],
         'losses': ['logcosh'],
         'hidden_layers':[0, 1, 2],
         'batch_size': [20,30,40],
         'epochs': [10,20]}

    def iris_model(x_train, y_train, x_val, y_val, params):

        model = Sequential()
        model.add(Dense(32, input_dim=4, activation=params['activation']))
        model.add(Dense(3, activation='softmax'))
        model.compile(optimizer=params['optimizer'], loss=params['losses'])

        out = model.fit(x_train, y_train,
                         batch_size=params['batch_size'],
                         epochs=params['epochs'],
                         validation_data=[x_val, y_val],
                         verbose=0)

        return out, model

    scan_object = ta.Scan(x, y, model=iris_model, params=p, fraction_limit=0.1)

    return scan_object

minimal()

Let's see if the same issue persists.

mikkokotila commented 5 years ago

I can see that the parameter dictionary you posted above does not match the model or the expected output you mention. Are you adding parameters after starting the experiment? The header columns are based on reading the parameter dictionary at the beginning of the experiment, so that would cause a problem.

bjtho08 commented 5 years ago

Here is the requested version output:

Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import talos
/home/bjarne/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> talos.__version__
'0.6.0'

I am ssh'ing from my phone, so the sanity check will have to wait until later.

Regarding the parameters: as I mentioned, I am combining two dictionaries inside the model function, both to minimize rewriting code when I change which parameters to test and to keep the number of tracked parameters to a minimum, so that static parameters do not clutter the correlation matrix.

bjtho08 commented 5 years ago

I ran your MWE and it completed without error. The output csv is here: 062719235236.csv.txt

mikkokotila commented 5 years ago

Regarding the parameters: as I mentioned, I am combining two dictionaries inside the model function, both to minimize rewriting code when I change which parameters to test and to keep the number of tracked parameters to a minimum, so that static parameters do not clutter the correlation matrix.

That will be the cause. Talos creates a record of the parameters at the beginning of the scan and then uses that record throughout the pipeline to ensure consistency. If you load all your parameters into the params dictionary, as the intended use assumes, you will no longer run into this.
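
To illustrate with a rough sketch (made-up names and numbers; this is not the actual Talos source):

import pandas as pd

# Column names are recorded from the params dict when the scan starts
params = {"dropout": 0, "lr": 1e-4}
header = ["round_epochs", "loss", "acc"] + list(params.keys())  # 5 names

# If the model function mutates params, later result rows grow longer
params.update({"batch_size": 8, "nb_epoch": 50})
row = [50, 0.31, 0.90] + list(params.values())  # 7 values

results = pd.DataFrame([row])
results.columns = header  # ValueError: Length mismatch: Expected axis has
                          # 7 elements, new values have 5 elements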

It seems you have a justification for doing things otherwise, so I think we can explore that as a feature request. The best way would be to create a new ticket for that specific request and provide an example based on the minimal() above, altered in a way that makes your use case clear.

Regarding this ticket, I'll leave it open for a couple of days just in case, but it seems the issue itself is resolved by a) using the intended approach and b) escalating the new use as a feature request.

Thanks a lot for working together to figure this out. Have a nice day ahead too :)

bjtho08 commented 5 years ago

I decided to rewrite the function so that all parameters are copied into an internal dictionary at the very start of the inner function, which avoids tampering with the params dict supplied by talos.Scan. This fixed the issue. For reference, the relevant changes are:

from collections import OrderedDict

def talos_presets(weight_path, cls_wgts, static_params, train_generator, val_generator):

    ...

    def talos_model(x, y, val_x, val_y, params):

        ...

        # Dummy inputs
        _ = x, y, val_x, val_y
        internal_params = OrderedDict()
        internal_params.update(static_params)
        internal_params.update(params)

        ...

        return history, model
    return talos_model
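
(An equivalent non-mutating merge on Python 3.5+ would be internal_params = {**static_params, **params}.)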

Thanks for helping me with the bug hunt! It's nice to work with committed devs! Have a great day :)

mikkokotila commented 5 years ago

Thanks to you too :) I'm glad you got it sorted out. And likewise, have a nice day!

ebolotin6 commented 5 years ago

I ran into this issue today. I was passing input_shape and num_classes into my params dict (with the goal of consolidating all model-related variables in one place), which turned out to be the cause. After removing these variables from the params dict, the error went away.
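
A pattern that avoids this entirely is to keep fixed, model-only values out of the params dict and pass them in through a closure instead. A minimal sketch using the iris example from above (build_model_fn and its arguments are illustrative names, not Talos API):

import talos as ta
from keras.models import Sequential
from keras.layers import Dense

def build_model_fn(input_dim, num_classes):
    # Fixed values live in the closure, so the params dict that Talos
    # records at scan start never changes shape during the experiment.
    def model_fn(x_train, y_train, x_val, y_val, params):
        model = Sequential()
        model.add(Dense(32, input_dim=input_dim, activation='relu'))
        model.add(Dense(num_classes, activation='softmax'))
        model.compile(optimizer=params['optimizer'],
                      loss='categorical_crossentropy')
        out = model.fit(x_train, y_train,
                        batch_size=params['batch_size'],
                        epochs=params['epochs'],
                        validation_data=[x_val, y_val],
                        verbose=0)
        return out, model
    return model_fn

x, y = ta.templates.datasets.iris()
p = {'optimizer': ['Adam', 'Nadam'], 'batch_size': [20, 30], 'epochs': [10]}
ta.Scan(x, y, model=build_model_fn(input_dim=4, num_classes=3), params=p)

Only the searched parameters live in p, so the recorded header and the result rows stay the same width.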