Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

[Keras] Failed to restore model when using separate encoders #119

Closed JaCoderX closed 4 years ago

JaCoderX commented 4 years ago

@Kismuz,

I have recently tried to play around with using separate encoders (following suggestions from #35)

I took a working encoder and modified 'external' according to the suggestions. The model ran with no errors.

But after I stop training, I get the following error when trying to continue training:

    NOTICE: Worker_0: no saved model parameters found in:
    INFO:tensorflow:Restoring parameters from ../current_train_checkpoint/model_parameters-1425998
    W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key PPO/global/encoded_external_1/conv1d_1/bias/Adam not found in checkpoint
    NOTICE: Worker_0: failed to restore model parameters from: ../current_train_checkpoint
    NOTICE: Worker_0: training from scratch...

At first I checked the graph on TensorBoard to see if 'conv1d_1' was under 'encoded_external_1', but it wasn't. So I used an OrderedDict to fix the mismatch, but unfortunately it didn't solve the failure to restore.

    # sub-spaces as an ordered list of (name, space) pairs, wrapped in an
    # OrderedDict to keep a deterministic key order:
    external = [
        ('1', spaces.Box(low=-10, high=10, shape=(128, 1, num_features), dtype=np.float32), ),
        ('2', spaces.Box(low=-10, high=10, shape=(128, 1, num_features), dtype=np.float32), ),
        ('3', spaces.Box(low=-10, high=10, shape=(128, 1, num_features), dtype=np.float32), ),
    ]
    external = OrderedDict(external)

    params = dict(
        state_shape={
            'external': DictSpace(external),

Any suggestions on how to address this issue?

Kismuz commented 4 years ago

@JacobHanouna, the message says it failed to find a gradient op for the encoder, but it is quite hard to judge without explicit inspection and comparison of graph ops before and after restore.
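
A minimal TF 1.x sketch of such an inspection (run after the graph has been built; the checkpoint path is taken from the log above) would be to diff the graph's variable names against the keys stored in the checkpoint:

    import tensorflow as tf

    # checkpoint path taken from the error log above
    ckpt_path = '../current_train_checkpoint/model_parameters-1425998'

    # keys (and shapes) actually stored in the checkpoint file
    ckpt_keys = {name for name, shape in tf.train.list_variables(ckpt_path)}

    # variable names the current graph expects to restore (':0' suffix stripped)
    graph_keys = {v.name.split(':')[0] for v in tf.global_variables()}

    print('in graph but not in checkpoint:', sorted(graph_keys - ckpt_keys))
    print('in checkpoint but not in graph:', sorted(ckpt_keys - graph_keys))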

JaCoderX commented 4 years ago

@Kismuz, I think I found what is causing this issue (more or less). I was working on a new design for the encoder and used Keras to implement it alongside the existing network. It worked with no issues on a single-encoder network, but when I switched to the separate-encoders network I get the reported problem.

So technically this isn't a BTGym problem, but Keras is really easy and fun to work with, and except for this issue I find that it integrates quite well with BTGym.

Is there a way to fix it easily? Would upgrading to TensorFlow 2.0 solve the issue?

Kismuz commented 4 years ago

Maybe explicitly wrapping every Keras-based encoder in its own tf.variable_scope with the reuse=False option will do.
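
For example, a rough sketch of that wrapping (the stream dictionary and the encoder builder function are hypothetical placeholders):

    # hypothetical sketch: give each keras-based encoder its own variable scope
    # so its layer variables (and their Adam slots) get distinct, stable names
    encoded = {}
    for key, stream in external_streams.items():        # e.g. keys '1', '2', '3'
        with tf.variable_scope('encoded_external_{}'.format(key), reuse=False):
            encoded[key] = build_keras_encoder(stream)  # hypothetical builder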

JaCoderX commented 4 years ago

I gave it a try, but it didn't show any positive results.

Kismuz commented 4 years ago

Can you post a relevant piece of code to reproduce the error?

JaCoderX commented 4 years ago

In this example, I use the following Keras implementation to perform skip connections in the encoder part.

def conv_1d_casual_encoder
...
...
            y_skip_conv1d = y

            y = tf.reshape(y, [-1, conv_1d_filter_size, channels], name='layer_{}_t2b'.format(i))

            y = conv1d(
                x=y,
                num_filters=conv_1d_num_filters,
                filter_size=conv_1d_filter_size,
                stride=1,
                pad='VALID',
                name='conv1d_layer_{}'.format(i)
            )

            y = tf.reshape(y, [-1, num_time_batches, conv_1d_num_filters], name='layer_{}_output'.format(i))

            y = norm_layer(y)

            y = skip_connection(input=y_skip_conv1d, residual=y)
...
...
# imports assumed by this snippet (tf.keras equivalents would work the same way);
# tensorflow is already imported as tf in the surrounding module:
from keras import backend as K
from keras.layers import Conv1D, add
from keras.regularizers import l2

def skip_connection(input, residual):
    """Adds a shortcut between input and residual block and merges them with "sum"
    """
    # Expand channels of shortcut to match residual.
    # Stride appropriately to match residual (width, height)
    # Should be int if network architecture is correctly configured.
    with tf.variable_scope(name_or_scope='SkipConn', reuse=False):
        ROW_AXIS = 1
        CHANNEL_AXIS = 2

        input_shape = K.int_shape(input)
        residual_shape = K.int_shape(residual)
        stride = int(round(input_shape[ROW_AXIS] / residual_shape[ROW_AXIS]))
        equal_channels = input_shape[CHANNEL_AXIS] == residual_shape[CHANNEL_AXIS]

        shortcut = input
        # 1 X 1 conv if shape is different. Else identity.
        if stride > 1 or not equal_channels:
            shortcut = Conv1D(filters=residual_shape[CHANNEL_AXIS],
                              kernel_size=1,
                              strides=stride,
                              padding="valid",
                              kernel_initializer="he_normal",
                              kernel_regularizer=l2(0.0001)
                              )(input)

        return add([shortcut, residual])

Kismuz commented 4 years ago

The point was that name scopes should be different for every encoder:

def skip_connection(input, residual, name):
    ...
    with tf.variable_scope(name_or_scope='SkipConn_{}'.format(name), reuse=False):
        ...
        ...

calling it:

y = skip_connection(input=y_skip_conv1d, residual=y, name=str(i))

JaCoderX commented 4 years ago

@Kismuz, I followed your suggestion but still got the same issue.

Then I tried changing the Keras layer name directly, and it seems to solve the issue. So to conclude, a fix for the above example would be to simply add a name (it doesn't even need to be unique, as Keras takes care of that automatically):

...
shortcut = Conv1D(filters=residual_shape[CHANNEL_AXIS],
                  kernel_size=1,
                  strides=stride,
                  padding="valid",
                  kernel_initializer="he_normal",
                  kernel_regularizer=l2(0.0001),
                  name='SkipConn'
                  )(input)
...
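
If explicit, distinct names are preferred over Keras' automatic de-duplication, a possible variant (hypothetical, combining this with the per-encoder name argument suggested above) would be:

    # hypothetical variant: derive the layer name from the per-encoder `name`
    # argument so each skip connection is named explicitly
    shortcut = Conv1D(filters=residual_shape[CHANNEL_AXIS],
                      kernel_size=1,
                      strides=stride,
                      padding="valid",
                      kernel_initializer="he_normal",
                      kernel_regularizer=l2(0.0001),
                      name='SkipConn_{}'.format(name)
                      )(input)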