@JacobHanouna, the message says it failed to find a gradient op for the encoder, but it is quite hard to judge without explicit inspection and comparison of the graph ops before and after restore.
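For example, a minimal TF 1.x sketch for dumping op names for such a comparison (run once in the session that builds the graph and once in the session that restores it, then diff the two lists):

import tensorflow as tf

# Collect all op names from the current default graph for later comparison:
op_names = sorted(op.name for op in tf.get_default_graph().get_operations())
for name in op_names:
    print(name)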
@Kismuz, I think I found what is causing this issue (more or less). I was working on a new design for the encoder and used Keras to implement it alongside the existing network. It worked with no issues when I worked on a single encoder network, but when I switched to separate encoder networks I got the reported problem.
So technically this isn't a BTGym problem, but Keras is really easy and fun to work with, and except for this issue I find that it integrates quite well with BTGym.
Is there a way to fix it easily? Would upgrading to TensorFlow 2.0 solve the issue?
Maybe explicitly wrapping every Keras-based encoder in its own tf.variable_scope with the reuse=False option will do.
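A minimal sketch of that idea (build_keras_encoder and the input tensors are hypothetical, just to illustrate the scoping):

import tensorflow as tf

# One distinct variable scope per encoder, so no variables are shared by accident:
with tf.variable_scope('encoder_external', reuse=False):
    encoded_external = build_keras_encoder(x_external)  # hypothetical builder
with tf.variable_scope('encoder_internal', reuse=False):
    encoded_internal = build_keras_encoder(x_internal)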
I gave it a try but it didn't show any positive results.
Can you post a relevant piece of code to reproduce the error?
In this example, I use the following Keras implementation to perform a skip connection in the encoder part.
def conv_1d_casual_encoder(...):
    ...
    ...
    # Keep a reference to the block input for the skip connection below:
    y_skip_conv1d = y
    y = tf.reshape(y, [-1, conv_1d_filter_size, channels], name='layer_{}_t2b'.format(i))
    y = conv1d(
        x=y,
        num_filters=conv_1d_num_filters,
        filter_size=conv_1d_filter_size,
        stride=1,
        pad='VALID',
        name='conv1d_layer_{}'.format(i)
    )
    y = tf.reshape(y, [-1, num_time_batches, conv_1d_num_filters], name='layer_{}_output'.format(i))
    y = norm_layer(y)
    y = skip_connection(input=y_skip_conv1d, residual=y)
    ...
    ...
def skip_connection(input, residual):
    """Adds a shortcut between input and residual block and merges them with 'sum'.

    Requires:
        from keras import backend as K
        from keras.layers import Conv1D, add
        from keras.regularizers import l2
    """
    # Expand channels of the shortcut to match the residual.
    # Stride appropriately to match the residual (width, height).
    # Should be int if the network architecture is correctly configured.
    with tf.variable_scope(name_or_scope='SkipConn', reuse=False):
        ROW_AXIS = 1
        CHANNEL_AXIS = 2
        input_shape = K.int_shape(input)
        residual_shape = K.int_shape(residual)
        stride = int(round(input_shape[ROW_AXIS] / residual_shape[ROW_AXIS]))
        equal_channels = input_shape[CHANNEL_AXIS] == residual_shape[CHANNEL_AXIS]
        shortcut = input
        # 1 x 1 conv if shapes differ, else identity.
        if stride > 1 or not equal_channels:
            shortcut = Conv1D(
                filters=residual_shape[CHANNEL_AXIS],
                kernel_size=1,
                strides=stride,
                padding='valid',
                kernel_initializer='he_normal',
                kernel_regularizer=l2(0.0001)
            )(input)
        return add([shortcut, residual])
The point was that name scopes should be different for every encoder:
def skip_connection(input, residual, name):
    ...
    with tf.variable_scope(name_or_scope='SkipConn_{}'.format(name), reuse=False):
        ...
        ...
Calling it:
y = skip_connection(input=y_skip_conv1d, residual=y, name=str(i))
@Kismuz, I followed your suggestion but still got the same issue.
Then I tried changing the Keras layer name directly and it seems to have solved the issue. So, to conclude this issue: a fix for the above example would be just to add a name (it doesn't even need to be unique, as Keras will automatically take care of that):
...
shortcut = Conv1D(
    filters=residual_shape[CHANNEL_AXIS],
    kernel_size=1,
    strides=stride,
    padding='valid',
    kernel_initializer='he_normal',
    kernel_regularizer=l2(0.0001),
    name='SkipConn'
)(input)
...
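One quick way to sanity-check the fix is to list the variables that were actually created (TF 1.x API; 'SkipConn' as in the snippet above):

import tensorflow as tf

# Each encoder should now own distinctly named skip-connection weights,
# e.g. 'SkipConn/...', 'SkipConn_1/...' and so on:
for v in tf.global_variables():
    if 'SkipConn' in v.name:
        print(v.name, v.shape)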
@Kismuz,
I have recently tried to play around with using separate encoders (following the suggestions from #35).
I took a working encoder and modified 'external' according to the suggestions. The model ran with no errors.
But after I stop training, I get the following error when trying to continue training:
At first I checked the graph on TensorBoard to see if 'conv1d_1' is under 'encoded_external_1', but it wasn't, so I used an OrderedDict to fix the ordering issue. Unfortunately, it didn't solve the fail-to-restore problem.
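For completeness, the OrderedDict change looked roughly like this (the keys and spec values are hypothetical, assuming the per-stream encoders were previously passed as a plain dict):

from collections import OrderedDict

# Make the iteration order of per-stream encoders deterministic across runs
# (plain dicts are unordered in Python < 3.7):
encoder_params = OrderedDict([
    ('external', external_encoder_spec),  # hypothetical spec objects
    ('internal', internal_encoder_spec),
])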
Any suggestions on how to address this issue?