ValueError: Shapes are not compatible when training transition model after autoencoder trained

chqsark commented 8 years ago

Hi,

I trained the autoencoder successfully. However, when I was doing followup steps in training the transition model, I had the problem below.

Thanks a lot for any help!

BTW, I didn't change any code.

chqsark commented 8 years ago

can someone take a look? thanks!

lxgen commented 8 years ago

Have you solved the problem? I have the same problem as you.

EderSantana commented 8 years ago

this is weird, see that you have two outputs with shape (?, 64, 512) and (9, 2, 512). They should be (64, 5, 512) and (64, 9, 512). But K.rnn is messing the shapes up. I'll check what is going on.

In case you want to investigate as well, here is where the bug should be happening https://github.com/commaai/research/blob/master/models/layers.py#L359-L374

What is your keras version by the way?

EderSantana commented 8 years ago

So, for now, the only place I can see that could be causing this problem is the consume_less RNN parameter in Keras. Try changing https://github.com/commaai/research/blob/master/models/transition.py#L41-L42 to:

model.add(DreamyRNN(output_dim=z_dim, output_length=out_leng-1, return_sequences=True,
                    activation="tanh", consume_less="not_cpu", batch_input_shape=(batch_size, time, z_dim)))

Unfortunately I can't reproduce your bug right now. But I'll give you more information as soon as I get an opportunity.

chqsark commented 8 years ago

@EderSantana Thanks a lot for the response. I've tried keras 1.0.6 and 1.0.8, tensorflow 0.9, and 0.10. All gave the same error. I still got the error after changing transition.py as you suggested.

I realized that comma.ai has a fork of keras. Should I use that instead of the original one? Or any specific branch of keras?

EderSantana commented 8 years ago

No, I tried this code on the Keras public release. I think the problem is with the recurrent layer's consume_less parameter, but I can't test it right now :(

chqsark commented 8 years ago

I just tried 'cpu', 'gpu', 'mem' for consume_less parameter. No luck :(

My server.py output is like this:

guan.wang@Z440SJ-243:~/ml/comma/research$ ./server.py --time 60 --batch 64
INFO:__main__:server started
INFO:dask_generator:Loading 9 hdf5 buckets.
x 52722 | t 263583 | f 52722
x 58993 | t 294919 | f 58993
x 19731 | t 98719 | f 19731
x 56166 | t 280785 | f 56166
x 25865 | t 129344 | f 25865
x 85296 | t 426596 | f 85296
x 78463 | t 392182 | f 78463
x 30538 | t 152650 | f 30538
x 51691 | t 258571 | f 51691
training on 436627/459465 examples
INFO:dask_generator:camera files 9
4296.05 ms
X (64, 60, 3, 160, 320) angle (64, 60, 1) speed (64, 60, 1)

EderSantana commented 8 years ago

@chqsark thanks for the information. I'll continue investigating.

kamal94 commented 8 years ago

Suffering the same problem. Any thoughts?

lxgen commented 8 years ago

I have solved the problem by changing the Keras version from 1.0.8 to 1.0.6.
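
Since the fix reported here depends on which Keras version is actually importable, a quick check can rule out a stale install. The following is a minimal sketch, not part of the repository: the helper names are made up for illustration, and it assumes Python 3.8+ for importlib.metadata (the thread itself uses Python 2.7, where `pip show keras` gives the same information).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True only when the installed version string equals the pinned one."""
    return installed is not None and installed.strip() == pinned.strip()


if __name__ == "__main__":
    # "1.0.6" is the version reported as working in this thread.
    found = installed_version("keras")
    print("keras installed:", found, "| matches 1.0.6:", matches_pin(found, "1.0.6"))
```

Running this inside the same environment that launches the training script shows whether the interpreter sees 1.0.6 or an older leftover install.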

chqsark commented 8 years ago

I also solved it by completely removing Keras and installing version 1.0.6. Previously I tried a virtualenv for 1.0.6 and it didn't work; maybe my package system messed it up. Now it runs, but the server side periodically generates the following:

Traceback (most recent call last):
  File "/home/guan.wang/ml/comma/research/dask_generator.py", line 109, in datagen
    X_batch[count] = x[i-es-time_len+1:i-es+1]
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 419, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 91, in select
    sel[args]
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 258, in __getitem__
    start, count, step, scalar = _handle_simple(self.shape,args)
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 509, in _handle_simple
    x,y,z = _translate_slice(arg, length)
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 550, in _translate_slice
    raise ValueError("Reverse-order selections are not allowed")
ValueError: Reverse-order selections are not allowed
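
One way this error can arise, sketched under assumptions: the traceback shows _translate_slice rejecting the slice, and h5py appears to resolve the slice against the dataset length before checking its order. If i-es-time_len+1 is negative, Python wraps it to the end of the dataset, so the resolved start lands after the stop. The function below only mimics that check with the standard library; its name and the sample values are hypothetical and it does not call h5py.

```python
def h5py_style_check(start, stop, length):
    """Mimic the ordering check that raises the error in the traceback.

    slice.indices() resolves negative indices against the length first,
    so a negative start wraps around to the end of the dataset.
    """
    resolved_start, resolved_stop, _step = slice(start, stop).indices(length)
    if resolved_stop < resolved_start:
        raise ValueError("Reverse-order selections are not allowed")
    return resolved_start, resolved_stop


if __name__ == "__main__":
    # Hypothetical values: the window would begin before the first sample.
    i_minus_es, time_len, length = 2, 60, 1000
    start = i_minus_es - time_len + 1  # negative start
    stop = i_minus_es + 1
    try:
        h5py_style_check(start, stop, length)
    except ValueError as err:
        print("rejected:", err)
```

If this is the cause, skipping indices whose window would start before sample 0 should avoid the exception; whether that matches the generator's intent is not confirmed in this thread.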

Yale323 commented 8 years ago

@EderSantana I have the same situation. After installing Keras 1.0.6 and starting the transition training, there are two kinds of errors on the server side. One is "ValueError: Reverse-order selections are not allowed":

Traceback (most recent call last):
  File "/home/yale/research/dask_generator.py", line 109, in datagen
    X_batch[count] = x[i-es-time_len+1:i-es+1]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
  File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 462, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 92, in select
    sel[args]
  File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 259, in __getitem__
    start, count, step, scalar = _handle_simple(self.shape,args)
  File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 443, in _handle_simple
    x,y,z = _translate_slice(arg, length)
  File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 484, in _translate_slice
    raise ValueError("Reverse-order selections are not allowed")
ValueError: Reverse-order selections are not allowed

The other is "could not broadcast input array from shape (5,1) into shape (60,1)":

Traceback (most recent call last):
  File "/home/yale/research/dask_generator.py", line 112, in datagen
    angle_batch[count] = np.copy(angle[i-time_len+1:i+1])[:, None]
ValueError: could not broadcast input array from shape (5,1) into shape (60,1)

Is it related to the different Keras version?
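
The broadcast error means the slice angle[i-time_len+1:i+1] returned fewer than time_len samples. The thread does not establish why, but one way it can happen is that plain slicing silently clips at the end of the data instead of failing. The sketch below illustrates that with a list standing in for the angle array; full_window is a hypothetical guard, not a function from dask_generator.py.

```python
def full_window(i, time_len, n_samples):
    """Return (start, stop) for a window of time_len samples ending at index i,
    or None when the window would run off either end of the data."""
    start = i - time_len + 1
    stop = i + 1
    if start < 0 or stop > n_samples:
        return None
    return start, stop


if __name__ == "__main__":
    angle = list(range(100))  # stand-in for the real angle array
    clipped = angle[95:155]   # asks for 60 samples, silently gets fewer
    print("requested 60, got", len(clipped))
    print("guard for i=154:", full_window(154, 60, len(angle)))
    print("guard for i=59:", full_window(59, 60, len(angle)))
```

Checking the window before copying it into the batch turns a late broadcast failure into an explicit skip.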

andrewraharjo commented 8 years ago

@chqsark how did you completely remove keras ? Was it out of your virtualenv or under your virtualenv or conda environment ?

@EderSantana I got this issue:

  File "/home/dev-box/anaconda2/envs/python2/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 246, in conv2d_shape
    padding)
  File "/home/dev-box/anaconda2/envs/python2/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 184, in get2d_conv_output_size
    (row_stride, col_stride), padding_type)
  File "/home/dev-box/anaconda2/envs/python2/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 149, in get_conv_output_size
    "Filter: %r Input: %r" % (filter_size, input_size))
ValueError: Filter must not be larger than the input: Filter: (8, 8) Input: (3, 160)

Keras 1.0.6 and TF 0.10, Theano 0.8.2

jamesjackson commented 8 years ago

@andrewraharjo I'm seeing the same issue "ValueError: Filter must not be larger than the input: Filter: (8, 8) Input: (3, 160)" on an AWS GPU instance with TF 0.10 and Keras 1.1.0. I don't get the issue if running locally on a MacBook Pro with TF 0.9 and Keras 1.0.8. I'll try setting up the AWS instance with TF 0.9.

andrewraharjo commented 8 years ago

@jamesjackson It seems to be related to the current TF build. Check out this link. I haven't given up on TF yet, but I currently run training with the Theano 0.8.2 backend, Keras 1.1.0, and cuDNN 5.1, though cuDNN 5.0 is the recommended version. If you can't get further with TF, try changing the backend in your keras.json to theano and updating your theanorc file to use the GPU instead of the CPU. By the way, I'm not running on AWS; I use a stationary dev box.
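
The keras.json change mentioned above can be sketched as follows. This assumes the config is a JSON file with a "backend" key, which is how Keras 1.x stores it (by default in ~/.keras/keras.json); set_keras_backend is a hypothetical helper, and the theanorc change is not covered here.

```python
import json
import os


def set_keras_backend(config_path, backend):
    """Rewrite the "backend" entry of a Keras JSON config, keeping other keys."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as handle:
            config = json.load(handle)
    config["backend"] = backend
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config


# Example call (not executed here); the path is the assumed Keras 1.x default:
# set_keras_backend(os.path.expanduser("~/.keras/keras.json"), "theano")
```

Editing the file by hand has the same effect; the helper just keeps the remaining keys untouched.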

EderSantana commented 8 years ago

Guys, if you are using the new TensorFlow and Keras, make sure to pass unroll=True as an input parameter to RNN layers. I had this problem with other layers as well.

jamesjackson commented 8 years ago

Thanks @andrewraharjo , @EderSantana

It appears to be an odd environmental issue related to the packaging and/or Anaconda. I tried several TF/Keras versions, and they all failed in the same way. Building from source and avoiding Anaconda does work.

andrewraharjo commented 8 years ago

@jamesjackson I was thinking along those lines earlier, and I checked with a friend who installed with Anaconda3 and set up a virtualenv for Python 2.7. He could run with TF, and I was confused about why Anaconda2 won't work. Did you solve this problem by building from source and avoiding Anaconda?

EderSantana commented 8 years ago

As a note, my tensorflow was installed from source as well. (but I did use anaconda)

jamesjackson commented 8 years ago

@andrewraharjo Yeah, source-based without Anaconda is working.

chqsark commented 8 years ago

@jamesjackson Yes, source-based without Anaconda, +1.

@andrewraharjo Yes, I completely removed Keras and reinstalled the right version.

skywong1230 commented 8 years ago

I got the same error when running the code to train the transition model.

The error on the server side:

Traceback (most recent call last):
  File "/home/sky/research/dask_generator.py", line 112, in datagen
    angle_batch[count] = np.copy(angle[i-time_len+1:i+1])[:, None]
ValueError: could not broadcast input array from shape (5,1) into shape (60,1)

Does anyone know the solutions?

zhaohuaqing1993 commented 7 years ago

Have you run ./view_generative_model.py transition --name transition successfully?

pandamax commented 7 years ago

Traceback (most recent call last):
  File "/home/deep-learning/research-master/dask_generator.py", line 112, in datagen
    angle_batch[count] = np.copy(angle[i-time_len+1:i+1])[:, None]
ValueError: could not broadcast input array from shape (55,1) into shape (60,1)

The same problem occurred for me.

ahmedyahia3393 commented 6 years ago

When trying to execute view_steering_model.py, I got this error: ValueError: bad marshal data (unknown type code). Here is the output from the command prompt:

Traceback (most recent call last):
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 229, in func_load
    raw_code = codecs.decode(code.encode('ascii'), 'base64')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 46: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "view_steering_model.py", line 94, in <module>
    model = model_from_json(json.load(jfile))
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\models.py", line 349, in model_from_json
    return layer_module.deserialize(config, custom_objects=custom_objects)
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
    list(custom_objects.items())))
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\models.py", line 1349, in from_config
    layer = layer_module.deserialize(conf, custom_objects=custom_objects)
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
    list(custom_objects.items())))
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\layers\core.py", line 711, in from_config
    function = func_load(config['function'], globs=globs)
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 234, in func_load
    code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)

kingxueyuf commented 6 years ago

Looks like the issue is from Keras. Which version are you using?

commaai / research

ValueError: Shapes are not compatible when training transition model after autoencoder trained #22