LinHungShi / GCNetwork


Problem running the training script #10

Open jucab opened 7 years ago

jucab commented 7 years ago

Hi, congrats on your nice work. I am trying to run the training script, but I get the following error in gcnetwork.py: "TypeError: Value passed to parameter 'paddings' has DataType float32 not in list of allowed values: int32, int64". It is raised when calling
cv = Lambda(getCostVolume, arguments = {'max_d':d/2}, output_shape = (d/2, None, None, num_filters * 2))(unifeature)
Any suggestions? Thanks, Julian

LinHungShi commented 7 years ago

Did you change any hyperparameters in the hyperparam.json file? Which Python version are you using? In Python 3.x, the / operator returns a float even when both operands are integers, so d/2 is no longer an int. You have to convert the result of the division to an integer explicitly. Replace

cv = Lambda(getCostVolume, arguments = {'max_d':d/2}, output_shape = (d/2, None, None, num_filters * 2))(unifeature)

with

cv = Lambda(getCostVolume, arguments = {'max_d':int(d/2)}, output_shape = (int(d/2), None, None, num_filters * 2))(unifeature)
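The float comes from Python 3's true division. A minimal sketch of the difference, assuming a hypothetical value d = 192 (the real value is read from hyperparam.json and may differ):

```python
# Hypothetical maximum disparity; the real value lives in hyperparam.json.
d = 192

true_div = d / 2       # Python 3: always a float, even for a whole result
floor_div = d // 2     # floor division of two ints stays an int
cast_div = int(d / 2)  # the explicit conversion suggested above

print(type(true_div).__name__, true_div)
print(type(floor_div).__name__, floor_div)
print(cast_div == floor_div)
```

For a non-negative d, int(d/2) and d // 2 give the same integer, so either form should satisfy an argument that only accepts int32 or int64.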

jucab commented 7 years ago

Yes, I am using Python 3.5 and I have not changed the hyperparameters. I have replaced the code and the previous error is gone. However, now I get "ValueError: Operands could not be broadcast together with shapes (12, None, None, 64) (96, None, None, 64)" with this traceback:

File "train.py", line 100, in <module>
  trainSceneFlowData(hp, tp, up, env, callbacks, weight_path = weight_path)
File "train.py", line 64, in trainSceneFlowData
  model = createGCNetwork(hp, tp, pre_weight)
File "src/gcnetwork.py", line 154, in createGCNetwork
  disp_map = LearnReg(cv, num_filters, ksize, ds_stride, resnet, padding, highway_func, num_down_conv)
File "src/gcnetwork.py", line 127, in LearnReg
  up_convs = add([deconv, down_convs[i+1]])

Thanks
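The message says the two tensors passed to add() disagree on their first listed dimension (12 versus 96). A minimal pure-Python sketch of the kind of shape check an element-wise merge performs, written here as an illustration and not taken from the Keras source; the function name is hypothetical:

```python
def elementwise_output_shape(shape_a, shape_b):
    """Return the broadcast shape of two equal-rank shapes, or raise ValueError.

    None stands for a dimension that is only known at run time.
    """
    if len(shape_a) != len(shape_b):
        raise ValueError("this sketch only handles shapes of equal rank")
    out = []
    for a, b in zip(shape_a, shape_b):
        if a is None or b is None:
            out.append(None)   # unknown size: cannot be checked statically
        elif a == 1:
            out.append(b)      # a size-1 axis stretches to match the other
        elif b == 1:
            out.append(a)
        elif a == b:
            out.append(a)
        else:
            raise ValueError(
                "Operands could not be broadcast together with shapes "
                "{} {}".format(shape_a, shape_b))
    return tuple(out)


message = None
try:
    elementwise_output_shape((12, None, None, 64), (96, None, None, 64))
except ValueError as err:
    message = str(err)
    print(message)
```

Under this rule, 12 and 96 are both known, unequal, and neither is 1, so the addition is rejected before any data flows through the graph.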

LinHungShi commented 7 years ago

Hi, I have updated the code, please download the new version, and run "python train.py" to see if it works.

jucab commented 7 years ago

Hi. I have had to unify the use of tabs and spaces in the files (Python 3.5 complains a lot ...) and I have had to make some minor modifications to the load_pfm file. It is now running, although it is very slow. I am running it on a TITAN X, but the program does not seem to be using it properly: the Volatile GPU-Util percentage is 0% most of the time while it runs. Any idea what I am missing? Thanks
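Python 3 refuses to compile a block whose lines are indented with tabs in one place and spaces in another, which Python 2 tolerated by default. A minimal sketch of how to check a piece of source for this, assuming a hypothetical helper name and a made-up snippet:

```python
def indentation_error(source):
    """Return the TabError message for source, or None if it compiles."""
    try:
        compile(source, "<snippet>", "exec")
    except TabError as err:
        return err.msg
    return None


# Line 2 is indented with a tab, line 3 with eight spaces.
mixed = "if True:\n\tx = 1\n        y = 2\n"
# The same code indented consistently with four spaces.
clean = "if True:\n    x = 1\n    y = 2\n"

print(indentation_error(mixed))
print(indentation_error(clean))
```

The standard-library tabnanny module performs a similar check on whole files (python -m tabnanny followed by a file or directory). When converting by hand, re-indenting line by line can silently move a statement into or out of a loop, so it is worth comparing the result against the original.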

LinHungShi commented 7 years ago

Are you sure that you’re running the job on GPU?

-- Hung Shi Lin Data Science Institute, Columbia University, New York, New York, U.S

jucab commented 7 years ago

I think so. With the log_device_placement flag I can see that the operations are assigned to the GPU. The GPU memory is also allocated.

LinHungShi commented 7 years ago

I don't know how that happened, but you can look at a similar issue here: https://github.com/tensorflow/tensorflow/issues/543

jucab commented 7 years ago

Thanks. Since I have been changing tabs and spaces, I may have unintentionally changed the code. Could you check that the GC network is right in this file? Thanks. gcnetwork.py.tar.gz

jucab commented 7 years ago

I found the problem. It was in the generator.py file, where I had messed up the tabs and spaces. It is now running properly on the GPU. By the way, I get this warning: "UserWarning: Update your fit_generator call to the Keras 2 API: fit_generator(<generator..., validation_steps=880, callbacks=[<keras.ca..., validation_data=<generator..., steps_per_epoch=3520, epochs=50, max_queue_size=1)", but I think you are already calling fit_generator with Keras 2, aren't you?

LinHungShi commented 7 years ago

It seems they changed the API in Keras 2.0. It is just a warning, though.
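As far as I understand it, Keras 2 prints this warning when fit_generator receives Keras 1 argument names and translates them to the new ones. A minimal sketch of that rename in plain Python; the mapping reflects my understanding of the Keras 2 changes and should be checked against the installed version, and the example arguments are hypothetical, not copied from train.py:

```python
# Keras 1 argument name -> Keras 2 argument name (assumed mapping).
LEGACY_TO_KERAS2 = {
    "samples_per_epoch": "steps_per_epoch",
    "nb_epoch": "epochs",
    "nb_val_samples": "validation_steps",
    "max_q_size": "max_queue_size",
    "nb_worker": "workers",
    "pickle_safe": "use_multiprocessing",
}


def modernize_kwargs(kwargs):
    """Return a copy of kwargs with Keras 1 names replaced by Keras 2 names."""
    return {LEGACY_TO_KERAS2.get(name, name): value
            for name, value in kwargs.items()}


# Hypothetical legacy-style call arguments.
legacy = {"samples_per_epoch": 3520, "nb_epoch": 50, "max_q_size": 1}
print(modernize_kwargs(legacy))
```

A rename alone may not preserve the meaning: in Keras 1, samples_per_epoch counted samples, while steps_per_epoch in Keras 2 counts batches, so the value may need to be divided by the batch size.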