CVLAB-Unibo / Real-time-self-adaptive-deep-stereo

Code for "Real-time self-adaptive deep stereo" - CVPR 2019 (ORAL)
Apache License 2.0

Trying to implement PSMNet using your instructions #51

Open passion3394 opened 4 years ago

passion3394 commented 4 years ago

stereo_net = Nets.get_stereo_net(args.modelName, net_args)

The above code runs fine, but when I start training on scene_flow I get the following error. I have done some debugging: (1) checked the scene_flow input, which looks OK; (2) checked the network, but I did not find any errors.

Could you help me find the bug? Thanks very much. If more files are needed, I will upload them.

Traceback (most recent call last):
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
         [[{{node model/CNN_2/conv0/conv0_1/Conv2D}}]]
  (1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
         [[{{node model/CNN_2/conv0/conv0_1/Conv2D}}]]
         [[validation_error/truediv_1/_23]]
0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Train.py", line 191, in <module>
    main(args)
  File "Train.py", line 140, in main
    fetches = sess.run(tf_fetches, options=run_options)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
         [[node model/CNN_2/conv0/conv0_1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
  (1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
         [[node model/CNN_2/conv0/conv0_1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
         [[validation_error/truediv_1/_23]]
0 successful operations. 0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node model/CNN_2/conv0/conv0_1/Conv2D:
  model/MirrorPad_2 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
  model/CNN/conv0/conv0_1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Original stack trace for 'model/CNN_2/conv0/conv0_1/Conv2D':
  File "Train.py", line 191, in <module>
    main(args)
  File "Train.py", line 71, in main
    val_stereo_net = Nets.get_stereo_net(args.modelName, net_args)
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/__init__.py", line 15, in get_stereo_net
    return STEREO_FACTORYname
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 22, in __init__
    super(PSMNet, self).__init__(**kwargs)
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/Stereo_net.py", line 44, in __init__
    self._build_network(args)
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 54, in _build_network
    conv4_left = self.CNN(self._left_input_batch)
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/PSMNet.py", line 71, in CNN
    activation=tf.nn.leaky_relu, batch_norm=True, apply_relu=True, strides=2, name='conv0_1', reuse=reuse))
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py", line 59, in conv2d
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding=padding, dilations=dilation_rate)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
    name=name)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

AlessioTonioni commented 4 years ago

Without seeing the code I cannot provide much help. Can you create a pull request with your code?

passion3394 commented 4 years ago

@AlessioTonioni Sure, I will create a pull request soon.

passion3394 commented 4 years ago

I have created a pull request, thanks for your help.

passion3394 commented 4 years ago

@AlessioTonioni Any updates?

AlessioTonioni commented 4 years ago

Do the images that you are using have 4 channels for some reason? Like RGB + alpha?

passion3394 commented 4 years ago

@AlessioTonioni Yes, I use the Scene Flow dataset and the images contain 4 channels. I did two things: (1) in Nets/PSMNet.py I printed the shape of left_input_batch in the function _preprocess_inputs; it outputs [1,256,512,3], so I think the input is OK.

(2) I tried to train MADNet on this dataset and got the same error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
         [[node model/gc-read-pyramid_1/conv1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
         [[training_error/Sum_12/_57]]
  (1) Invalid argument: input depth must be evenly divisible by filter depth: 4 vs 3
         [[node model/gc-read-pyramid_1/conv1/Conv2D (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:59) ]]
0 successful operations. 0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node model/gc-read-pyramid_1/conv1/Conv2D:
  model/MirrorPad_1 (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Data_utils/preprocessing.py:28)
  model/gc-read-pyramid/conv1/weights/read (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

Original stack trace for 'model/gc-read-pyramid_1/conv1/Conv2D':
  File "Train.py", line 191, in <module>
    main(args)
  File "Train.py", line 63, in main
    stereo_net = Nets.get_stereo_net(args.modelName, net_args)
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/__init__.py", line 15, in get_stereo_net
    return STEREO_FACTORYname
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 24, in __init__
    super(MadNet, self).__init__(**kwargs)
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/Stereo_net.py", line 44, in __init__
    self._build_network(args)
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 260, in _build_network
    self._pyramid_features(self._right_input_batch, scope='gc-read-pyramid', reuse=True, layer_prefix='right')
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/MadNet.py", line 183, in _pyramid_features
    )[-1].value, 16], strides=2, name='conv1', bName='biases', activation=activation))
  File "/host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py", line 59, in conv2d
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding=padding, dilations=dilation_rate)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
    name=name)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

So how can I fix this error?
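A quick, hypothetical way to double-check what the image files on disk actually contain, independently of the TensorFlow pipeline, is a small Pillow check along these lines (the path is just a placeholder for one of the Scene Flow PNGs):

# Hypothetical check, not part of the repo: inspect the raw channel count
# of an image before it enters the TF input pipeline.
from PIL import Image
import numpy as np

img = Image.open("scene_flow/left/0000000.png")  # placeholder path
arr = np.array(img)
print(img.mode, arr.shape)  # e.g. "RGBA (540, 960, 4)" would explain the 4-vs-3 error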

AlessioTonioni commented 4 years ago

If you get the same error with MADNet and (I guess) Dispnet then I think you have either a problem with your data or with the version of tensorflow you are using. Can you share a stereo couple + gt from the dataset that you are trying to use? Can you run Stereo_Online_Adaptation.py? Or do you get an error there as well?

passion3394 commented 4 years ago

Firstly, a sample of the stereo couple + gt has been attached. sceneflow_samples.tar.gz

Second, I tried to run Stereo_Online_Adaptation.py with the following script:

LIST="/host/nfs/hs/scene_flow/val.list"   # the one described at step (1)
OUTPUT="_sf_output/"
WEIGHTS="pretrained/MADNet/synthetic/weights.ckpt"
MODELNAME="MADNet"
BLOCKCONFIG="block_config/MadNet_full.json"

python3 Stereo_Online_Adaptation.py \
    -l ${LIST} \
    -o ${OUTPUT} \
    --weights ${WEIGHTS} \
    --modelName ${MODELNAME} \
    --blockConfig ${BLOCKCONFIG} \
    --mode FULL \
    --imageShape 256 512 \
    --sampleMode PROBABILITY \
    --logDispStep 1

I got the following error, which is not the same as the previous one.

WARNING:tensorflow:From /root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
         [[{{node save/Assign_75}}]]
         [[save/RestoreV2/_42]]
  (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
         [[{{node save/Assign_75}}]]
0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/root/anaconda3/envs/MADNet/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
         [[node save/Assign_75 (defined at Stereo_Online_Adaptation.py:152) ]]
         [[save/RestoreV2/_42]]
  (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]
         [[node save/Assign_75 (defined at Stereo_Online_Adaptation.py:152) ]]
0 successful operations. 0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node save/Assign_75:
  model/gc-read-pyramid/conv1/weights (defined at /host/nfs/hs/code/research/Real-time-self-adaptive-deep-stereo/Nets/sharedLayers.py:57)

AlessioTonioni commented 4 years ago

Firstly, a sample of the stereo couple + gt has been attached. sceneflow_samples.tar.gz

Your images have 4 channels, RGB + alpha. I just pushed a commit to explicitly remove the extra channel after reading the images. Let me know if it fixes your problems.

The second error you are getting seems to be related to the operation that restores the weights of the first conv. I think it's still an issue with the number of channels in the input image (4 in your images rather than 3), so it might be fixed as well. But if it's not working, let me know.
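As an illustration only (the exact reader lives in Data_utils, so the function and variable names below are placeholders, not the repo's actual code), forcing the decoder to drop a possible alpha channel looks roughly like this in TF 1.x:

# Sketch of dropping an alpha channel at decode time; illustrative only.
import tensorflow as tf

def read_rgb_image(image_path):
    raw = tf.read_file(image_path)
    img = tf.image.decode_png(raw, channels=3)  # force RGB, discarding any alpha channel
    # an equivalent option after decoding with default channels is img = img[:, :, :3]
    return tf.cast(img, tf.float32)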

passion3394 commented 4 years ago

The first error about Train.py is fixed.

The second error still exists, and it is the same error as before.

AlessioTonioni commented 4 years ago

I cannot replicate the error on my side, which version of tensorflow are you using?

The error seems to be related to how weights are restored.
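One way to narrow it down (just a debugging suggestion, not code from the repo) is to list the variable shapes stored in the checkpoint and compare them with the shapes the graph builds:

# Hypothetical snippet: print the stored shape of the first pyramid conv weights (TF 1.x).
import tensorflow as tf

ckpt = "pretrained/MADNet/synthetic/weights.ckpt"
for name, shape in tf.train.list_variables(ckpt):
    if "gc-read-pyramid/conv1/weights" in name:
        print(name, shape)  # the pretrained filter should be [3, 3, 3, 16] for RGB input
# If the checkpoint stores [3, 3, 3, 16] but the graph builds [3, 3, 2, 16],
# the mismatch is on the graph side (the input tensor was built with 2 channels).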

passion3394 commented 4 years ago

I installed tensorflow-gpu with pip; I couldn't install TF 1.12, so I installed TF 1.14.0.

One more piece of information: I ran Stereo_Online_Adaptation.py on another dataset whose images have 3 channels, and the result is OK.

AlessioTonioni commented 4 years ago

I'm testing with 1.12, the images you sent, and the weights available online, and everything seems to work.

Any other insight on what is happening?

passion3394 commented 4 years ago

So far, what is strange about this issue is: (1) using 4-channel images to run Stereo_Online_Adaptation.py, I get: Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [3,3,2,16] rhs shape= [3,3,3,16]

(2) using 3-channel images to run Stereo_Online_Adaptation.py, everything runs OK.

I suspect it is related to the checkpoint having been trained on 3-channel images. I will convert the 4-channel images to 3-channel images to verify this.

At the same time, I continued training PSMNet on the Scene Flow dataset and got NaN in the losses; I suspect the NaN issue lies in the PSMNet network. Do you have any suggestions?

Step:22800 Loss:17.37 f/b time:0.603486 Missing time:1 day, 1:53:40.462395
Step:22900 Loss:33.37 f/b time:0.602048 Missing time:1 day, 1:48:58.175722
Step:23000 Loss:36.77 f/b time:0.601132 Missing time:1 day, 1:45:36.700135
Step:23100 Loss:28.94 f/b time:0.596681 Missing time:1 day, 1:33:10.308483
Step:23200 Loss:18.60 f/b time:0.598654 Missing time:1 day, 1:37:14.649814
Step:23300 Loss:17.36 f/b time:0.595730 Missing time:1 day, 1:28:44.552802
Step:23400 Loss:24.45 f/b time:0.596634 Missing time:1 day, 1:30:04.066876
Step:23500 Loss:28.62 f/b time:0.598686 Missing time:1 day, 1:34:19.910125
Step:23600 Loss:63.38 f/b time:0.599219 Missing time:1 day, 1:34:41.985614
Step:23700 Loss:66.96 f/b time:0.599685 Missing time:1 day, 1:34:53.615409
Step:23800 Loss:104.30 f/b time:0.599060 Missing time:1 day, 1:32:17.753413
Step:23900 Loss:51.45 f/b time:0.595160 Missing time:1 day, 1:21:19.704973
Step:24000 Loss:31.38 f/b time:0.598833 Missing time:1 day, 1:29:43.103316
Step:24100 Loss:18.41 f/b time:0.601884 Missing time:1 day, 1:36:30.574517
Step:24200 Loss:17.72 f/b time:0.596067 Missing time:1 day, 1:20:39.928160
Step:24300 Loss:64.83 f/b time:0.591668 Missing time:1 day, 1:08:27.451441
Step:24400 Loss:94.23 f/b time:0.594829 Missing time:1 day, 1:15:31.437283
Step:24500 Loss:20.95 f/b time:0.599308 Missing time:1 day, 1:25:56.208433
Step:24600 Loss:15.28 f/b time:0.598947 Missing time:1 day, 1:24:01.202895
Step:24700 Loss:21.73 f/b time:0.596817 Missing time:1 day, 1:17:36.314750
Step:24800 Loss:17.40 f/b time:0.590698 Missing time:1 day, 1:01:03.701733
Step:24900 Loss:nan f/b time:0.589800 Missing time:1 day, 0:57:47.887786
Step:25000 Loss:nan f/b time:0.585932 Missing time:1 day, 0:46:59.878661
Step:25100 Loss:nan f/b time:0.589257 Missing time:1 day, 0:54:27.297112
Step:25200 Loss:nan f/b time:0.588775 Missing time:1 day, 0:52:15.065096
Step:25300 Loss:nan f/b time:0.585430 Missing time:1 day, 0:42:47.741276
Step:25400 Loss:nan f/b time:0.581155 Missing time:1 day, 0:31:00.059053
Step:25500 Loss:nan f/b time:0.587691 Missing time:1 day, 0:46:33.836580
Step:25600 Loss:nan f/b time:0.586745 Missing time:1 day, 0:43:11.544342
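One way to pin down where the NaN first appears (a minimal sketch, assuming the training loss tensor is accessible in Train.py; the tensors below are stand-ins, not the repo's code) is to wrap the loss in tf.check_numerics so the session fails at the first non-finite value instead of silently logging nan:

# Illustrative only (TF 1.x): the variables here are stand-ins for the real network output.
import tensorflow as tf

w = tf.Variable(1.0)
prediction = w * tf.random_normal([1, 256, 512, 1])  # stand-in for the predicted disparity
gt = tf.random_normal([1, 256, 512, 1])              # stand-in for the ground truth

loss = tf.reduce_mean(tf.abs(prediction - gt))
checked_loss = tf.check_numerics(loss, message="training loss became NaN/Inf")
train_op = tf.train.AdamOptimizer(1e-4).minimize(checked_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([checked_loss, train_op])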

AlessioTonioni commented 4 years ago

As for 1, which images are you using? The ones you sent me? Because I'm able to use Stereo_Online_Adaptation.py without any issue on them.

As for PSMNet, the loss seems quite high after 25K steps. Is it going down? What does the plot in TensorBoard look like? Do the predictions start to look like anything reasonable? Otherwise there might be some issue with the architecture.

passion3394 commented 4 years ago

For 1, I use the same images as you tested.

I could not check TensorBoard because the GPU is in the cloud, but I will try to visualize it in TensorBoard.

passion3394 commented 4 years ago

I have run TensorBoard on the remote machine. Could you help analyze the error? I think the error is in the network.

(Screenshots attached: PSMNet_error, PSMNet_error2)

AlessioTonioni commented 4 years ago

Are you using the reprojection loss to train the network from scratch? I would advise you to use the supervised loss as done in train.py.

In general, from these plots you can see that the loss is not going down at all, so there is definitely either some implementation problem in the network or some remaining trouble with your data. If you train Dispnet or MADNet, are you able to see the loss going down?
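For reference, the supervised loss suggested above is usually a mean absolute disparity error restricted to pixels with valid ground truth; a minimal sketch (assumed names, not the repo's actual implementation) could look like this:

# Minimal sketch of a masked L1 supervised disparity loss (TF 1.x), assumed names.
import tensorflow as tf

def masked_l1_loss(pred_disp, gt_disp, max_disp=192.0):
    # Valid pixels are assumed to have ground truth disparity in (0, max_disp).
    valid = tf.cast(tf.logical_and(gt_disp > 0.0, gt_disp < max_disp), tf.float32)
    abs_err = tf.abs(pred_disp - gt_disp) * valid
    return tf.reduce_sum(abs_err) / tf.maximum(tf.reduce_sum(valid), 1.0)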

AkshatVashisht commented 3 years ago

Does it do depth estimation in real time? Does it produce depth continuously from the cameras?