Questions about training the network

guoyan1991 commented 6 years ago

I have two questions and I really hope to get your help.

First, I have a big problem with the training data. I want to use the ShapeNet Core dataset to repeat your experiments, so I'm going to convert the .mat files into .ply files. I've found that I can copy the vertex array and the faces array directly from the .mat file into the .ply file, but this approach is too complex. Is there a simpler way to do it?
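For reference, the direct copy described above can be scripted. The following is a minimal sketch in Python (not part of the repository), assuming the vertex and face arrays have already been loaded from the .mat file, for example with scipy.io.loadmat. The field names inside the .mat file are not given in this thread, so loading is left out. `write_ascii_ply` and `to_zero_based` are hypothetical helper names. MATLAB face indices are usually 1-based, while PLY expects 0-based indices.

```python
def to_zero_based(faces):
    """Convert 1-based (MATLAB-style) face indices to 0-based indices."""
    return [[index - 1 for index in face] for face in faces]


def write_ascii_ply(path, vertices, faces):
    """Write a mesh to an ASCII PLY file.

    vertices: sequence of (x, y, z) coordinates.
    faces: sequence of 0-based vertex-index lists, one list per face.
    """
    lines = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(vertices)}",
        "property float x",
        "property float y",
        "property float z",
        f"element face {len(faces)}",
        "property list uchar int vertex_indices",
        "end_header",
    ]
    # One line per vertex: "x y z".
    for x, y, z in vertices:
        lines.append(f"{x} {y} {z}")
    # One line per face: vertex count followed by the vertex indices.
    for face in faces:
        lines.append(" ".join(str(value) for value in (len(face), *face)))
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
```

A call such as write_ascii_ply("chair.ply", vertices, to_zero_based(faces)) would then produce a file that a .ply-based rendering tool can read, provided the mesh itself is valid.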

Second, I ran into the following problem when training the AllVP network with a single 3D shape (depth maps from 20 views only). I don't know whether it is caused by incorrect input.

/install/torch/install/bin/luajit: /install/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:44: assertion failed!
stack traceback:
    [C]: in function 'assert'
    ...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:44: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:60: in function <...torch/install/share/lua/5.1/cudnn/BatchNormalization.lua:59>
    [C]: in function 'xpcall'
    /install/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /install/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
    /install/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
    /install/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
    2_train.lua:300: in function 'opfunc'
    /install/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    2_train.lua:406: in main chunk
    [C]: in function 'dofile'
    main.lua:130: in main chunk
    [C]: in function 'dofile'
    ...tall/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405e90

Thanks for your help.

guoyan1991 commented 6 years ago

I set opt.conditional=false, so model = nn.gModule({input},{reconstruction,mean,log_var}), and the error occurs at this statement: reconstruction,mean,log_var,predClassSores=unpack(model:forward(droppedInputs)). I tried to debug the code and found that the problem is in the call model:forward(droppedInputs). 'droppedInputs' is a 1×20×224×224 torch.CudaTensor. I don't know what the correct input for this nn.gModule is. Thanks for your help.

Amir-Arsalan commented 6 years ago

@guoyan1991 I'm not sure what you mean by the .mat files. We did not use any .mat files for this work except when we were processing the NYUD data set to get the objects (chairs) out. Could you elaborate on this?

Regarding the error you're getting: I'm afraid you have modified the code, because I do not see any code at lines 300 and/or 406. Based on your second post, I guess the problem is that you need to feed more than one sample at a time to the network. The BatchNormalization layers expect 4D tensors [N x C x R x R], where N is the number of 3D shapes, C is the number of channels (20 here) and R is the resolution. So you need to input at least two samples to the network, such that N >= 2. Let me know if this resolves the issue.
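As a rough illustration of that shape requirement, here is a small sketch in Python rather than the repository's Lua code. `check_batch_shape` is a hypothetical helper, and the defaults (20 channels, at least 2 samples) simply mirror the numbers mentioned in this thread.

```python
def check_batch_shape(shape, channels=20, min_samples=2):
    """Return None if `shape` is an acceptable [N, C, R, R] batch, else a message.

    The defaults follow the values discussed in the thread; they are
    assumptions, not values read from the repository's code.
    """
    if len(shape) != 4:
        return f"expected a 4D tensor [N x C x R x R], got {len(shape)}D"
    n, c, height, width = shape
    if c != channels:
        return f"expected {channels} channels, got {c}"
    if height != width:
        return f"expected a square resolution, got {height}x{width}"
    if n < min_samples:
        return f"expected at least {min_samples} samples in the batch, got {n}"
    return None
```

With this check, the shape (1, 20, 224, 224) reported above is rejected because N is 1, while (2, 20, 224, 224) passes.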

guoyan1991 commented 6 years ago

First of all, thank you very much for your reply. I downloaded the ShapeNet Core dataset from https://www.shapenet.org/. I guess the PASCAL 3D release 1.0 dataset can be used to repeat your experiments. However, the CAD models in the compressed files are all .mat files, which can only be opened with MATLAB. I can't use /renderDepth/runRendering.bat to get 20 depth maps from these .mat files because they are not object files. Did I download the right dataset? This question confuses me, and I look forward to your answer.

For the error, thank you for your suggestion. I will continue to try and give you a reply soon.

Amir-Arsalan commented 6 years ago

@guoyan1991 We did not use the PASCAL 3D data set so I am not sure how you can use that. To use the rendering tool we have provided, you need to have access to the .ply files of the 3D meshes. If you want to use ShapeNet Core you can download the pre-processed data set through the links provided in the repository so that you can skip the rendering part (unless you want to render from views different than what we used).

guoyan1991 commented 6 years ago

Thank you very much. I used two models from the preprocessed data set you provided for training, and the two problems have been solved. But there is a new problem, as follows:

cudnnFindConvolutionBackwardDataAlgorithm failed: 2 convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA4,420,56,56 -filtA420,280,4,4 4,280,112,112 -padA1,1 -convStrideA2,2 CUDNN_DATA_FLOAT
/install/torch/install/bin/luajit: /install/torch/install/share/lua/5.1/nn/Container.lua:67:
In 12 module of nn.Sequential:
/install/torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionBackwardDataAlgorithm failed, sizes: convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA4,420,56,56 -filtA420,280,4,4 4,280,112,112 -padA1,1 -convStrideA2,2 CUDNN_DATA_FLOAT
stack traceback:
    [C]: in function 'error'
    /install/torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'backwardDataAlgorithm'
    ...h/install/share/lua/5.1/cudnn/SpatialFullConvolution.lua:88: in function <...h/install/share/lua/5.1/cudnn/SpatialFullConvolution.lua:83>
    [C]: in function 'xpcall'
    /install/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /install/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
    /install/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
    /install/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
    2_train.lua:289: in function 'opfunc'
    /install/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    2_train.lua:377: in main chunk
    [C]: in function 'dofile'
    main.lua:130: in main chunk
    [C]: in function 'dofile'
    ...tall/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405e90

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /install/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /install/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
    /install/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
    /install/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
    2_train.lua:289: in function 'opfunc'
    /install/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    2_train.lua:377: in main chunk
    [C]: in function 'dofile'
    main.lua:130: in main chunk
    [C]: in function 'dofile'
    ...tall/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405e90

This time I used the original version of the code. Line 289 of 2_train.lua is: reconstruction,mean,log_var,predClassSores=unpack(model:forward(droppedInputs)). I could not find the cause of this problem or a solution to it on the Internet. Thanks for your help!

Amir-Arsalan commented 6 years ago

@guoyan1991 I'm not sure why you are getting this error but it seems that cuDNN is complaining. What's the cuDNN version you are using? I just started re-training a model with cuDNN 7.05 and CUDA 8.0 and it works fine.

guoyan1991 commented 6 years ago

Thank you very much for your help. This problem was caused by the GPU memory.

guoyan1991 commented 6 years ago

The GPU did not have enough memory.

Amir-Arsalan commented 6 years ago

@guoyan1991 You can change the argument opt.nCh and set it to a lower value, or reduce opt.batchSize. If I remember correctly, you need about 6-7 GB of GPU memory with the default parameters.
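To see why those two settings matter, the sketch below (Python, purely illustrative) computes the size of a single float32 tensor from its shape. The shapes are read from the cuDNN error message earlier in this thread, on the assumption that 4,420,56,56 is the layer input and 4,280,112,112 the layer output. `tensor_bytes` is a hypothetical helper, and this counts one tensor only, not the network's total memory use.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory taken by one dense tensor (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shapes taken from the cuDNN error message (interpretation is an assumption).
layer_input = (4, 420, 56, 56)
layer_output = (4, 280, 112, 112)

# Halving the batch dimension halves the size of each activation tensor.
smaller_batch_output = (2, 280, 112, 112)
```

Here tensor_bytes(layer_output) is 56,197,120 bytes (about 54 MiB), and tensor_bytes(smaller_batch_output) is exactly half of that. The same linear scaling applies to the channel dimension, assuming opt.nCh scales the channel counts, which this thread does not confirm.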

Amir-Arsalan / Synthesize3DviaDepthOrSil

Questions about training the network #3