Training does not end.

avasisht-celadon commented 6 years ago

I have issued the command for training (svhn) as per the instructions. It does not progress at all.

##########################################################################

Command:

python train_svhn.py /home/aditya/stn-ocr/generated/centered/train.csv /home/aditya/stn-ocr/generated/centered/valid.csv --log-dir /home/aditya/stn-ocr -b 400 --lr 1e-5

/home/aditya/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
loading data
2018-10-29 13:53:20,201 Node[0] start with arguments Namespace(batch_size=400, blank_label=0, char_map=None, checkpoint_interval=None, eval_image=None, fix_loc=False, gif=False, gpus=None, ip=None, kv_store='local', load_epoch=None, log_dir='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training', log_file='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training/log', log_level='INFO', log_name='training', lr=1e-05, lr_factor=1, lr_factor_epoch=1, model_prefix=None, num_epochs=10, plot_network_graph=False, port=1337, progressbar=False, save_model_prefix=None, send_bboxes=False, train_file='/home/aditya/stn-ocr/generated/centered/train.csv', val_file='/home/aditya/stn-ocr/generated/centered/valid.csv', video=False, zoom=0.9)
2018-10-29 13:53:20,202 Node[0] EPOCH SIZE: 250
2018-10-29 13:53:20,226 Node[0] Start training with [cpu(0)]

############################################################################

It stops right there. No progress.

Bartzi commented 6 years ago

Do you have a GPU in your machine? Right now you are running on CPU... that is definitely the reason why 'nothing' is happening...
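A quick way to confirm which device a run picked is to pull the `Start training with [...]` line out of the log. The helper below is only a sketch: the name `training_devices` is made up for this example, and the `--gpus 0` value in the trailing comment is an assumption based on the `gpus=None` entry in the Namespace output above, so check the README for the exact form.

```shell
# Hypothetical helper (not part of stn-ocr): print the device list that a
# training log reports in its "Start training with [...]" line.
training_devices() {
    # Keep only the last matching line, then strip everything but the
    # contents of the square brackets.
    grep -o 'Start training with \[[^]]*]' "$1" \
        | tail -n 1 \
        | sed -e 's/^Start training with \[//' -e 's/]$//'
}

# Usage (the log path is a placeholder):
#   training_devices /path/to/training/log
# If this prints something starting with "cpu", the run is CPU-only.
# Assumed form of a GPU run, to be verified against the README:
#   python train_svhn.py <train csv> <valid csv> --gpus 0 --log-dir <dir> -b 400 --lr 1e-5
```

For the log posted above, this would print `cpu(0)`.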

avasisht-celadon commented 6 years ago

Yes sir, absolutely right. I have a GPU but had not enabled the "USE_CUDA" flag in config.mk of "incubator-mxnet". I am recompiling the mxnet repo with "USE_CUDA = 1".

It threw an error:-

92 errors detected in the compilation of "/tmp/tmpxft_00002feb_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1

Now I enabled "USE_CUDNN=1" in make/config.mk

I get an error:-

92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1

Now I enable "USE_NCCL = 1" and set the path "USE_NCCL_PATH = /usr/local/cuda/lib64".

I get an error:-

92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1
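With several flags toggled between builds, it can help to print what the config file currently contains before recompiling. This is a minimal sketch, assuming the flags sit one per line as `NAME = value`; the helper name `show_gpu_build_flags` is made up for this example.

```shell
# Hypothetical helper: print the active (uncommented) GPU-related flags
# mentioned in this thread from a make config file.
show_gpu_build_flags() {
    # Match only lines that start with one of the flag names followed by "=",
    # so commented-out lines and other variables are skipped.
    grep -E '^[[:space:]]*(USE_CUDA|USE_CUDNN|USE_NCCL|USE_NCCL_PATH)[[:space:]]*=' "$1"
}

# Usage (the path is a placeholder):
#   show_gpu_build_flags /path/to/incubator-mxnet/make/config.mk
```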

avasisht-celadon commented 6 years ago

Actually, the errors begin with:-

/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9220): error: argument of type "const void *" is incompatible with parameter of type "const float *"

/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9292): error: argument of type "const void *" is incompatible with parameter of type "const double *"

The rest of the errors are similar.

Let me know what other details you need.

Bartzi commented 6 years ago

I cannot help you with those compile errors :sweat_smile:, but which version of MXNet are you trying to compile?

avasisht-celadon commented 6 years ago

I am compiling the one I downloaded from https://github.com/apache/incubator-mxnet. It is version 1.3, apparently.

Bartzi commented 6 years ago

Please check the README of this repo again! It says that you should use version 0.9.3 of MXNet, because it is not guaranteed to work with newer versions of MXNet...

avasisht-celadon commented 6 years ago

Yes, fine. Thanks for the reply. I checked out v0.9.3, but if I "make" now, I get the error:-

Makefile:27: mshadow/make/mshadow.mk: No such file or directory
Makefile:28: /home/aditya/stn-ocr/incubator-mxnet/dmlc-core/make/dmlc.mk: No such file or directory
Makefile:126: /home/aditya/stn-ocr/incubator-mxnet/ps-lite/make/ps.mk: No such file or directory
make: *** No rule to make target '/home/aditya/stn-ocr/incubator-mxnet/ps-lite/make/ps.mk'. Stop.

Now I re-cloned it using:-

git clone --recursive

I still get the same error.

avasisht-celadon commented 6 years ago

And if I don't check out v0.9.3 and just run make, I get the compilation errors stated above.

Please help. Thanks in advance.

Bartzi commented 6 years ago

You'll also need to check out the submodules :wink:
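Switching to an older tag does not by itself move the submodules to the revisions recorded for that tag, which would explain the missing `.mk` files in the error above. A rough sketch of the steps, assuming the three missing files live in the mshadow, dmlc-core and ps-lite submodules; both helper names are made up for this example and the paths are placeholders.

```shell
# Hypothetical helper: list which of the makefile includes named in the
# error above are missing from an MXNet checkout.
missing_submodule_files() {
    repo_dir="$1"
    for rel in mshadow/make/mshadow.mk dmlc-core/make/dmlc.mk ps-lite/make/ps.mk; do
        [ -f "$repo_dir/$rel" ] || echo "$rel"
    done
}

# Hypothetical helper: check out a tag, then sync the submodules to the
# revisions recorded for that tag.
checkout_with_submodules() {
    repo_dir="$1"
    tag="$2"
    git -C "$repo_dir" checkout "$tag" &&
        git -C "$repo_dir" submodule update --init --recursive
}

# Usage:
#   checkout_with_submodules /path/to/incubator-mxnet v0.9.3
#   missing_submodule_files /path/to/incubator-mxnet   # prints nothing if all three exist
```

If `missing_submodule_files` still lists anything after the update, the submodule checkout did not complete and `make` will fail in the same way.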

Bartzi / stn-ocr

Training does not end. #26