Open avasisht-celadon opened 6 years ago
Do you have a GPU in your machine? Right now you are running on CPU... that is definitely the reason why 'nothing' is happening...
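A quick way to check this, sketched under assumptions: `count_gpus` is a hypothetical helper that relies on `nvidia-smi` (installed with the NVIDIA driver), and the `--gpus` flag mentioned below is only inferred from the `gpus=None` entry in the argument dump of the training log, not confirmed against the script.

```shell
# Hypothetical helper: print how many NVIDIA GPUs the driver can see.
# Prints 0 when nvidia-smi is not installed or reports no devices.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # "nvidia-smi -L" lists one line per device, each starting with "GPU ".
        nvidia-smi -L 2>/dev/null | grep -c '^GPU '
    else
        echo 0
    fi
}

count_gpus
```

If this prints 1 or more, the training script can presumably be pointed at the first device by appending `--gpus 0` to the `train_svhn.py` command. MXNet itself must also have been built with CUDA support for that to have any effect.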
Yes sir, absolutely right. I have a GPU but had not enabled the "USE_CUDA" flag in config.mk of "incubator-mxnet". I am recompiling the mxnet repo with "USE_CUDA = 1".
It threw an error:
92 errors detected in the compilation of "/tmp/tmpxft_00002feb_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1
Next I enabled "USE_CUDNN = 1" in make/config.mk.
I get an error:
92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1
Then I enabled "USE_NCCL = 1" and set the path "USE_NCCL_PATH = /usr/local/cuda/lib64".
I get an error:
92 errors detected in the compilation of "/tmp/tmpxft_00003084_00000000-12_cudnn_batch_norm.compute_70.cpp1.ii".
Makefile:465: recipe for target 'build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o' failed
make: *** [build/src/operator/nn/cudnn/cudnn_batch_norm_gpu.o] Error 1
Actually, the errors begin with:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9220): error: argument of type "const void *" is incompatible with parameter of type "const float *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9292): error: argument of type "const void *" is incompatible with parameter of type "const double *"
The rest of the errors are similar.
Let me know what other details you need.
I can not help you with those compile errors :sweat_smile:, but which version of MXNet are you trying to compile?
I am compiling the one I downloaded from https://github.com/apache/incubator-mxnet. It is version 1.3, apparently.
Please check the README of this repo again! It says that you should use version 0.9.3 of MXNet, because it is not guaranteed to work with newer versions of MXNet...
Yes, fine. Thanks for the reply. I checked out v0.9.3, but if I run "make" now, I get the error:
Makefile:27: mshadow/make/mshadow.mk: No such file or directory
Makefile:28: /home/aditya/stn-ocr/incubator-mxnet/dmlc-core/make/dmlc.mk: No such file or directory
Makefile:126: /home/aditya/stn-ocr/incubator-mxnet/ps-lite/make/ps.mk: No such file or directory
make: *** No rule to make target '/home/aditya/stn-ocr/incubator-mxnet/ps-lite/make/ps.mk'. Stop.
Now I re-cloned it using:
git clone --recursive
I still get the same error.
And if I don't check out v0.9.3 and run make, I get the compilation errors stated above.
Please help. Thanks in advance.
You'll also need to check out the submodules :wink:
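A minimal sketch of what that could look like, assuming the clone lives in a directory such as `incubator-mxnet` and the tag is named `v0.9.3` as mentioned above. The Makefile at that tag looks for `mshadow`, `dmlc-core` and `ps-lite` inside the source tree (see the paths in the error), so after switching tags the submodules have to be fetched again at the commits that tag records. `checkout_with_submodules` is a hypothetical helper name, not something shipped with MXNet.

```shell
# Hypothetical helper: check out a tag and bring the submodules in line with it.
# Usage: checkout_with_submodules <repo_dir> <tag>
checkout_with_submodules() {
    repo_dir="$1"
    tag="$2"
    # Switch the main repository to the requested tag.
    git -C "$repo_dir" checkout "$tag" &&
    # Refresh submodule URLs in case .gitmodules differs at this tag.
    git -C "$repo_dir" submodule sync --recursive &&
    # Fetch and check out every submodule at the commit the tag records.
    git -C "$repo_dir" submodule update --init --recursive
}
```

With the directory name assumed above, this would be called as `checkout_with_submodules incubator-mxnet v0.9.3`, followed by running `make` again from inside the repository.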
I have issued the command for training (svhn) as per the instructions. It does not progress at all.
##########################################################################
Command : python train_svhn.py /home/aditya/stn-ocr/generated/centered/train.csv /home/aditya/stn-ocr/generated/centered/valid.csv --log-dir /home/aditya/stn-ocr -b 400 --lr 1e-5
/home/aditya/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
loading data
2018-10-29 13:53:20,201 Node[0] start with arguments Namespace(batch_size=400, blank_label=0, char_map=None, checkpoint_interval=None, eval_image=None, fix_loc=False, gif=False, gpus=None, ip=None, kv_store='local', load_epoch=None, log_dir='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training', log_file='/home/aditya/stn-ocr/2018-10-29T13:53:16.415078_training/log', log_level='INFO', log_name='training', lr=1e-05, lr_factor=1, lr_factor_epoch=1, model_prefix=None, num_epochs=10, plot_network_graph=False, port=1337, progressbar=False, save_model_prefix=None, send_bboxes=False, train_file='/home/aditya/stn-ocr/generated/centered/train.csv', val_file='/home/aditya/stn-ocr/generated/centered/valid.csv', video=False, zoom=0.9)
2018-10-29 13:53:20,202 Node[0] EPOCH SIZE: 250
2018-10-29 13:53:20,226 Node[0] Start training with [cpu(0)]
############################################################################
It stops right there. No progress.