xudekuan closed this issue 6 years ago.
Of course! Hope you get a good result :-)
Thank you very much! When we ran it on the GPU, the following errors occurred:
Using gpu device 0: TITAN Xp (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 6021)
/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
  warnings.warn(warn)
[13 Sep 15:57:39 INFO] =====config=====
[13 Sep 15:57:39 INFO] "maxout": 2
[13 Sep 15:57:39 INFO] "index_unk_trg": 1
[13 Sep 15:57:39 INFO] "num_vocab_src": 30001
[13 Sep 15:57:39 INFO] "try_iter": 100000
[13 Sep 15:57:39 INFO] "clip": 1.0
[13 Sep 15:57:39 INFO] "sort_batches": 20
[13 Sep 15:57:39 INFO] "trg_mono_shuf":
[13 Sep 15:57:39 INFO] "batchsize": 80
[13 Sep 15:57:39 INFO] "sampleN": 100
[13 Sep 15:57:39 INFO] "src_text": data/guwentrain.src
[13 Sep 15:57:39 INFO] "index_eos_src": 30000
[13 Sep 15:57:39 INFO] "trg_mono":
[13 Sep 15:57:39 INFO] "auto_lambda_1": 1
[13 Sep 15:57:39 INFO] "MRT_alpha": 0.005
[13 Sep 15:57:39 INFO] "dim_emb_trg": 620
[13 Sep 15:57:39 INFO] "beta2_adam": 0.999
[13 Sep 15:57:39 INFO] "test_ref": []
[13 Sep 15:57:39 INFO] "auto_lambda_2": 10
[13 Sep 15:57:39 INFO] "trg_text": data/guwentrain.trg
[13 Sep 15:57:39 INFO] "checkpoint_model": checkpoint_model.npz
[13 Sep 15:57:39 INFO] "sample_length": 50
[13 Sep 15:57:39 INFO] "semi_sampleN": 10
[13 Sep 15:57:39 INFO] "alphadecay_adam": 0.998
[13 Sep 15:57:39 INFO] "save_freq": 2000
[13 Sep 15:57:39 INFO] "reconstruct_lambda": 0.1
[13 Sep 15:57:39 INFO] "n_samples": 1
[13 Sep 15:57:39 INFO] "lr": 1.0
[13 Sep 15:57:39 INFO] "save_path": models
[13 Sep 15:57:39 INFO] "dim_rec_enc": 1000
[13 Sep 15:57:39 INFO] "eps_adam": 1e-08
[13 Sep 15:57:39 INFO] "save": True
[13 Sep 15:57:39 INFO] "src_mono":
[13 Sep 15:57:39 INFO] "data_corpus": json
[13 Sep 15:57:39 INFO] "src_mono_shuf":
[13 Sep 15:57:39 INFO] "alpha_adam": 0.0005
[13 Sep 15:57:39 INFO] "valid_dir": validation
[13 Sep 15:57:39 INFO] "optimizer": adam_slowstart
[13 Sep 15:57:39 INFO] "sample_sentence":
[13 Sep 15:57:39 INFO] "dim_emb_src": 620
[13 Sep 15:57:39 INFO] "MRT": False
[13 Sep 15:57:39 INFO] "epsilon": 1e-06
[13 Sep 15:57:39 INFO] "max_iter": 1000000
[13 Sep 15:57:39 INFO] "data_vocab": cPickle
[13 Sep 15:57:39 INFO] "index_unk_src": 1
[13 Sep 15:57:39 INFO] "valid_src": data/guwenvalid.src
[13 Sep 15:57:39 INFO] "src_shuf": corpus/train.zh.json.shuf
[13 Sep 15:57:39 INFO] "trg_shuf": corpus/train.en.json.shuf
[13 Sep 15:57:39 INFO] "LenRatio": 1.5
[13 Sep 15:57:39 INFO] "rho": 0.95
[13 Sep 15:57:39 INFO] "ivocab_src": corpus/ivocab.zh.pkl
[13 Sep 15:57:39 INFO] "checkpoint_freq": 2000
[13 Sep 15:57:39 INFO] "sample_times": 1
[13 Sep 15:57:39 INFO] "trg_mono_text":
[13 Sep 15:57:39 INFO] "dim_rec_dec": 1000
[13 Sep 15:57:39 INFO] "src": corpus/train.zh.json
[13 Sep 15:57:39 INFO] "src_mono_text":
[13 Sep 15:57:39 INFO] "test_src": []
[13 Sep 15:57:39 INFO] "beta1_adam": 0.9
[13 Sep 15:57:39 INFO] "semi_learning": False
[13 Sep 15:57:39 INFO] "index_eos_trg": 30000
[13 Sep 15:57:39 INFO] "sample_freq": 100
[13 Sep 15:57:39 INFO] "valid_ref": data/guwenvalid.trg
[13 Sep 15:57:39 INFO] "verbose_level": info
[13 Sep 15:57:39 INFO] "num_vocab_trg": 30001
[13 Sep 15:57:39 INFO] "trg": corpus/train.en.json
[13 Sep 15:57:39 INFO] "ivocab_trg": corpus/ivocab.en.pkl
[13 Sep 15:57:39 INFO] "test_dir": eval
[13 Sep 15:57:39 INFO] "beam_size": 10
[13 Sep 15:57:39 INFO] "checkpoint_status": checkpoint_status.pkl
[13 Sep 15:57:39 INFO] "vocab_trg": corpus/vocab.en.pkl
[13 Sep 15:57:39 INFO] "init_model":
[13 Sep 15:57:39 INFO] "maxlength": 50
[13 Sep 15:57:39 INFO] "model": RNNsearch
[13 Sep 15:57:39 INFO] "vocab_src": corpus/vocab.zh.pkl
[13 Sep 15:57:39 INFO] "sample_num": 10
[13 Sep 15:57:39 INFO]
[13 Sep 15:57:39 INFO] STEP 2: Training
[13 Sep 15:57:39 INFO] STEP 2.1: Loading training data
[13 Sep 15:57:39 INFO] total 12 sentences
[13 Sep 15:57:39 INFO] Discarding long sentences. 12 sentences left.
[13 Sep 15:57:39 INFO] Done!
[13 Sep 15:57:39 INFO] STEP 2.2: Building model
[13 Sep 15:57:39 INFO] Initializing layers
[13 Sep 15:57:45 INFO] Building computational graph
[13 Sep 15:57:45 INFO] Done!
mod.cu(67): error: identifier "cudnnSetFilterNdDescriptor_v4" is undefined
mod.cu(16): warning: function "c_set_tensorNd" was declared but never referenced
mod.cu(60): warning: function "c_set_filterNd" was declared but never referenced
1 error detected in the compilation of "/tmp/tmpxft_00000936_00000000-9_mod.cpp1.ii".
['/usr/local/cuda-8.0/bin/nvcc', '-shared', '-O3', '-Xlinker', '-rpath,/usr/local/cuda-8.0/lib64', '-use_fast_math', '-arch=sm_61', '-m64', '-Xcompiler', '-fno-math-errno,-Wno-unused-label,-Wno-unused-variable,-Wno-write-strings,-DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden', '-Xlinker', '-rpath,/home/dekuanxu/.theano/compiledir_Linux-4.8--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/cuda_ndarray', '-I/home/dekuanxu/.theano/compiledir_Linux-4.8--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/cuda_ndarray', '-I/usr/local/cuda-8.0/include', '-I/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda', '-I/home/dekuanxu/.local/lib/python2.7/site-packages/numpy/core/include', '-I/usr/include/python2.7', '-I/usr/local/lib/python2.7/dist-packages/theano/gof', '-o', '/home/dekuanxu/.theano/compiledir_Linux-4.8--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/tmpaVEUXH/6c2407cd47903371a6adb2201001b071.so', 'mod.cu', '-L/home/dekuanxu/.theano/compiledir_Linux-4.8--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/cuda_ndarray', '-L/usr/lib', '-lcudart', '-lcublas', '-lcuda_ndarray', '-lcudnn', '-lpython2.7']
Traceback (most recent call last):
File "data/thumt/train.py", line 69, in
We are using CUDA 8.
CUDA 8.0 is fine. The problem seems to be caused by the cuDNN version. Which version of Theano are you using? We need to reproduce the error in order to fix it.
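Theano reports the cuDNN version as a single integer (the `cuDNN 6021` in the log). A minimal sketch for decoding it, assuming the encoding defined in the cuDNN 5 and 6 headers, `CUDNN_VERSION = major * 1000 + minor * 100 + patch`. Both functions are hypothetical helpers, not part of Theano or THUMT; the target major version 5 is taken from Theano's own warning in the log.

```python
# Major cuDNN version named in Theano's warning ("downgrading cuDNN to version 5").
OFFICIALLY_SUPPORTED_MAJOR = 5


def decode_cudnn_version(code):
    """Split a cuDNN version integer such as 6021 into (major, minor, patch).

    Assumes the cuDNN 5/6 header encoding:
    CUDNN_VERSION = major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def is_officially_supported(code):
    """True if the major version matches the one Theano's warning asks for."""
    major, _, _ = decode_cudnn_version(code)
    return major == OFFICIALLY_SUPPORTED_MAJOR
```

Here `decode_cudnn_version(6021)` gives `(6, 0, 21)`, i.e. cuDNN 6.0.21, one major version newer than the cuDNN 5 that Theano's warning recommends.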
Theano is 0.8.2, installed following the instructions in Section 2.2 (page 2) of your manual. Should I remove cuDNN 6021 and install cuDNN 5 to solve the problem? Our system runs on a server, and TensorFlow needs cuDNN 6021, so removing it might cause problems.
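When several cuDNN installs share one server (for example, one kept for TensorFlow), it can help to confirm which `cudnn.h` Theano's nvcc call actually sees. A rough sketch, assuming the version macros are defined in `cudnn.h` itself, as in cuDNN 5 and 6; `cudnn_header_info` is a hypothetical helper. It also reports whether `cudnnSetFilterNdDescriptor_v4`, the identifier nvcc could not find, is declared; its absence would be consistent with that deprecated `_v4` function having been dropped from the cuDNN 6 header.

```python
import re


def cudnn_header_info(text):
    """Parse the contents of a cudnn.h file.

    Returns ((major, minor, patch), declares_v4), where each version field is
    None if its #define is missing, and declares_v4 says whether the header
    declares cudnnSetFilterNdDescriptor_v4 (the symbol Theano 0.8.2's
    generated mod.cu calls).
    """
    def macro(name):
        # Matches lines such as "#define CUDNN_MAJOR      6".
        m = re.search(r"#define\s+" + name + r"\s+(\d+)", text)
        return int(m.group(1)) if m else None

    version = (macro("CUDNN_MAJOR"), macro("CUDNN_MINOR"), macro("CUDNN_PATCHLEVEL"))
    # A declaration looks like "... CUDNNWINAPI cudnnSetFilterNdDescriptor_v4(".
    declares_v4 = re.search(r"\bcudnnSetFilterNdDescriptor_v4\s*\(", text) is not None
    return version, declares_v4
```

For example, `cudnn_header_info(open("/usr/local/cuda-8.0/include/cudnn.h").read())`; that path only mirrors the `-I/usr/local/cuda-8.0/include` flag in the nvcc command above and may differ on other machines.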
I asked the server administrator, and he said the GPU error might be a hardware problem. Output of nvidia-smi:
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   40C    P0    61W / 250W |      0MiB / 12189MiB |      0%      Default |
I can't offer a reliable solution yet. Downgrading cuDNN to version 5 or upgrading Theano may work. I will leave this issue open until we or someone else finds a solution.
Is it possible to use THUMT to take part in a translation competition?