allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Training Issue in new corpus #165

Closed anupamjamatia closed 5 years ago

anupamjamatia commented 5 years ago

Hi, I am facing a problem training ELMo on a new corpus. I followed the instructions from https://github.com/allenai/bilm-tf on my system, which has a single GPU, and the process gets killed. Is there any solution to this problem?

```
(ELMO) anupam@anupam-OMEN-HP:~/Desktop/ElMo/bilm$ python bin/train_elmo.py --train_prefix='/home/anupam/Desktop/ElMo/bilm/training_file/*' --vocab_file en-bn-hi_mixed_voc_file.txt --save_dir out/
Found 1 shards at /home/anupam/Desktop/ElMo/bilm/training_file/*
Loading data from: /home/anupam/Desktop/ElMo/bilm/training_file/Training_File.txt
Loaded 667648 sentences.
Finished loading
Found 1 shards at /home/anupam/Desktop/ElMo/bilm/training_file/*
Loading data from: /home/anupam/Desktop/ElMo/bilm/training_file/Training_File.txt
Loaded 667648 sentences.
Finished loading
USING SKIP CONNECTIONS
[['global_step:0', TensorShape([])],
 ['lm/CNN/W_cnn_0:0', TensorShape([Dimension(1), Dimension(1), Dimension(16), Dimension(32)])],
 ['lm/CNN/W_cnn_1:0', TensorShape([Dimension(1), Dimension(2), Dimension(16), Dimension(32)])],
 ['lm/CNN/W_cnn_2:0', TensorShape([Dimension(1), Dimension(3), Dimension(16), Dimension(64)])],
 ['lm/CNN/W_cnn_3:0', TensorShape([Dimension(1), Dimension(4), Dimension(16), Dimension(128)])],
 ['lm/CNN/W_cnn_4:0', TensorShape([Dimension(1), Dimension(5), Dimension(16), Dimension(256)])],
 ['lm/CNN/W_cnn_5:0', TensorShape([Dimension(1), Dimension(6), Dimension(16), Dimension(512)])],
 ['lm/CNN/W_cnn_6:0', TensorShape([Dimension(1), Dimension(7), Dimension(16), Dimension(1024)])],
 ['lm/CNN/b_cnn_0:0', TensorShape([Dimension(32)])],
 ['lm/CNN/b_cnn_1:0', TensorShape([Dimension(32)])],
 ['lm/CNN/b_cnn_2:0', TensorShape([Dimension(64)])],
 ['lm/CNN/b_cnn_3:0', TensorShape([Dimension(128)])],
 ['lm/CNN/b_cnn_4:0', TensorShape([Dimension(256)])],
 ['lm/CNN/b_cnn_5:0', TensorShape([Dimension(512)])],
 ['lm/CNN/b_cnn_6:0', TensorShape([Dimension(1024)])],
 ['lm/CNN_high_0/W_carry:0', TensorShape([Dimension(2048), Dimension(2048)])],
 ['lm/CNN_high_0/W_transform:0', TensorShape([Dimension(2048), Dimension(2048)])],
 ['lm/CNN_high_0/b_carry:0', TensorShape([Dimension(2048)])],
 ['lm/CNN_high_0/b_transform:0', TensorShape([Dimension(2048)])],
 ['lm/CNN_high_1/W_carry:0', TensorShape([Dimension(2048), Dimension(2048)])],
 ['lm/CNN_high_1/W_transform:0', TensorShape([Dimension(2048), Dimension(2048)])],
 ['lm/CNN_high_1/b_carry:0', TensorShape([Dimension(2048)])],
 ['lm/CNN_high_1/b_transform:0', TensorShape([Dimension(2048)])],
 ['lm/CNN_proj/W_proj:0', TensorShape([Dimension(2048), Dimension(512)])],
 ['lm/CNN_proj/b_proj:0', TensorShape([Dimension(512)])],
 ['lm/RNN_0/rnn/multi_rnn_cell/cell_0/lstm_cell/bias:0', TensorShape([Dimension(16384)])],
 ['lm/RNN_0/rnn/multi_rnn_cell/cell_0/lstm_cell/kernel:0', TensorShape([Dimension(1024), Dimension(16384)])],
 ['lm/RNN_0/rnn/multi_rnn_cell/cell_0/lstm_cell/projection/kernel:0', TensorShape([Dimension(4096), Dimension(512)])],
 ['lm/RNN_0/rnn/multi_rnn_cell/cell_1/lstm_cell/bias:0', TensorShape([Dimension(16384)])],
 ['lm/RNN_0/rnn/multi_rnn_cell/cell_1/lstm_cell/kernel:0', TensorShape([Dimension(1024), Dimension(16384)])],
 ['lm/RNN_0/rnn/multi_rnn_cell/cell_1/lstm_cell/projection/kernel:0', TensorShape([Dimension(4096), Dimension(512)])],
 ['lm/RNN_1/rnn/multi_rnn_cell/cell_0/lstm_cell/bias:0', TensorShape([Dimension(16384)])],
 ['lm/RNN_1/rnn/multi_rnn_cell/cell_0/lstm_cell/kernel:0', TensorShape([Dimension(1024), Dimension(16384)])],
 ['lm/RNN_1/rnn/multi_rnn_cell/cell_0/lstm_cell/projection/kernel:0', TensorShape([Dimension(4096), Dimension(512)])],
 ['lm/RNN_1/rnn/multi_rnn_cell/cell_1/lstm_cell/bias:0', TensorShape([Dimension(16384)])],
 ['lm/RNN_1/rnn/multi_rnn_cell/cell_1/lstm_cell/kernel:0', TensorShape([Dimension(1024), Dimension(16384)])],
 ['lm/RNN_1/rnn/multi_rnn_cell/cell_1/lstm_cell/projection/kernel:0', TensorShape([Dimension(4096), Dimension(512)])],
 ['lm/char_embed:0', TensorShape([Dimension(261), Dimension(16)])],
 ['lm/softmax/W:0', TensorShape([Dimension(1525043), Dimension(512)])],
 ['lm/softmax/b:0', TensorShape([Dimension(1525043)])],
 ['train_perplexity:0', TensorShape([])]]
WARNING:tensorflow:From /home/anupam/.local/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py:170: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
2019-02-08 17:07:46.692619: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-02-08 17:07:46.692636: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-02-08 17:07:46.692641: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-02-08 17:07:46.692645: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-02-08 17:07:46.692649: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Killed
(ELMO) anupam@anupam-OMEN-HP:~/Desktop/ElMo/bilm$
```

The GPU status can be viewed here:

```
anupam@anupam-OMEN-HP:~$ nvidia-smi
Fri Feb  8 18:24:06 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   55C    P3    23W /  N/A |    966MiB /  8117MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1665      G   /usr/lib/xorg/Xorg                           837MiB |
|    0      3429      G   compiz                                        50MiB |
|    0      4058      G   ...quest-channel-token=8843362536060243018    55MiB |
|    0      4726      G   ...quest-channel-token=6712252360503423866    13MiB |
|    0      7177      G   /usr/lib/thunderbird/thunderbird               2MiB |
|    0      9805      G   ...-token=82F8B3D416718D1A9486CF4518D0A1FF     4MiB |
+-----------------------------------------------------------------------------+
```

hichiaty commented 5 years ago

You are using CUDA 10.0, which means you are running a TensorFlow version other than 1.2; TensorFlow 1.2 is only compatible with CUDA 8.0.
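If it helps, here is a minimal sketch (TF 1.x API, assuming you run it inside the same ELMO environment) to confirm which TensorFlow build actually gets imported and whether it can see your GPU:

```python
# Minimal sanity check of the active TensorFlow install (TF 1.x API).
import tensorflow as tf

print("TensorFlow version:", tf.__version__)            # bilm-tf is tested against TF 1.2
print("Built with CUDA:", tf.test.is_built_with_cuda())
# If this prints False, the training script will run on CPU only
# (or fail), regardless of what nvidia-smi shows.
print("GPU available:", tf.test.is_gpu_available())
```

If the version printed is not 1.2.x, a clean environment with `tensorflow-gpu==1.2` and the matching CUDA 8.0 toolkit is what the repo expects.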

anupamjamatia commented 5 years ago

So is it not possible to run the code on my system, which has the following configuration?

```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
```

and

```
$ pip list
Package              Version
-------------------- ----------
absl-py              0.6.1
astor                0.7.1
backcall             0.1.0
backports.weakref    1.0rc1
bilm                 0.1.post5
bleach               1.5.0
certifi              2018.11.29
cycler               0.10.0
decorator            4.3.0
entrypoints          0.2.3
enum34               1.1.6
gast                 0.2.0
grpcio               1.17.1
h5py                 2.9.0
html5lib             0.9999999
ipykernel            5.1.0
ipython              7.2.0
ipython-genutils     0.2.0
ipywidgets           7.4.2
jedi                 0.13.2
Jinja2               2.10
jsonschema           2.6.0
jupyter              1.0.0
jupyter-client       5.2.4
jupyter-console      6.0.0
jupyter-core         4.4.0
Keras                2.2.4
Keras-Applications   1.0.7
keras-metrics        0.0.5
Keras-Preprocessing  1.0.5
kiwisolver           1.0.1
Markdown             2.2.0
MarkupSafe           1.1.0
matplotlib           3.0.2
mistune              0.8.4
mkl-fft              1.0.6
mkl-random           1.0.2
mock                 2.0.0
nbconvert            5.3.1
nbformat             4.4.0
nltk                 3.4
notebook             5.7.4
numpy                1.16.1
pandas               0.23.4
pandas-ml            0.5.0
pandocfilters        1.4.2
parso                0.3.1
pbr                  5.1.1
pexpect              4.6.0
pickleshare          0.7.5
pip                  19.0.1
prometheus-client    0.5.0
prompt-toolkit       2.0.7
protobuf             3.6.1
ptyprocess           0.6.0
pydot-ng             2.0.0
Pygments             2.3.1
pyparsing            2.3.1
Pyphen               0.9.5
python-dateutil      2.7.5
pytz                 2018.7
PyYAML               3.13
pyzmq                17.1.2
qtconsole            4.4.3
scikit-learn         0.20.2
scipy                1.2.0
seaborn              0.9.0
Send2Trash           1.5.0
setuptools           40.6.3
singledispatch       3.4.0.3
six                  1.12.0
sklearn              0.0
tensorboard          1.12.2
tensorflow           1.12.0
tensorflow-gpu       1.2.0
termcolor            1.1.0
terminado            0.8.1
testpath             0.4.2
textblob             0.15.2
Theano               1.0.3
tornado              5.1.1
traitlets            4.3.2
wcwidth              0.1.7
webencodings         0.5.1
Werkzeug             0.14.1
wheel                0.32.3
widgetsnbextension   3.4.2

You are using pip version 19.0.1, however version 19.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
```