Closed: EmbraceLife closed this issue 7 years ago.
Would you be able to simply post the Kurfile itself here (rather than an iPython notebook)? A minimal working dataset would be great too, but rather than needing to run the notebook (and risk producing a different dataset than yours), can you just post the dataset here as well? Not the code, just the data.
I would expect the submission to look something like this (where the data entry is really short, just one or two data items):
When I use this Kurfile:
settings:
  # ...
train:
  # ...

with this data given to the JSONL supplier:

{"input_seq": [1, 2, 3, ...], ...}
{"input_seq": [4, 5, 6, ...], ...}
then I get NaN all the time.
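For anyone preparing such a report, a fixture this small can be written as JSONL with just the standard library. A minimal sketch (the file name and field values here are illustrative, not from the actual dataset):

```python
import json

# Hypothetical two-sample fixture in the shape the JSONL supplier expects:
# one JSON object per line.
samples = [
    {'input_seq': [1, 2, 3]},
    {'input_seq': [4, 5, 6]},
]

with open('train.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')
```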
If I understand your request correctly:
I tried using a smaller dataset (just a few lines of text), but the model does not seem to work on such short text. So the easiest way to replicate the error is still to use the same dataset stored in the kur GitHub repo, under examples/language-model/.
In make_data.py, set dev = True to speed everything up, then run python make_data.py to save the data into a JSONL file. A sample line:

{'out_char': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'in_seq': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]}
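For context, here is a minimal sketch of how such one-hot rows can be produced. This is my illustration of the data shape only, not the actual make_data.py code; the vocabulary size of 30 matches vocab.size in the Kurfile below.

```python
import json

VOCAB_SIZE = 30  # matches vocab.size in the Kurfile

def one_hot(index, size=VOCAB_SIZE):
    """Return a one-hot list of length `size` with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# One training sample: a window of 30 input characters (each one-hot encoded)
# and the next character as the target. Indices here are arbitrary examples.
sample = {
    'in_seq': [one_hot(i % VOCAB_SIZE) for i in range(30)],
    'out_char': one_hot(8),
}

# Each line of the JSONL file is one such dict, serialized as JSON.
line = json.dumps(sample)
```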
Then, from the language-model/model/ directory, run kur -v train kurfile.yaml. The Kurfile is:

---
settings:
  vocab:
    size: 30
  rnn:
    size: 128
    depth: 3

model:
  - input: in_seq
  - for:
      range: "{{ rnn.depth - 1 }}"
      iterate:
        - recurrent:
            size: "{{ rnn.size }}"
            type: lstm          # only difference from original
            sequence: yes
            bidirectional: no
        - batch_normalization
  - recurrent:
      size: "{{ rnn.size }}"
      type: lstm                # only difference from original
      sequence: no
      bidirectional: no
  - dense: "{{ vocab.size }}"
  - activation: softmax
  - output: out_char

loss:
  - target: out_char
    name: categorical_crossentropy

train:
  data:
    - jsonl: ../data/train.jsonl
  epochs: 5
  weights:
    initial: best.w.kur
    last: last.w.kur
  log: log

validate:
  data:
    - jsonl: ../data/validate.jsonl
  weights: best.w.kur

test:
  data:
    - jsonl: ../data/test.jsonl
  weights: best.w.kur

evaluate:
  data:
    - jsonl: ../data/evaluate.jsonl
  weights: best.w.kur
  destination: output.pkl
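To make the architecture concrete, here is a small Python sketch of how I read the for block's unrolling (this is an assumed interpretation, not Kur internals, and the tuples are illustrative only): depth - 1 sequence-returning LSTMs each followed by batch normalization, then one last-step LSTM, a dense layer, and softmax.

```python
# Assumed unrolling of the Kurfile's "for" loop; tuples stand in for layers.
rnn = {'size': 128, 'depth': 3}
vocab = {'size': 30}

layers = []
for _ in range(rnn['depth'] - 1):          # range: "{{ rnn.depth - 1 }}"
    layers.append(('lstm', rnn['size'], {'sequence': True}))
    layers.append(('batch_normalization',))
layers.append(('lstm', rnn['size'], {'sequence': False}))  # last time step only
layers.append(('dense', vocab['size']))
layers.append(('activation', 'softmax'))

print(len(layers))  # 7 layers for depth 3
```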
Epoch 1/5, loss=10.101: 0%| | 32/13300 [00:00<02:19, 95.00samples/s][ERROR 2017-03-07 21:49:15,882 kur.model.executor:647] Received NaN loss value for model output "out_char". Make sure that your inputs are all normalized and that the learning rate is not too high. Sometimes different algorithms/implementations work better than others, so you can try switching optimizers or backend.
Epoch 1/5, loss=nan: 0%| | 64/13300 [00:00<01:29, 147.18samples/s]
[ERROR 2017-03-07 21:49:15,882 kur.model.executor:227] Exception raised during training.
Traceback (most recent call last):
File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 224, in train
**kwargs
File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 648, in wrapped_train
raise ValueError('Model loss is NaN.')
ValueError: Model loss is NaN.
[INFO 2017-03-07 21:49:15,883 kur.model.executor:235] Saving most recent weights: last.w.kur
Traceback (most recent call last):
File "/Users/Natsume/miniconda2/envs/dlnd-tf-lab/bin/kur", line 11, in <module>
load_entry_point('kur', 'console_scripts', 'kur')()
File "/Users/Natsume/Downloads/kur/kur/__main__.py", line 382, in main
sys.exit(args.func(args) or 0)
File "/Users/Natsume/Downloads/kur/kur/__main__.py", line 62, in train
func(step=args.step)
File "/Users/Natsume/Downloads/kur/kur/kurfile.py", line 371, in func
return trainer.train(**defaults)
File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 224, in train
**kwargs
File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 648, in wrapped_train
raise ValueError('Model loss is NaN.')
ValueError: Model loss is NaN.
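The failing check in kur's executor is essentially a NaN guard on each batch's loss. A minimal sketch of the same idea (the function name is mine, not kur's):

```python
import math

def guard_loss(loss_value):
    """Abort training as soon as a batch loss turns NaN, mirroring the
    ValueError raised in kur/model/executor.py. Illustrative only."""
    if math.isnan(loss_value):
        raise ValueError('Model loss is NaN.')
    return loss_value

guard_loss(10.101)          # first batch above: passes
# guard_loss(float('nan'))  # second batch above: raises ValueError
```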
Using the Kurfile above, if I set the backend in settings as follows:
settings:
  backend:
    name: keras
    backend: tensorflow
then the error reported above disappears and the model trains, but I do get the following warnings saying TensorFlow was not compiled with certain CPU instructions (I remember @ajsyp once told me to install TensorFlow from source to deal with this compile issue; I will try it later):
(dlnd-tf-lab) ->kur -v train kurfile.yaml
[INFO 2017-03-08 11:56:56,939 kur.kurfile:699] Parsing source: kurfile.yaml, included by top-level.
[INFO 2017-03-08 11:56:56,952 kur.kurfile:82] Parsing Kurfile...
[INFO 2017-03-08 11:56:56,970 kur.loggers.binary_logger:71] Loading log data: log
[INFO 2017-03-08 11:57:00,218 kur.backend.backend:80] Creating backend: keras
[INFO 2017-03-08 11:57:00,218 kur.backend.backend:83] Backend variants: none
[INFO 2017-03-08 11:57:00,218 kur.backend.keras_backend:81] The tensorflow backend for Keras has been requested.
[INFO 2017-03-08 11:57:01,184 kur.backend.keras_backend:195] Keras is loaded. The backend is: tensorflow
[INFO 2017-03-08 11:57:01,184 kur.model.model:260] Enumerating the model containers.
[INFO 2017-03-08 11:57:01,185 kur.model.model:265] Assembling the model dependency graph.
[INFO 2017-03-08 11:57:01,185 kur.model.model:280] Connecting the model graph.
[INFO 2017-03-08 11:57:02,200 kur.model.model:284] Model inputs: in_seq
[INFO 2017-03-08 11:57:02,200 kur.model.model:285] Model outputs: out_char
[INFO 2017-03-08 11:57:02,200 kur.kurfile:357] Ignoring missing initial weights: best.w.kur. If this is undesireable, set "must_exist" to "yes" in the approriate "weights" section.
[INFO 2017-03-08 11:57:02,200 kur.model.executor:315] No historical training loss available from logs.
[INFO 2017-03-08 11:57:02,200 kur.model.executor:323] No historical validation loss available from logs.
[INFO 2017-03-08 11:57:02,200 kur.model.executor:329] No previous epochs.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
[INFO 2017-03-08 11:57:10,026 kur.backend.keras_backend:666] Waiting for model to finish compiling...
Epoch 1/5, loss=5.123: 26%|████▋ | 3488/13300 [00:29<01:23, 118.16samples/s]^C
I tried the latest development version with PyTorch as the backend. It is faster, and the error reported above does not occur when using lstm.
To change the backend from TensorFlow to PyTorch, use:

settings:
  backend: pytorch

No error and no TensorFlow compilation warnings at all.
Installing TensorFlow from source looks complicated; is there an easier solution to the compilation warnings?
I have also experienced worse numerical stability with Theano than with the other backends. Once the PyTorch backend is more stable and better tested, it may become the default backend. So I'm glad that switching away from Theano helps with your problem.
Also, I have installed TensorFlow from source before, but I did not find it an enjoyable or simple process (frankly, Bazel doesn't seem like a great build tool, in my experience). My recommendation: unless you really are ready to dive into building from source, just use pip install tensorflow or pip install tensorflow-gpu (for GPU capabilities) and live with the warnings. After all, if you truly need a big speed boost, a GPU will be worlds different from the extra CPU optimization.
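If the warnings themselves are the annoyance, they can also simply be silenced: TensorFlow's C++ logging is controlled by the TF_CPP_MIN_LOG_LEVEL environment variable (0 shows everything, 1 filters INFO, 2 also filters W-level warnings like the "wasn't compiled to use SSE4.1/AVX/FMA" lines), provided it is set before TensorFlow is first imported. A minimal sketch:

```python
import os

# Must be set before TensorFlow is first imported;
# '2' filters both INFO and WARNING messages from the C++ runtime.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# import tensorflow as tf  # imported afterwards, starts up quietly
```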
Hi, the code to reproduce the error is found here. This notebook contains everything you need to replicate the error I am facing.
Thanks!