deepgram / kur

Descriptive Deep Learning
Apache License 2.0

`lstm` produces ValueError: Model loss is NaN. #31

Closed. EmbraceLife closed this issue 7 years ago.

EmbraceLife commented 7 years ago

Hi, the code to reproduce the error is found here.

This notebook contains everything you need to replicate the error I am facing.

Thanks

ajsyp commented 7 years ago

Would you be able to simply post the Kurfile itself here (rather than an IPython notebook)? A minimal working dataset would be great, too; rather than my needing to run the notebook (and risk ending up with a different dataset than you), can you just post that here as well? Not the code, just the dataset.

I would expect the submission to look something like this (where the data entry is really short, just one or two data items):

When I use this Kurfile:

settings:
  # ...
train:
  # ...

with this data given to the JSONL supplier:

{'input_seq' : [1, 2, 3, ...], ...}
{'input_seq' : [4, 5, 6, ...], ...}

then I get NaN all the time.

EmbraceLife commented 7 years ago

If I understand your request correctly:

Dataset

I tried using a smaller dataset of just a few lines of text, but the model does not seem to work on text that short. So the easiest way to replicate the error is still to use the same dataset stored in the kur GitHub repository.

Preprocess the dataset and save it to a JSONL file (one sample row shown below)

{'out_char': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'in_seq': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]}

Kurfile

---

settings:

  vocab:
    size: 30

  rnn:                              
    size: 128        
    depth: 3

model:
  - input: in_seq

  - for:
      range: "{{ rnn.depth - 1 }}"
      iterate:
        - recurrent:
            size: "{{ rnn.size }}"
            type: lstm                                     # only difference from original
            sequence: yes
            bidirectional: no
        - batch_normalization

  - recurrent:
      size: "{{ rnn.size }}"
      type: lstm                                          # only difference from original
      sequence: no
      bidirectional: no

  - dense: "{{ vocab.size }}"

  - activation: softmax

  - output: out_char

loss:
  - target: out_char
    name: categorical_crossentropy

train:
  data:
    - jsonl: ../data/train.jsonl
  epochs: 5
  weights:
    initial: best.w.kur
    last: last.w.kur

  log: log

validate:
  data:
    - jsonl: ../data/validate.jsonl
  weights: best.w.kur

test:
  data:
    - jsonl: ../data/test.jsonl
  weights: best.w.kur

evaluate:
  data:
    - jsonl: ../data/evaluate.jsonl
  weights: best.w.kur

  destination: output.pkl

Error I always get

Epoch 1/5, loss=10.101:   0%| | 32/13300 [00:00<02:19, 95.00samples/s][ERROR 2017-03-07 21:49:15,882 kur.model.executor:647] Received NaN loss value for model output "out_char". Make sure that your inputs are all normalized and that the learning rate is not too high. Sometimes different algorithms/implementations work better than others, so you can try switching optimizers or backend.
Epoch 1/5, loss=nan:   0%| | 64/13300 [00:00<01:29, 147.18samples/s]
[ERROR 2017-03-07 21:49:15,882 kur.model.executor:227] Exception raised during training.
Traceback (most recent call last):
  File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 224, in train
    **kwargs
  File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 648, in wrapped_train
    raise ValueError('Model loss is NaN.')
ValueError: Model loss is NaN.
[INFO 2017-03-07 21:49:15,883 kur.model.executor:235] Saving most recent weights: last.w.kur
Traceback (most recent call last):
  File "/Users/Natsume/miniconda2/envs/dlnd-tf-lab/bin/kur", line 11, in <module>
    load_entry_point('kur', 'console_scripts', 'kur')()
  File "/Users/Natsume/Downloads/kur/kur/__main__.py", line 382, in main
    sys.exit(args.func(args) or 0)
  File "/Users/Natsume/Downloads/kur/kur/__main__.py", line 62, in train
    func(step=args.step)
  File "/Users/Natsume/Downloads/kur/kur/kurfile.py", line 371, in func
    return trainer.train(**defaults)
  File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 224, in train
    **kwargs
  File "/Users/Natsume/Downloads/kur/kur/model/executor.py", line 648, in wrapped_train
    raise ValueError('Model loss is NaN.')
ValueError: Model loss is NaN.
EmbraceLife commented 7 years ago

Using the code above, if I set the backend in settings as follows:

settings: 
  backend: 
    name: keras
    backend: tensorflow

The error reported above disappears and training proceeds, but I do get the following warnings saying TensorFlow was not compiled with certain CPU instructions (I remember @ajsyp once told me to install TensorFlow from source to deal with the TensorFlow compile issue; I will try that later):

(dlnd-tf-lab)  ->kur -v train kurfile.yaml
[INFO 2017-03-08 11:56:56,939 kur.kurfile:699] Parsing source: kurfile.yaml, included by top-level.
[INFO 2017-03-08 11:56:56,952 kur.kurfile:82] Parsing Kurfile...
[INFO 2017-03-08 11:56:56,970 kur.loggers.binary_logger:71] Loading log data: log
[INFO 2017-03-08 11:57:00,218 kur.backend.backend:80] Creating backend: keras
[INFO 2017-03-08 11:57:00,218 kur.backend.backend:83] Backend variants: none
[INFO 2017-03-08 11:57:00,218 kur.backend.keras_backend:81] The tensorflow backend for Keras has been requested.
[INFO 2017-03-08 11:57:01,184 kur.backend.keras_backend:195] Keras is loaded. The backend is: tensorflow
[INFO 2017-03-08 11:57:01,184 kur.model.model:260] Enumerating the model containers.
[INFO 2017-03-08 11:57:01,185 kur.model.model:265] Assembling the model dependency graph.
[INFO 2017-03-08 11:57:01,185 kur.model.model:280] Connecting the model graph.
[INFO 2017-03-08 11:57:02,200 kur.model.model:284] Model inputs:  in_seq
[INFO 2017-03-08 11:57:02,200 kur.model.model:285] Model outputs: out_char
[INFO 2017-03-08 11:57:02,200 kur.kurfile:357] Ignoring missing initial weights: best.w.kur. If this is undesireable, set "must_exist" to "yes" in the approriate "weights" section.
[INFO 2017-03-08 11:57:02,200 kur.model.executor:315] No historical training loss available from logs.
[INFO 2017-03-08 11:57:02,200 kur.model.executor:323] No historical validation loss available from logs.
[INFO 2017-03-08 11:57:02,200 kur.model.executor:329] No previous epochs.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
[INFO 2017-03-08 11:57:10,026 kur.backend.keras_backend:666] Waiting for model to finish compiling...

Epoch 1/5, loss=5.123:  26%|████▋             | 3488/13300 [00:29<01:23, 118.16samples/s]^C
EmbraceLife commented 7 years ago

Changing the backend removes both the error and the compiling issue

I tried the latest development version with PyTorch as the backend. It is faster, and the error reported above no longer appears when using lstm.

To change the backend from TensorFlow to PyTorch, use the settings below:

settings: 
  backend: pytorch

There is no error and no TensorFlow compiling issue at all.

Installing TensorFlow from source

Installing TensorFlow from source looks complicated. Is there an easier solution to the TensorFlow compiling issue?

ajsyp commented 7 years ago

I have also experienced worse numerical stability with Theano than with the other backends. Once the PyTorch backend is more stable and better tested, it may become the default backend. So I'm glad that switching away from Theano helps with your problem.

Also, I have installed TensorFlow from source before, but I have not found it an enjoyable or simple process (frankly, bazel doesn't seem like a great build tool, in my experience). My recommendation: unless you are really ready to dive into building from source, just use pip install tensorflow or pip install tensorflow-gpu (for GPU capabilities) and live with the warnings. After all, if you truly need a big speed boost, a GPU will make a far bigger difference than the extra CPU optimizations.