deepgram / kur

Descriptive Deep Learning
Apache License 2.0

Error when training on custom data #99

Closed hisoyeah closed 5 years ago

hisoyeah commented 6 years ago

Hello,

I'm trying to train on my custom data, but I'm getting a weird error.

2018-08-30 17:25:46.106572: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at ctc_loss_op.cc:166 : Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 29 labels: [ERROR 2018-08-30 17:25:46,434 kur.model.executor:352] Exception raised during training.

Attached you will find my label files. My speech.yml is the same as the one in the examples.

Thanks a lot for your help.

hisoyeah commented 6 years ago

Here are the links to my labels (train and test):

train: https://framadrop.org/r/x7PT4ZWWDa#qScNABNoCAOPMwcaG66Z2ntcXSHzYPq+V7SOIZRoy3c=
test: https://framadrop.org/r/NrsDfJLonQ#PZ9WcSCgKd354wfmXL+HwX4gruo5yT5aflwNeoeQ4cA=

noajshu commented 5 years ago

I am having the same issue with Kur. In particular, I have exported the SCOTUS-speech corpus to Kur format, and I get this error when I try to train with the standard speech.yml file.

noajshu commented 5 years ago

Here is a preview of my corpus directory:

$ ls scotusspeech-wav/audio/
0002dc2e-fb42-4055-8716-30884da750bf.wav    4e0c359b-bda0-4ac5-8ba9-2cb191026871.wav    ae446f59-a989-4159-8fae-725834c90007.wav
05c966f8-cf4d-4b35-878e-e9318950d0f3.wav    505d045c-20b5-41dd-8658-b1beed223688.wav    ...
$ head scotusspeech-wav/scotusspeech-test.jsonl 
{"uuid": "0e4f2165-19e3-4488-98c3-54c6c9d6e77a", "duration_s": 6.52, "text": "we will hear argument first this morning in case 105400 tapia v"}
{"uuid": "384c69c7-bebf-479c-a447-8f123419e633", "duration_s": 2.28, "text": "united states mr cahn"}
{"uuid": "ad923318-f48a-41bd-8a6d-8169c2f5c7c8", "duration_s": 22.92, "text": "mr chief justice and may it please the court when it instructed courts to recognize that imprisonment is not an appropriate means of promoting correction and rehabilitation congress intended to end the practice of sending defendants to prison so that they might get treatment"}
{"uuid": "13cb75be-4f23-472f-a264-ef3cd582eda8", "duration_s": 9.76, "text": "the commands of 3582 are clear on this point do not imprison and do not lengthen prison sentences for the purposes of rehabilitation"}
{"uuid": "fe815d5d-52a7-4b6b-8cab-baf73f717a8b", "duration_s": 4.52, "text": "this plain meaning is confirmed by the structure of the statute"}
{"uuid": "e5e8594d-1589-49e4-a86d-6e1e97040822", "duration_s": 6.36, "text": "under the statute judges have the power to sentence defendants to prison but not to prison programs"}
{"uuid": "35ae41ef-b799-4f48-afbe-75d2017b4a67", "duration_s": 7.6, "text": "judges once had that power under the youth corrections act and under the narcotic addicts rehabilitation act"}
{"uuid": "43fc133f-1eab-4beb-9c9a-2bba41c39481", "duration_s": 3.64, "text": "with the sentencing reform act congress took that power away"}
{"uuid": "4d663259-4098-4195-bbdf-099f1df3d1c2", "duration_s": 9.08, "text": "that structure makes sense only because congress intended that defendants should no longer be sent to prison for purposes of rehabilitation"}
{"uuid": "562baf9b-67c4-4d42-9651-2831026f049d", "duration_s": 2.72, "text": "you have in effect a oneway ratchet"}
Noahs-MacBook:speechrec noaj$ soxi scotusspeech-wav/audio/0e4f2165-19e3-4488-98c3-54c6c9d6e77a.wav 

Input File     : 'scotusspeech-wav/audio/0e4f2165-19e3-4488-98c3-54c6c9d6e77a.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:06.52 = 104320 samples ~ 489 CDDA sectors
File Size      : 209k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
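
For reference, here is a quick, illustrative check (not part of Kur) that lists every distinct character appearing in these transcripts, using the jsonl path shown above:

# Illustrative check: collect the distinct characters used in the transcript texts.
import json

chars = set()
with open('scotusspeech-wav/scotusspeech-test.jsonl') as f:
    for line in f:
        chars.update(json.loads(line)['text'])

print(sorted(chars))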
noajshu commented 5 years ago

Here is my entire Kurfile:

---

###############################################################################
# Pro-tips:
#
# - Use YAML's anchors! This lets you define a value (even a dictionary), and
#   reuse, or even change, parts of it later. YAML anchors are incredible for
#   defining constant values that you want to reuse all over the place. You can
#   define an anchor like this:
#     KEY: &my_anchor
#   Note that the value of KEY can be anything. All of these are allowed:
#     KEY: &my_anchor "my value"
#     KEY: &my_anchor
#       A: B
#       C: D
#     KEY: &my_anchor [1, 2, 3]
#   You can then refer back to your anchors like this:
#     ANOTHER_KEY: *my_anchor
#   This sets the value of ANOTHER_KEY to be the same thing as the original
#   KEY. Now let's say that your anchor is a dictionary, but you want to refer
#   to it with modified values later. Try this:
#     KEY: &my_anchor
#       FIRST: VALUE_1
#       SECOND: VALUE_2
#     ANOTHER_KEY:
#       <<: *my_anchor
#       SECOND: VALUE_2_NEW
#       THIRD: VALUE_3
#     MORE_KEY: *my_anchor
#   These are 100% equivalent to this more verbose structure:
#     KEY:
#       FIRST: VALUE_1
#       SECOND: VALUE_2
#     ANOTHER_KEY:
#       FIRST: VALUE_1
#       SECOND: VALUE_2_NEW
#       THIRD: VALUE_3
#     MORE_KEY:
#       FIRST: VALUE_1
#       SECOND: VALUE_2
#
# - Use the Jinja2 engine! It is really powerful, and it's most appropriately
#   used to do on-the-fly interpretation/evaluation of values in the "model"
#   section of the Kurfile.
#
# - So how do you know when to use YAML anchors as opposed to Jinja2
#   expressions? Here are some tips.
#
#   YAML anchors only work within a single YAML file, and are evaluated the
#   moment the file is loaded. This means you can't use YAML anchors from a
#   JSON Kurfile, and you can't reference anchors in other Kurfiles.
#
#   Jinja2 is interpreted after all Kurfiles are loaded, which means that
#   many different Kurfiles can share variables via Jinja2. Jinja2
#   expressions can also be used in JSON Kurfiles.
#
#   It's almost like YAML anchors are "compile-time constants" but Jinja2
#   expressions are interpreted at run-time. As a result, the value of a
#   Jinja2 expression could be different at different points in the
#   Kurfile (e.g., if you use Jinja2 to reference the previous layer in a
#   model, obviously the interpretation/value of "previous layer" resolves
#   to something different for the second layer in the model as compared to the
#   fifth layer in the model).

###############################################################################
settings:

  # Deep learning model
  cnn:
    kernels: 1000
    size: 11
    stride: 2
  rnn:
    size: 1000
    depth: 3
  vocab:
    # Needed for CTC
    size: 28

  # Setting up the backend.
  backend:
    name: keras
    backend: tensorflow

  # Batch sizes
  provider: &provider
    batch_size: 16
    force_batch_size: yes

  # Where to put the data.
  data: &data
    #path: "~/projects/speechrec/lsdc-test/"
    path: "../../scotusspeech-wav/"
    type: spec
    max_duration: 50
    max_frequency: 8000
    normalization: norm.yml

  # Where to put the weights
  weights: &weights weights

###############################################################################
model:

  # This is Baidu's DeepSpeech model:
  #   https://arxiv.org/abs/1412.5567
  # Kur makes prototyping different versions of it incredibly easy.

  # The model input is audio data (called utterances).
  - input: utterance

  # One-dimensional, variable-size convolutional layers to extract more
  # efficient representation of the data.
  - convolution:
      kernels: "{{ cnn.kernels }}"
      size: "{{ cnn.size }}"
      strides: "{{ cnn.stride }}"
      border: valid
  - activation: relu
  - batch_normalization

  # A series of recurrent layers to learn temporal sequences.
  - for:
      range: "{{ rnn.depth }}"
      iterate:
        - recurrent:
            size: "{{ rnn.size }}"
            sequence: yes
        - batch_normalization

  # A dense layer to get everything into the right output shape.
  - parallel:
      apply:
        - dense: "{{ vocab.size + 1 }}"
  - activation: softmax

  # The output is the transcription.
  - output: asr

###############################################################################
train:

  data:
    # A "speech_recognition" data supplier will create these data sources:
    #   utterance, utterance_length, transcript, transcript_length, duration
    - speech_recognition:
        <<: *data
        # url: "https://kur.deepgram.com/data/lsdc-train.tar.gz"
        # checksum: >-
        #   fc414bccf4de3964f895eaa9d0e245ea28810a94be3079b55505cf0eb1644f94

  weights: *weights
  provider:
    <<: *provider
    sortagrad: duration

  log: log

  optimizer:
    name: sgd
    nesterov: yes
    learning_rate: 2e-4
    momentum: 0.9
    clip:
      norm: 100

###############################################################################
validate: &validate
  data:
    - speech_recognition:
        <<: *data
        # path: "~/projects/speechrec/"
        # url: "https://kur.deepgram.com/data/lsdc-test.tar.gz"
        # checksum: >-
        #   e1c8cf9cd57e8c1ae952b6e4e40dcb5c8e3932c81ecd52c090e4a05c8ebbea2b

  weights: *weights
  provider: *provider

  hooks:
    - transcript

###############################################################################
test: *validate

###############################################################################
evaluate:
  <<: *validate
  provider:
    <<: *provider
    force_batch_size: no

###############################################################################
loss:
  - name: ctc
    # The model's output (its best-guess transcript).
    target: asr
    # How long the corresponding audio utterance is.
    input_length: utterance_length
    relative_to: utterance
    # How long the ground-truth transcript is.
    output_length: transcript_length
    # The ground-truth transcript itself.
    output: transcript

...
noajshu commented 5 years ago

Here is the output I get:

[WARNING 2019-01-06 14:22:03,840 kur.supplier.speechrec:465] Inferring vocabulary from data set.
[WARNING 2019-01-06 14:22:31,127 kur.supplier.speechrec:465] Inferring vocabulary from data set.
2019-01-06 14:24:09.074572: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at ctc_loss_op.cc:168 : Invalid argument: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 29 labels: 
[ERROR 2019-01-06 14:24:11,046 kur.model.executor:352] Exception raised during training.
Traceback (most recent call last):
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/model/executor.py", line 349, in train
    **kwargs
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/model/executor.py", line 784, in wrapped_train
    self.compile('train', with_provider=provider)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/model/executor.py", line 117, in compile
    **kwargs
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/backend/keras_backend.py", line 693, in compile
    self.wait_for_compile(model, key)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/backend/keras_backend.py", line 723, in wait_for_compile
    self.run_batch(model, batch, key, False)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/backend/keras_backend.py", line 766, in run_batch
    outputs = compiled['func'](inputs)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 29 labels: 
     [[{{node CTCLoss}} = CTCLoss[_class=["loc:@gradients/CTCLoss_grad/mul"], ctc_merge_repeated=true, ignore_longer_outputs_than_inputs=false, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Log, ToInt64, GatherNd, Squeeze_1)]]
Traceback (most recent call last):
  File "/Users/noaj/.virtualenvs/kur/bin/kur", line 11, in <module>
    sys.exit(main())
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/__main__.py", line 494, in main
    sys.exit(args.func(args) or 0)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/__main__.py", line 65, in train
    func(step=args.step)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/kurfile.py", line 434, in func
    return trainer.train(**defaults)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/model/executor.py", line 349, in train
    **kwargs
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/model/executor.py", line 784, in wrapped_train
    self.compile('train', with_provider=provider)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/model/executor.py", line 117, in compile
    **kwargs
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/backend/keras_backend.py", line 693, in compile
    self.wait_for_compile(model, key)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/backend/keras_backend.py", line 723, in wait_for_compile
    self.run_batch(model, batch, key, False)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/kur/backend/keras_backend.py", line 766, in run_batch
    outputs = compiled['func'](inputs)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/Users/noaj/.virtualenvs/kur/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 0 num_classes: 29 labels: 
     [[{{node CTCLoss}} = CTCLoss[_class=["loc:@gradients/CTCLoss_grad/mul"], ctc_merge_repeated=true, ignore_longer_outputs_than_inputs=false, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Log, ToInt64, GatherNd, Squeeze_1)]]
noajshu commented 5 years ago

If I change the settings.data.path to path: "../../lsdc-test/" and train, everything works fine:

$ kur train speech.yml 
[WARNING 2019-01-06 14:29:24,288 kur.supplier.speechrec:465] Inferring vocabulary from data set.
[WARNING 2019-01-06 14:29:52,396 kur.supplier.speechrec:465] Inferring vocabulary from data set.

Epoch 1/inf, loss=270.119:  12%|██████████▍                                                                             | 32/271 [00:35<04:23,  1.10s/samples]
noajshu commented 5 years ago

Looking at the corpus that does work (lsdc-test), I see no difference from my own:

Noahs-MacBook:speechrec noaj$ ls lsdc-test/
audio       lsdc-test.jsonl
Noahs-MacBook:speechrec noaj$ ls lsdc-test/audio/ | head -n10
013af52d-321a-44b3-a649-0930abb41f4a.wav
01d0aed1-264a-423a-912c-ef6471b7d16d.wav
0226bf2d-0cbb-4e9e-9e29-8b92c2dd9d85.wav
0248f382-153f-4844-b82b-5af6f872b4ee.wav
029dff52-c8e9-42b6-8783-cd1bffd77249.wav
05ffd75b-d2bd-46b3-a32c-463ced7147d4.wav
095109db-4f10-4988-8287-1a22c79dbdd5.wav
097ef85f-d377-4d37-b60d-f0e5bf7c666c.wav
0b611f3f-c6b1-48f0-85cc-a506a7d10022.wav
0b6fa32d-375c-45bb-9f87-a76d7af09190.wav
Noahs-MacBook:speechrec noaj$ head lsdc-test/lsdc-test.jsonl 
{"text": "the place seemed fragrant with all the riches of greek thought and song since the days when ptolemy philadelphus walked there with euclid and theocritus callimachus and lycophron", "duration_s": 11.72, "uuid": "e6d892b1-20f3-4cb9-ba62-f8d60302d78a"}
{"text": "the room had neither carpet nor fireplace and the only movables in it were a sofa bed a table and an arm chair all of such delicate and graceful forms as may be seen on ancient vases of a far earlier period than that whereof we write", "duration_s": 16.915, "uuid": "1540c191-da5d-4f60-8120-de5dc0277218"}
{"text": "but most probably had any of us entered that room that morning we should not have been able to spare a look either for the furniture or the general effect or the museum gardens or the sparkling mediterranean beyond but we should have agreed that the room was quite rich enough for human eyes for the sake of one treasure which it possessed and beside which nothing was worth a moment's glance", "duration_s": 24.395, "uuid": "99ca045e-fcd1-43b5-aa8e-f6f2903e5f9d"}
{"text": "she has lifted her eyes off her manuscript she is looking out with kindling countenance over the gardens of the museum her ripe curling greek lips such as we never see now even among her own wives and sisters open", "duration_s": 14.475, "uuid": "73624397-ca1c-420d-9b38-3538190e735e"}
{"text": "if they have ceased to guide nations they have not ceased to speak to their own elect", "duration_s": 5.63, "uuid": "40a83e85-93b2-4f82-800d-1a8cda921837"}
{"text": "if they have cast off the vulgar herd they have not cast off hypatia", "duration_s": 5.21, "uuid": "10fa7792-80be-4d44-9296-974553de5bdf"}
{"text": "to be welcomed into the celestial ranks of the heroic to rise to the immortal gods to the ineffable powers onward upward ever through ages and through eternities till i find my home at last and vanish in the glory of the nameless and the absolute one", "duration_s": 18.345, "uuid": "45dc87b1-4608-475c-b5b4-66c732364d13"}
{"text": "i to believe against the authority of porphyry himself too in evil eyes and magic", "duration_s": 5.97, "uuid": "e9a5ef19-fa45-40a3-a69c-53ddf7c69b1d"}
{"text": "what do i care for food", "duration_s": 2.155, "uuid": "108ad3a0-8acf-456e-831e-f51784ffa0fe"}
{"text": "how can he whose sphere lies above the stars stoop every moment to earth", "duration_s": 5.415, "uuid": "bc27f8b2-170e-4b99-96cb-d0c2f3c93b91"}
Noahs-MacBook:speechrec noaj$ soxi lsdc/audio/013af52d-321a-44b3-a649-0930abb41f4a.wav
soxi FAIL formats: can't open input file `lsdc/audio/013af52d-321a-44b3-a649-0930abb41f4a.wav': No such file or directory
Noahs-MacBook:speechrec noaj$ soxi lsdc-test/audio/013af52d-321a-44b3-a649-0930abb41f4a.wav

Input File     : 'lsdc-test/audio/013af52d-321a-44b3-a649-0930abb41f4a.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:03.41 = 54560 samples ~ 255.75 CDDA sectors
File Size      : 109k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
scottstephenson commented 5 years ago

This is probably a vocabulary error. The vocab should be lowercase a-z, space, and apostrophe. If your vocab contains any characters outside that set but you supply the same vocab file as lsdc-test, then you'll have a bad time.

Make your dataset conform to a-z, space, and apostrophe, or delete the vocab file and run train again to have Kur generate it for you.
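
As a minimal sketch of that check (illustrative; the jsonl path is assumed from the listings above), you can flag any transcript containing characters outside a-z, space, and apostrophe before training:

# Illustrative sketch: report transcripts with characters outside a-z, space, apostrophe.
import json

ALLOWED = set("abcdefghijklmnopqrstuvwxyz '")

with open('scotusspeech-wav/scotusspeech-test.jsonl') as f:
    for line in f:
        entry = json.loads(line)
        extra = sorted(set(entry['text']) - ALLOWED)
        if extra:
            print(entry['uuid'], extra)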

noajshu commented 5 years ago

Thanks @scottstephenson, this completely solved my problem! For now I will filter out numbers from my Kur export, but longer term I'm curious how I might modify the Kurfile to work with more (~39) characters.

scottstephenson commented 5 years ago

You can use numbers and all that; it's not a problem, though it might be harder to train. Just delete the vocab file, let Kur generate it, and see if it works out well. It'll do fine, but maybe not better than omitting them.

noajshu commented 5 years ago

There is no vocab file; this is character-level output.

scottstephenson commented 5 years ago

There is a character vocabulary; look for a vocab.json. The JSON array ["a", "b", "c", ..., "z", " ", "'"] is the standard vocab; it's all characters. You can include capitals or "1", "2", etc., and "?", "!", and others. The vocab will be auto-generated if you delete the vocab file and feed in a dataset that includes different characters.
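
For illustration, that standard vocab could be written out by hand like this (the vocab file name and location depend on your Kur setup, so treat the path here as an assumption):

# Illustrative only: write out the standard 28-character vocab described above.
import json
import string

vocab = list(string.ascii_lowercase) + [' ', "'"]  # 26 letters + space + apostrophe = 28 entries
with open('vocab.json', 'w') as f:
    json.dump(vocab, f)

This matches vocab.size: 28 in the Kurfile; the CTC loss adds one extra class for the blank label, which is why the error above reports num_classes: 29.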