deepgram / kur

Descriptive Deep Learning
Apache License 2.0

How to make variable-length input for seq2seq model #13

Closed go-bro closed 7 years ago

go-bro commented 7 years ago

Hello, I am new to Kur and am trying to build a sequence-to-sequence model for machine translation, similar to the TensorFlow seq2seq model.

I need variable-length input and variable-length output. I think I can use CTC for the output sequence, as in the ASR example. To test, I tried setting the model input to transcript, the same data source the speech model uses as its output. This is the error I get:

kur -vvv train translator.yaml
[INFO 2017-02-21 15:29:03,435 kur.kurfile:699] Parsing source: translator.yaml, included by top-level.
[INFO 2017-02-21 15:29:03,455 kur.kurfile:82] Parsing Kurfile...
[DEBUG 2017-02-21 15:29:03,455 kur.kurfile:784] Parsing Kurfile section: settings
[DEBUG 2017-02-21 15:29:03,461 kur.kurfile:784] Parsing Kurfile section: train
[DEBUG 2017-02-21 15:29:03,466 kur.kurfile:784] Parsing Kurfile section: validate
[DEBUG 2017-02-21 15:29:03,469 kur.kurfile:784] Parsing Kurfile section: test
[DEBUG 2017-02-21 15:29:03,472 kur.kurfile:784] Parsing Kurfile section: evaluate
[DEBUG 2017-02-21 15:29:03,476 kur.containers.layers.placeholder:63] Using short-hand name for placeholder: transcript
[DEBUG 2017-02-21 15:29:03,476 kur.containers.layers.placeholder:97] Placeholder "transcript" has a deferred shape.
[DEBUG 2017-02-21 15:29:03,483 kur.containers.layers.output:50] Using short-hand name for output: decoding
[DEBUG 2017-02-21 15:29:03,484 kur.kurfile:784] Parsing Kurfile section: loss
[INFO 2017-02-21 15:29:03,486 kur.loggers.binary_logger:71] Loading log data: log
[DEBUG 2017-02-21 15:29:03,486 kur.loggers.binary_logger:78] Loading old-style binary logger.
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:184] Loading binary column: training_loss_total
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:192] No such log column exists: log/training_loss_total
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:184] Loading binary column: training_loss_batch
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:192] No such log column exists: log/training_loss_batch
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:184] Loading binary column: validation_loss_total
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:192] No such log column exists: log/validation_loss_total
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:184] Loading binary column: validation_loss_batch
[DEBUG 2017-02-21 15:29:03,487 kur.loggers.binary_logger:192] No such log column exists: log/validation_loss_batch
[DEBUG 2017-02-21 15:29:05,391 kur.utils.package:233] File exists and passed checksum: /Users/noajshu/kur/lsdc-train.tar.gz
[DEBUG 2017-02-21 15:29:05,391 kur.supplier.speechrec:587] Unpacking input data: /Users/noajshu/kur/lsdc-train.tar.gz
[DEBUG 2017-02-21 15:29:18,819 kur.supplier.speechrec:612] Looking for metadata file.
[DEBUG 2017-02-21 15:29:18,819 kur.supplier.speechrec:647] Found metadata file: /Users/noajshu/kur/lsdc-train/lsdc-train.jsonl
[DEBUG 2017-02-21 15:29:18,819 kur.supplier.speechrec:648] Inferred source path: /Users/noajshu/kur/lsdc-train/audio
[DEBUG 2017-02-21 15:29:18,819 kur.supplier.speechrec:650] Scanning metadata file.
[DEBUG 2017-02-21 15:29:18,833 kur.supplier.speechrec:652] Entries counted: 2432
[DEBUG 2017-02-21 15:29:18,833 kur.supplier.speechrec:654] Loading metadata.
[DEBUG 2017-02-21 15:29:18,865 kur.supplier.speechrec:685] Entries kept: 2432
[DEBUG 2017-02-21 15:29:18,866 kur.supplier.speechrec:501] Using all available data.
[DEBUG 2017-02-21 15:29:18,866 kur.supplier.speechrec:442] Creating sources.
[INFO 2017-02-21 15:29:18,866 kur.supplier.speechrec:144] Restoring normalization statistics: norm.yml
[INFO 2017-02-21 15:29:18,866 kur.utils.normalize:185] Restoring normalization state from: norm.yml
[INFO 2017-02-21 15:29:24,688 kur.supplier.speechrec:307] Inferring vocabulary from data set.
[INFO 2017-02-21 15:29:24,710 kur.supplier.speechrec:342] Loaded a 28-word vocabulary.
[DEBUG 2017-02-21 15:29:24,711 kur.providers.batch_provider:57] Batch size set to: 16
[DEBUG 2017-02-21 15:29:24,712 kur.providers.batch_provider:64] Batch provider will force batches of exactly 16 samples.
[DEBUG 2017-02-21 15:29:24,946 kur.utils.package:233] File exists and passed checksum: /Users/noajshu/kur/lsdc-test.tar.gz
[DEBUG 2017-02-21 15:29:24,946 kur.supplier.speechrec:587] Unpacking input data: /Users/noajshu/kur/lsdc-test.tar.gz
[DEBUG 2017-02-21 15:29:26,521 kur.supplier.speechrec:612] Looking for metadata file.
[DEBUG 2017-02-21 15:29:26,521 kur.supplier.speechrec:647] Found metadata file: /Users/noajshu/kur/lsdc-test/lsdc-test.jsonl
[DEBUG 2017-02-21 15:29:26,521 kur.supplier.speechrec:648] Inferred source path: /Users/noajshu/kur/lsdc-test/audio
[DEBUG 2017-02-21 15:29:26,521 kur.supplier.speechrec:650] Scanning metadata file.
[DEBUG 2017-02-21 15:29:26,521 kur.supplier.speechrec:652] Entries counted: 271
[DEBUG 2017-02-21 15:29:26,521 kur.supplier.speechrec:654] Loading metadata.
[DEBUG 2017-02-21 15:29:26,524 kur.supplier.speechrec:685] Entries kept: 271
[DEBUG 2017-02-21 15:29:26,524 kur.supplier.speechrec:501] Using all available data.
[DEBUG 2017-02-21 15:29:26,524 kur.supplier.speechrec:442] Creating sources.
[INFO 2017-02-21 15:29:26,524 kur.supplier.speechrec:144] Restoring normalization statistics: norm.yml
[INFO 2017-02-21 15:29:26,524 kur.utils.normalize:185] Restoring normalization state from: norm.yml
[INFO 2017-02-21 15:29:32,311 kur.supplier.speechrec:307] Inferring vocabulary from data set.
[INFO 2017-02-21 15:29:32,313 kur.supplier.speechrec:342] Loaded a 28-word vocabulary.
[DEBUG 2017-02-21 15:29:32,314 kur.providers.batch_provider:57] Batch size set to: 16
[DEBUG 2017-02-21 15:29:32,314 kur.providers.batch_provider:64] Batch provider will force batches of exactly 16 samples.
[DEBUG 2017-02-21 15:29:32,314 kur.backend.backend:187] Using backend: keras
[INFO 2017-02-21 15:29:32,315 kur.backend.backend:80] Creating backend: keras
[INFO 2017-02-21 15:29:32,315 kur.backend.backend:83] Backend variants: none
[INFO 2017-02-21 15:29:32,315 kur.backend.keras_backend:81] The tensorflow backend for Keras has been requested.
[DEBUG 2017-02-21 15:29:32,315 kur.backend.keras_backend:189] Overriding environmental variables: {'KERAS_BACKEND': 'tensorflow', 'THEANO_FLAGS': None, 'TF_CPP_MIN_LOG_LEVEL': '1'}
[INFO 2017-02-21 15:29:34,063 kur.backend.keras_backend:195] Keras is loaded. The backend is: tensorflow
[INFO 2017-02-21 15:29:34,064 kur.model.model:260] Enumerating the model containers.
[INFO 2017-02-21 15:29:34,064 kur.model.model:265] Assembling the model dependency graph.
[DEBUG 2017-02-21 15:29:34,064 kur.model.model:272] Assembled Node: transcript
[DEBUG 2017-02-21 15:29:34,064 kur.model.model:274]   Uses:
[DEBUG 2017-02-21 15:29:34,065 kur.model.model:276]   Used by: ..recurrent.0
[DEBUG 2017-02-21 15:29:34,065 kur.model.model:277]   Aliases: transcript
[DEBUG 2017-02-21 15:29:34,065 kur.model.model:272] Assembled Node: ..recurrent.0
[DEBUG 2017-02-21 15:29:34,065 kur.model.model:274]   Uses: transcript
[DEBUG 2017-02-21 15:29:34,065 kur.model.model:276]   Used by: ..batch_normalization.0
[DEBUG 2017-02-21 15:29:34,065 kur.model.model:277]   Aliases: ..recurrent.0
[DEBUG 2017-02-21 15:29:34,065 kur.model.model:272] Assembled Node: ..batch_normalization.0
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:274]   Uses: ..recurrent.0
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:276]   Used by: ..recurrent.1
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:277]   Aliases: ..batch_normalization.0
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:272] Assembled Node: ..recurrent.1
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:274]   Uses: ..batch_normalization.0
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:276]   Used by: ..batch_normalization.1
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:277]   Aliases: ..recurrent.1
[DEBUG 2017-02-21 15:29:34,066 kur.model.model:272] Assembled Node: ..batch_normalization.1
[DEBUG 2017-02-21 15:29:34,067 kur.model.model:274]   Uses: ..recurrent.1
[DEBUG 2017-02-21 15:29:34,067 kur.model.model:276]   Used by: ..recurrent.2
[DEBUG 2017-02-21 15:29:34,067 kur.model.model:277]   Aliases: ..batch_normalization.1
[DEBUG 2017-02-21 15:29:34,067 kur.model.model:272] Assembled Node: ..recurrent.2
[DEBUG 2017-02-21 15:29:34,067 kur.model.model:274]   Uses: ..batch_normalization.1
[DEBUG 2017-02-21 15:29:34,067 kur.model.model:276]   Used by: ..batch_normalization.2
[DEBUG 2017-02-21 15:29:34,068 kur.model.model:277]   Aliases: ..recurrent.2
[DEBUG 2017-02-21 15:29:34,068 kur.model.model:272] Assembled Node: ..batch_normalization.2
[DEBUG 2017-02-21 15:29:34,068 kur.model.model:274]   Uses: ..recurrent.2
[DEBUG 2017-02-21 15:29:34,068 kur.model.model:276]   Used by: ..activation.0
[DEBUG 2017-02-21 15:29:34,068 kur.model.model:277]   Aliases: ..batch_normalization.2, ..for.0
[DEBUG 2017-02-21 15:29:34,068 kur.model.model:272] Assembled Node: ..activation.0
[DEBUG 2017-02-21 15:29:34,068 kur.model.model:274]   Uses: ..batch_normalization.2
[DEBUG 2017-02-21 15:29:34,069 kur.model.model:276]   Used by: decoding
[DEBUG 2017-02-21 15:29:34,069 kur.model.model:277]   Aliases: ..activation.0
[DEBUG 2017-02-21 15:29:34,069 kur.model.model:272] Assembled Node: decoding
[DEBUG 2017-02-21 15:29:34,069 kur.model.model:274]   Uses: ..activation.0
[DEBUG 2017-02-21 15:29:34,069 kur.model.model:276]   Used by:
[DEBUG 2017-02-21 15:29:34,069 kur.model.model:277]   Aliases: decoding
[INFO 2017-02-21 15:29:34,069 kur.model.model:280] Connecting the model graph.
[DEBUG 2017-02-21 15:29:34,070 kur.model.model:311] Building node: transcript
[DEBUG 2017-02-21 15:29:34,070 kur.model.model:312]   Aliases: transcript
[DEBUG 2017-02-21 15:29:34,070 kur.model.model:313]   Inputs:
[DEBUG 2017-02-21 15:29:34,070 kur.containers.layers.placeholder:117] Creating placeholder for "transcript" with data type "float32".
[DEBUG 2017-02-21 15:29:34,070 kur.model.model:125] Trying to infer shape for input "transcript"
[DEBUG 2017-02-21 15:29:34,070 kur.model.model:143] Inferred shape for input "transcript": (None,)
[DEBUG 2017-02-21 15:29:34,070 kur.containers.layers.placeholder:127] Inferred shape: (None,)
[DEBUG 2017-02-21 15:29:34,098 kur.model.model:382]   Value: Tensor("transcript:0", shape=(?, ?), dtype=float32)
[DEBUG 2017-02-21 15:29:34,098 kur.model.model:311] Building node: ..recurrent.0
[DEBUG 2017-02-21 15:29:34,098 kur.model.model:312]   Aliases: ..recurrent.0
[DEBUG 2017-02-21 15:29:34,098 kur.model.model:313]   Inputs:
[DEBUG 2017-02-21 15:29:34,099 kur.model.model:315]   - transcript: Tensor("transcript:0", shape=(?, ?), dtype=float32)
Traceback (most recent call last):
  File "/Users/noajshu/.virtualenvs/rnn-translator/bin/kur", line 11, in <module>
    load_entry_point('kur==0.3.0', 'console_scripts', 'kur')()
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/kur/__main__.py", line 382, in main
    sys.exit(args.func(args) or 0)
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/kur/__main__.py", line 61, in train
    func = spec.get_training_function()
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/kur/kurfile.py", line 329, in get_training_function
    model = self.get_model(provider)
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/kur/kurfile.py", line 152, in get_model
    self.model.build()
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/kur/model/model.py", line 282, in build
    self.build_graph(input_nodes, output_nodes, network)
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/kur/model/model.py", line 339, in build_graph
    target=layer
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/kur/backend/keras_backend.py", line 242, in connect
    return target(inputs)
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/keras/engine/topology.py", line 529, in __call__
    self.assert_input_compatibility(x)
  File "/Users/noajshu/.virtualenvs/rnn-translator/lib/python3.6/site-packages/keras/engine/topology.py", line 469, in assert_input_compatibility
    str(K.ndim(x)))
ValueError: Input 0 is incompatible with layer ..recurrent.0: expected ndim=3, found ndim=2

Here is my Kurfile:

settings:

  # Deep learning model
  # cnn:
  #   kernels: 1000
  #   size: 11
  #   stride: 2
  rnn:
    size: 1000
    depth: 3
  vocab:
    # Need for CTC
    size: 28

  # Setting up the backend.
  backend:
    name: keras
    backend: tensorflow

  # Batch sizes
  provider: &provider
    batch_size: 16
    force_batch_size: yes

  # Where to put the data.
  data: &data
    path: "~/kur"
    type: spec
    max_duration: 50
    max_frequency: 8000
    normalization: norm.yml

  # Where to put the weights
  weights: &weights weights

###############################################################################
model:

  # This is Baidu's DeepSpeech model:
  #   https://arxiv.org/abs/1412.5567
  # Kur makes prototyping different versions of it incredibly easy.

  # The model input is audio data (called utterances).
  - input: transcript

  # One-dimensional, variable-size convolutional layers to extract more
  # efficient representation of the data.
  # - convolution:
  #     kernels: "{{ cnn.kernels }}"
  #     size: "{{ cnn.size }}"
  #     strides: "{{ cnn.stride }}"
  #     border: valid
  # - activation: relu
  # - batch_normalization

  # A series of recurrent layers to learn temporal sequences.
  - for:
      range: "{{ rnn.depth }}"
      iterate:
        - recurrent:
            size: "{{ rnn.size }}"
            sequence: yes
        - batch_normalization

  # # A dense layer to get everything into the right output shape.
  # - parallel:
  #     apply:
  #       - dense: "{{ vocab.size + 1 }}"
  - activation: softmax

  # The output is the transcription.
  - output: decoding

###############################################################################
train:

  data:
    # A "speech_recognition" data supplier will create these data sources:
    #   utterance, utterance_length, transcript, transcript_length, duration
    - speech_recognition:
        <<: *data
        url: "http://kur.deepgram.com/data/lsdc-train.tar.gz"
        checksum: >-
          fc414bccf4de3964f895eaa9d0e245ea28810a94be3079b55505cf0eb1644f94

  weights: *weights
  provider:
    <<: *provider
    sortagrad: duration

  log: log

  optimizer:
    name: sgd
    nesterov: yes
    learning_rate: 2e-4
    momentum: 0.9
    clip:
      norm: 100

###############################################################################
validate: &validate
  data:
    - speech_recognition:
        <<: *data
        url: "http://kur.deepgram.com/data/lsdc-test.tar.gz"
        checksum: >-
          e1c8cf9cd57e8c1ae952b6e4e40dcb5c8e3932c81ecd52c090e4a05c8ebbea2b

  weights: *weights
  provider: *provider

  hooks:
    - transcript

###############################################################################
test: *validate

###############################################################################
evaluate: *validate

###############################################################################
loss:
  - name: ctc
    # The model's output (its best-guess transcript).
    target: decoding
    # How long the corresponding audio utterance is.
    input_length: transcript_length
    relative_to: transcript
    # How long the ground-truth transcript is.
    output_length: transcript_length
    # The ground-truth transcript itself.
    output: transcript

...

Could you give me some advice on getting this network configured to use character text sequence input? Do I need to write a new supplier?

ajsyp commented 7 years ago

There are a couple of different ways to implement sequence-to-sequence models; for an academic overview, see Cho et al., 2014 and Sutskever et al., 2014. Let me propose a simple starting point (more complicated examples would likely require hacking the current layers, which is doable but less straightforward).

The simple approach is to take your input sequence and pass it through an RNN stack, keeping only the last timestep. This last timestep is, in some sense, a vector encoding of the input sequence. We can then present this vector to the decoder RNNs at each timestep, asking them to produce the output sequence given the encoding vector.
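To make the shapes concrete, here is a minimal NumPy sketch of the encode-then-repeat idea. It is not how Kur or Keras implement their recurrent layers: `encode_last_step` is a hypothetical helper, the weights are random rather than trained, and the sizes (100 input timesteps, 29 features, 512 hidden units, 50 output timesteps) are simply the ones assumed in the Kurfile that follows.

```python
import numpy as np

def encode_last_step(inputs, w_in, w_rec):
    """Run a plain tanh RNN over `inputs` and return only the final hidden state."""
    hidden = np.zeros(w_rec.shape[0], dtype=inputs.dtype)
    for step in inputs:                       # inputs: (timesteps, features)
        hidden = np.tanh(step @ w_in + hidden @ w_rec)
    return hidden                             # shape: (hidden_size,)

rng = np.random.default_rng(0)
inputs = rng.random((100, 29)).astype(np.float32)                  # one input sequence
w_in = rng.normal(scale=0.1, size=(29, 512)).astype(np.float32)    # untrained weights
w_rec = rng.normal(scale=0.1, size=(512, 512)).astype(np.float32)  # untrained weights

encoding = encode_last_step(inputs, w_in, w_rec)   # one vector for the whole sequence
decoder_inputs = np.tile(encoding, (50, 1))        # same vector at each of 50 steps

print(encoding.shape, decoder_inputs.shape)
```

Because every decoder timestep receives an identical input, the decoder's own recurrent state is the only thing that lets it emit a different word at each position.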

model:

  # The input sequence. Let's assume your input sequences have 100 timesteps,
  # each of which is a one-hot vector representation of a 29-word vocabulary.
  - input:
      shape: [100, 29]
    name: input_sequence

  # The encoder stack. We can have as many encoder layers as we want...
  - recurrent:
      size: 512
  # ... provided that our last encoder layer produces a single vector.
  - recurrent:
      size: 512
      sequence: no

  # Optionally, we can introduce another dense layer.
  - dense: 100
  - activation: relu

  # Now we will repeat this encoded vector and present it to the decoder layer
  # fifty times.
  - repeat: 50

  # The decoder stack.
  - recurrent:
      size: 512

  # Your last decoder layer could have `size` equal to the output vocabulary
  # size, or we can use a dense layer to reshape each output sequence.
  - parallel:
      apply:
        - dense: 29

  # Softmax so that each timestep is a probability distribution over the
  # vocabulary, suitable for categorical cross-entropy loss.
  - activation: softmax

  # The final output sequence. Each output has 50 timesteps, and each timestep
  # is a one-hot encoded vector representation of a 29-word vocabulary.
  - output: output_sequence

(Note that the "repeat" layer was only recently added to Kur, so be sure to use the latest version from GitHub.)

Sutskever et al., 2014 found that their models worked best when the input sequence was reversed. So instead of trying to train the mapping "A B C" -> "X Y Z", try mapping "C B A" -> "X Y Z". Thus, the recommendation on preprocessing your data would be:

  1. Reverse the input sequences.
  2. Left-pad the input sequences with zeros (or use [null, 29] as the shape of the input layer, and use batches of size 1).
  3. Add an end-of-sequence marker to the output. This will count as a "word" in your vocabulary, but provides additional flexibility, because now your model is forced to learn a distribution over all possible sequence lengths.
  4. Right-pad the output sequences with zeros.
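The four steps above could be sketched in NumPy roughly as follows. This is only an illustration under stated assumptions: `one_hot` and `prepare_pair` are hypothetical helpers (not part of Kur), sequences are lists of integer word indices, the fixed lengths 100 and 50 match the example model, and index 28 of the 29-entry vocabulary is arbitrarily reserved for the end-of-sequence marker.

```python
import numpy as np

def one_hot(indices, vocab_size):
    """Return a (len(indices), vocab_size) array with a single 1.0 per row."""
    indices = np.asarray(indices, dtype=np.intp)
    encoded = np.zeros((len(indices), vocab_size), dtype=np.float32)
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded

def prepare_pair(source, target, in_steps=100, out_steps=50,
                 vocab_size=29, eos=28):
    """Encode one (source, target) pair; `eos` is an assumed marker index."""
    if len(source) > in_steps or len(target) + 1 > out_steps:
        raise ValueError("sequence too long for the fixed model shape")

    # Steps 1-2: reverse the input, then left-pad with all-zero timesteps.
    x = np.zeros((in_steps, vocab_size), dtype=np.float32)
    if len(source) > 0:
        x[in_steps - len(source):] = one_hot(list(source)[::-1], vocab_size)

    # Steps 3-4: append the end-of-sequence marker, then right-pad with zeros.
    y = np.zeros((out_steps, vocab_size), dtype=np.float32)
    marked = list(target) + [eos]
    y[:len(marked)] = one_hot(marked, vocab_size)
    return x, y

x, y = prepare_pair([1, 2, 3], [4, 5])
print(x.shape, y.shape)
```

Stacking several such pairs with `np.stack` gives arrays with three dimensions (batch, timesteps, vocabulary), which is the `ndim=3` input that the recurrent layer in the traceback above was expecting.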

You can also play with different "signal" words. For example, you might try putting an end-of-sequence marker at the end of your reversed inputs, too.
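The end-of-sequence marker is also what lets a fixed 50-timestep output be read back as a variable-length sequence. A minimal sketch, assuming the model output for one sample is a (timesteps, vocabulary) array of softmax probabilities and that index 28 is the marker (both are assumptions for illustration; `decode` is a hypothetical helper, not a Kur function):

```python
import numpy as np

def decode(probabilities, eos=28):
    """Pick the most likely word per timestep and cut at the first end marker."""
    indices = probabilities.argmax(axis=-1).tolist()
    if eos in indices:
        indices = indices[:indices.index(eos)]
    return indices

# Toy output: words 4 and 5, then the marker, then padding timesteps.
probs = np.zeros((50, 29), dtype=np.float32)
probs[0, 4] = 1.0
probs[1, 5] = 1.0
probs[2, 28] = 1.0
print(decode(probs))
```

If the marker never appears, this sketch returns all 50 predicted indices unchanged, so a caller may want to treat that case as a truncated output.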

go-bro commented 7 years ago

Thanks! I'm getting some good results with the repeat layer 👍