EndPointCorp / end-point-blog

End Point Dev blog
https://www.endpointdev.com/blog/

Comments for Speech Recognition from scratch using Dilated Convolutions and CTC in Tensorflow #1481

jonjensen opened this issue 5 years ago

jonjensen commented 5 years ago

Comments for https://www.endpointdev.com/blog/2019/01/speech-recognition-with-tensorflow/ By Kamil Ciemniewski

To enter a comment:

  1. Log in to GitHub
  2. Leave a comment on this issue.
ivelin commented 5 years ago

Fantastic post @jonjensen . Thank you for making the time to detail each step of the way.

Have you thought about a follow-up post that addresses streaming audio? Issues such as:

Thanks again for a great writeup.

Ivelin

kamilc commented 5 years ago

@ivelin Thank you for your kind words. These are really fantastic ideas for the follow-up series!

The first two would extend the model serving to be closer to real-world usage. The third one (removing or ignoring the background noise) was already somewhat covered in the post: the simplistic idea was to add random and real-world noise at the data augmentation step to make the trained model more robust to it (a minimal sketch of that idea follows below). Of course, there's potentially still a lot more to be done. As for the fourth one: I'd need to benchmark the performance of what we have already. It would then be interesting to treat it as a baseline and explore potential improvements.
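A minimal sketch of that augmentation idea, assuming a float32 waveform scaled to [-1, 1]; add_noise is a hypothetical helper, and the factor range simply mirrors the random_noise_factor_min/max parameters that appear later in this thread:

import numpy as np

def add_noise(wave, factor_min=0.1, factor_max=0.15):
    # Mix white noise into the waveform, scaled relative to its peak,
    # so the model learns to be robust to background noise.
    factor = np.random.uniform(factor_min, factor_max)
    noise = np.random.randn(len(wave)).astype(wave.dtype)
    return wave + factor * np.abs(wave).max() * noise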

All of the above is immensely interesting. I'll need to fit it somehow into my busy schedule. No promises, then, but you've got me itching to write another blog post :)

Best, Kamil

seankerman commented 5 years ago

Kamil, Great post! I noticed that Common Voice updated their data set a few months back, and I didn't see a way to download the old one. In any case, I modified the code to load the new CV data set. Everything seems to run fine; however, the edit distance bottoms out at about 0.8 after about 30 min of training. I didn't change much else in the notebook. I did do a little bit of cleaning up on the alphabet (removing parentheses, brackets, etc.) but that's about it. Any idea what might be going on?

kamilc commented 5 years ago

@seankerman Might it be that your labels are getting mixed? I had an issue in my input_fn making the data and labels out of sync — I don't recall what specifically it was though. Did you run the tests? There's a test_dataset_returns_data_in_order case that checks exactly for that.

You could also use tf.summary.text and tf.summary.audio and check in TensorBoard whether they match, as sketched below. It definitely should keep converging steadily, both in loss and in the metrics (edit distance), especially within the first hours / days.
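A minimal sketch of that check, assuming TF 1.x inside an Estimator model_fn; add_debug_summaries is a hypothetical helper, audio is a [batch, samples] float32 waveform tensor, and texts is the matching string tensor of transcriptions:

import tensorflow as tf

def add_debug_summaries(audio, texts, sampling_rate=16000):
    # Emit a few audio clips and their transcriptions side by side,
    # so TensorBoard reveals any data/label mismatch.
    tf.summary.audio('debug/audio', audio, sample_rate=sampling_rate, max_outputs=4)
    tf.summary.text('debug/text', texts[:4])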

Hope it helps. Kamil

NitinShuklaML commented 5 years ago

@kamilc Fantastic post. Really loved your detailed explanation of what was definitely an arduous task. Just one query: will your model work on a Windows machine if I pull in the TensorFlow image using Docker and wrap a package.json file around it?

kamilc commented 5 years ago

@NitinShuklaML Thank you!

As for running it: I haven't tried myself, but from what I know it works via Docker on Windows just fine. I'd wonder about GPU support, but people discussing this gist seem to suggest it's not a problem either. Inference is pretty quick even without it, if you just want to use the SavedModel from the complementary repo.

You could even use curl exactly as I did in the article, via the version that comes with WSL or MinGW.

Have fun! :)

NitinShuklaML commented 5 years ago

Hi @kamilc ,

I installed Docker on Windows and followed the instructions in your blog post. However, when I tried to mount the saved model directory ({pwd/speech-recognition/best} in my case) using the following command:

docker run -t --rm -p 8501:8501 -v "/speech-recognition/best/1546646971:/models/speech/1" -e MODEL_NAME=speech tensorflow/serving

I got the following error:

2019-06-12 10:41:13.579523: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: speech model_base_path: /models/speech
2019-06-12 10:41:13.581488: I tensorflow_serving/model_servers/server_core.cc:461] Adding/updating models.
2019-06-12 10:41:13.581569: I tensorflow_serving/model_servers/server_core.cc:558] (Re-)adding model: speech
2019-06-12 10:41:13.682551: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: speech version: 1}
2019-06-12 10:41:13.682627: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: speech version: 1}
2019-06-12 10:41:13.682656: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: speech version: 1}
2019-06-12 10:41:13.682782: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: speech version: 1} failed: Not found: Specified file path does not appear to contain a:

Can you point me in any meaningful direction regarding how I should proceed? Am I doing something wrong? Attaching a screenshot of the console.

[Screenshot: DockerMountError]

kamilc commented 5 years ago

@NitinShuklaML - it seems to me that the -v "/speech-recognition/best/1546646971:/models/speech/1" part might be the culprit. The first part of -v <host-dir>:<in-container-dir> (the host-dir) should be specified the way you'd address it from your OS's console. Let's say you have the model under ".\speech-recognition\best\1546646971" locally (notice the Windows way of separating directories in the path). I haven't used Docker with Windows, but from what I can see in the docs, it most likely expects you to give it something resembling this: -v .\speech-recognition\best\1546646971:/models/speech/1. That assumes, of course, that .\speech-recognition\best\1546646971 exists relative to your working directory when running the docker command. I have no way to test it, but could you try it out?

NitinShuklaML commented 5 years ago

@kamilc ,

Thanks for such a quick response. I tried your suggestion, but I am getting a new error.

C:\Users\Nitin>docker run -t --rm -p 8501:8501 -v .\speech-recognition\best\1546646971:/models/speech/1 -e MODEL_NAME=speech tensorflow/serving docker: Error response from daemon: Mount denied: The source path ".\speech-recognition\best\1546646971" is not a valid Windows path. See 'docker run --help'.

However, there is another thing that is bothering me, because I could not understand it properly.

docker run -t --rm -p 8501:8501 -v "/speech-recognition/best/1546646971:/models/speech/1" -e MODEL_NAME=speech tensorflow/serving

What does the number 1546646971 mean? I downloaded your model and unzipped it, but all I got was the expanded best.tar and no sub-directory. I am sorry if these questions are too basic.

Attaching a screenshot of my directory:

[Screenshot: DirectoryStructure]

kamilc commented 5 years ago

@NitinShuklaML The best\1546646971 directory is the one containing the saved_model.pb file. From what I can see in the screenshot you've attached, you don't have best.tar.bz2 fully extracted. On Unix-like systems you'd run tar xvjf best.tar.bz2 to do that; there are many de-archivers for Windows that can handle it too. The archive itself contains the 1546646971/ folder, and you need to point Docker at it. So say that you've extracted the contents of best.tar.bz2 to best, so that you can see the .\best\1546646971 directory. Then -v ".\best\1546646971:/models/speech/1" should work.
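If a de-archiver is hard to come by, a small sketch using Python's standard tarfile module would do the same job (assuming best.tar.bz2 sits in the working directory):

import tarfile

# Extract best.tar.bz2 into ./best, which should yield best/1546646971/
with tarfile.open('best.tar.bz2', 'r:bz2') as archive:
    archive.extractall('best')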

Hope this helps

NitinShuklaML commented 5 years ago

@kamilc This did work, but I had to give the entire path instead of a relative path. Thanks for your quick and accurate responses. Your project has helped me scale up on several interesting concepts in machine learning. Thanks!

kamilc commented 5 years ago

@NitinShuklaML Great! I'm glad I could help

NitinShuklaML commented 5 years ago

Hi @kamilc ,

I have another query regarding the POST request using curl over the API that we have exposed.

I would like to understand the format of the parameters, because I naively put down the path of my audio file (.wav) here, but apparently the API is looking for float values.

{"inputs": {"audio": , "length": }}

When I run the cmd command with the wav file path, I get the following error:

{ "error": "JSON Parse error: Invalid value. at offset: 21" }

So I wanted to know: what is the actual data format you are passing in both of the above parameters of the payload.json file?

kamilc commented 5 years ago

@NitinShuklaML The value for "audio" is expected to be an array of floats. Imagine the model being served on a remote server: just passing your local file path means nothing on that remote machine. This is why you're expected to give it the raw data directly. How do you get this array? You can use the snippet from the beginning of the article:

import librosa

SAMPLING_RATE = 16000

# path_to_file points at your local WAV file
wave, _ = librosa.load(path_to_file, sr=SAMPLING_RATE)

Here, the wave variable is a numpy array of floats: exactly what's needed. You could, e.g., print it in the Python REPL and copy and paste it into your JSON payload file. You could also post the request from Python using the requests library, as explained here: https://2.python-requests.org/en/master/user/quickstart/#more-complicated-post-requests and sketched below.
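A minimal sketch of that direct route, assuming the model is served locally as in the docker run command above (port 8501, model name speech); the file name is a placeholder:

import librosa
import requests

SAMPLING_RATE = 16000

wave, _ = librosa.load('your-test-file.wav', sr=SAMPLING_RATE)
payload = {'inputs': {'audio': wave.tolist(), 'length': len(wave)}}

# POST to TensorFlow Serving's REST predict endpoint
response = requests.post('http://localhost:8501/v1/models/speech:predict', json=payload)
print(response.json())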

NitinShuklaML commented 5 years ago

Hi @kamilc ,

I followed your instructions as per the last comment. I made a payload.json file and copied the ndarray of the audio file inside the "audio" key. But I got a JSON response with an empty text key while getting responses for logits. Also, the raw ndarray threw a JSON parse error, which was fixed when I introduced commas between individual elements of the ndarray.

I am attaching my payload.json file (inside a zipped folder)

payload.zip

I am also attaching the API output file:

API-output.txt

In case you have issues while unzipping, I am also uploading the payload.json file as a txt file:

payload.txt

PS: I tried posting directly via Python, but I got a JSON parse error.

It feels like I am so near to getting this great project working.

kamilc commented 5 years ago

@NitinShuklaML this payload file doesn't seem right, unfortunately.

Firstly, the length in your file says 3 while it should be the length of the audio array. This matters because during training the examples are zero-padded on the right. This additional length parameter makes the computations faster but also more correct (the CTC loss doesn't need to be computed for the zero-padded postfix, hence the gradient isn't affected).

Secondly, the values in your array don't seem like valid audio data. They are mostly within -1e-02 and 1e02. Here's what real data looks like:

array([-8.3094210e-07, -2.7989950e-05, -3.3332242e-05, ...,
        1.3709553e-08, -9.8635171e-09,  0.0000000e+00], dtype=float32)

Can you see the range of values?

Here's how you could dump the proper JSON in Python:

import librosa
import json

SAMPLING_RATE=16000

path_to_file = 'your-test-file.wav'
wave, _ = librosa.load(path_to_file, sr=SAMPLING_RATE)
data = { 'inputs': { 'audio': wave.tolist(), 'length': len(wave) } }

with open('result.json', 'w') as fp:
    json.dump(data, fp)
NitinShuklaML commented 5 years ago

Hi @kamilc ,

Thanks for helping out again. I am able to get responses in text. However, the text doesn't seem right. I used your test-me.m4a audio ("It seems to work just fine"), but I am getting text that just doesn't make sense.

Expected Text

It seems to work just fine

Actual Text

[['w', 'n', 'h', 'e', 't', 'w', 'e', 't', ' ', 's', 'e', ' ', 'n', ' ', 't', 'h', 'o', 'l', 'a', 'o', 'e', ' ', 'g', 'd', 'j', 'u', 'n', 't', ' ', 'e', 'm', 'e', 't', 'i', 'n', ' ', 'p', 'a', 'n']]

Is it because yours is a GPU-trained model while I am running it on a CPU?

I used ffmpeg to convert the m4a to a wav file. Attaching the m4a, the wav file, the generated JSON, and the response text.

Attaching original test-me.m4a file, test-me-wav.wav file, payload.json file

test-params.zip

AlanPerry-MS commented 5 years ago

@kamilc ,

Great blog post! The comments thread is also extremely informative; I basically set up this project on my Windows machine because of the comment section on your blog post. However, I am also encountering the same issue as mentioned in the above post by @NitinShuklaML. I wonder if this is due to a mismatch in the wav file configuration? I tried 16 kHz mono at 16 bit but am getting gibberish text for your test-me.m4a file. Can you share the specs of the wav file that you use as input? My guess is that the model you developed is sensitive to the specs.

kamilc commented 5 years ago

@NitinShuklaML @AlanPerry-MS I noticed it too here locally but haven't had the time to troubleshoot. My initial thought is that I might have mistakenly pushed a non-optimal SavedModel - not the one that gave the results shown in the article. I will try to squeeze in some time to check and potentially retrain, but no promises with regard to the ETA.

AlanPerry-MS commented 5 years ago

@kamilc,

Thanks, Kamil. I really appreciate your quick response.

tanmayrsm commented 5 years ago

Hi kamilc, can we create something similar for real-time speech-to-text using a microphone? I am finding it difficult to get a model to train, and to adapt the code accordingly, for real-time speech-to-text.

Honghe commented 4 years ago

@kamilc Thank you for your example. But in our environment there is the following problem. Env:

The GPU has little memory used, but GPU-Util is always 0%, and the CPU is often at a high usage percentage. And there is a memory leak.

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f744e4bada0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
{'parallelize': False, 'shuffle': True, 'max_text_length': None, 'min_text_length': None, 'max_wave_length': 320000, 'random_shift_min': -4000, 'random_shift_max': 4000, 'random_stretch_min': 0.8, 'random_stretch_max': 1.2, 'random_noise': 0.75, 'random_noise_factor_min': 0.1, 'random_noise_factor_max': 0.15, 'epochs': 10000, 'batch_size': 18, 'augment': False}
Constraining dataset to the max_wave_length
Resulting dataset length: 2071
INFO:tensorflow:Calling model_fn.
Is training? True
WARNING:tensorflow:From <ipython-input-32-8bb2a1feb940>:39: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow_core/python/layers/normalization.py:327: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-147
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 147 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
INFO:tensorflow:loss = 841.5093, step = 148
INFO:tensorflow:Saving checkpoints for 178 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py:963: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
{'parallelize': False, 'shuffle': True, 'max_text_length': None, 'min_text_length': None, 'max_wave_length': 320000, 'random_shift_min': -4000, 'random_shift_max': 4000, 'random_stretch_min': 0.8, 'random_stretch_max': 1.2, 'random_noise': 0.75, 'random_noise_factor_min': 0.1, 'random_noise_factor_max': 0.15, 'epochs': 10000, 'batch_size': 18, 'augment': False}
Constraining dataset to the max_wave_length
Resulting dataset length: 230
INFO:tensorflow:Calling model_fn.
Is training? False
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-11-23T22:43:30Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-178
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 405, in _handle_workers
    pool._maintain_pool()
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 246, in _maintain_pool
    self._repopulate_pool()
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
    w.start()
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
kamilc commented 4 years ago

@Honghe

This might be due to numerous reasons. The first that comes to mind is that the preprocessing step is done fully on the CPU by design (I wanted to throw as big batches into the GPU as possible). Before it's preprocessed (augmented with random stretch, shift, and noise), it also has to be read from disk. So if there's any bottleneck here (for whatever reason), I wouldn't be surprised to see the GPU sitting there just bored... I could imagine a setup with a GPU as powerful as the one shown in the screenshot but with extremely slow disks, resulting in the GPU constantly waiting for data.

This is all just me playing the guessing game, of course, as I don't know much about your hardware and software setup, or whether you've changed anything in the code. Having another look at the screenshot, just 153MiB of GPU memory allocated means that there isn't even a full batch in memory (as it takes way more than that...). Did you test your setup against some other, possibly simpler code? Can you try it out in the REPL and see if TensorFlow is allocating tensors on the GPU? Something like the check below would do.
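A quick sanity check, assuming TF 1.x: turn on device placement logging and run a small matmul; the console output should show ops landing on /device:GPU:0.

import tensorflow as tf

# Ask TF to print where each op runs; watch for /device:GPU:0 entries.
config = tf.ConfigProto(log_device_placement=True)

with tf.Session(config=config) as sess:
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)
    sess.run(b)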

For how long have you had the training process running before you took that screenshot? I might be able to come up with ideas but you'd need to give me a bit more details.

Best

kamilc commented 4 years ago

@tanmayrsm It took me three months to see your question - apologies for the unintended silence!

Without looking at the literature first, here are my thoughts:

If I were to code a model for real-time speech recognition (one that wouldn't require you to re-send the same data over and over again), I'd look into feeding a "state" vector along with the audio data. Here's why: we're constrained by the signal length, as our tensors have to have a definite size.

The idea would be to return not only the logits from the network but also a state vector that encodes the context up to the current "window" of audio data (see the sketch below). An obvious question arises: how does one train a model like this, where consecutive examples depend on the outputs of the previous ones? This might get very tricky in practice. For example, I wouldn't be surprised to see the gradient being thrown off by inaccurate state vectors at the beginning of training. As always, it's best to read what's been written in papers first.
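A purely hypothetical sketch of that inference loop; run_model stands in for a network that would have to be trained to accept and return such a state vector, and all sizes here are made up:

import numpy as np

STATE_SIZE = 256
WINDOW = 16000  # one second of audio at 16 kHz

def run_model(window, state):
    # Stand-in for the real network: returns per-frame logits plus an
    # updated state encoding the context heard so far.
    logits = np.zeros((window.shape[0] // 640, 29), dtype=np.float32)
    new_state = np.tanh(state + window.mean())
    return logits, new_state

state = np.zeros(STATE_SIZE, dtype=np.float32)
# Fake audio stream: five one-second windows of noise.
for window in (np.random.randn(WINDOW).astype(np.float32) for _ in range(5)):
    logits, state = run_model(window, state)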

Honghe commented 4 years ago

@kamilc Thank you! I have updated the full error log and parameters in the comment above. OSError: [Errno 12] Cannot allocate memory happens every time it runs at the Is training? False stage. And may I ask which tensorflow-gpu version you use?

kamilc commented 4 years ago

@Honghe It's failing somewhere in the preprocessing part. Notice how the generator_fn (which lives within the input_fn) uses Python's multiprocessing.Pool. When dataset_params receives parallelize=True, it does the preprocessing of all the examples in a batch at once, loading it all into memory, etc. My first debugging step would be to disable the parallelization by setting it to False. The next one (if the previous doesn't help) would be to disable it in code, just to be 100% sure it doesn't end up in the "parallel" branch here:

if params['parallelize']:
  audios = pool.map(
    load_wave_fn,
    buffer
  )
else:
  audios = map(
    load_wave_fn,
    buffer
  )

If this starts to work, then you can either decrease the batch_size or adjust this function somehow, keeping your bigger batch size (which is way better for the training, making it much more stable and faster) while decreasing the number of parallel memory allocations. You could find a middle ground between sequential and totally parallel by doing parts in parallel... sequentially, as in the sketch below.
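A hedged sketch of that middle ground: process the items in fixed-size chunks, parallel within a chunk and sequential across chunks, which bounds peak memory. chunked_map is a hypothetical helper; pool, load_wave_fn, and buffer are the names from the article's input_fn:

from multiprocessing import Pool

def chunked_map(pool, fn, items, chunk_size=8):
    # Parallel within each chunk, sequential across chunks,
    # bounding how many results are in flight at once.
    results = []
    for i in range(0, len(items), chunk_size):
        results.extend(pool.map(fn, items[i:i + chunk_size]))
    return results

# In generator_fn this would replace the all-at-once pool.map:
# audios = chunked_map(pool, load_wave_fn, buffer)

if __name__ == '__main__':
    with Pool() as pool:
        print(chunked_map(pool, abs, list(range(-20, 20))))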

I'd also closely monitor the system while this code runs, with e.g. htop, so you can see in real time whether it hits the memory limit.

Hope it helps!

Honghe commented 4 years ago

It seems each epoch takes too long: 120 seconds per batch_size=16, and only one CPU core's utilization is at 100% most of the time.

experiment(
    dataset_params(
        parallelize=True,
        batch_size=16,
        epochs=10000,
        max_wave_length=320000,
        augment=False,
        random_noise=0.75,
        random_noise_factor_min=0.1,
        random_noise_factor_max=0.15,
        random_stretch_min=0.8,
        random_stretch_max=1.2
    ),
    codename='deep_max_20_seconds',
    alphabet = pinyin_table,
    causal_convolutions=False,
    stack_dilation_rates=[1, 3, 9, 27],
    stacks=6,
    stack_kernel_size=7,
    stack_filters=3*128,
    n_fft=160*8,
    frame_step=160*4,
    num_mel_bins=160,
    optimizer='Momentum',
    lr=0.00001,
    clip_gradients=20.0
)
INFO:tensorflow:Using config: {'_model_dir': 'stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.5
  allow_growth: true
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3af4242320>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
{'parallelize': True, 'shuffle': True, 'max_text_length': None, 'min_text_length': None, 'max_wave_length': 320000, 'random_shift_min': -4000, 'random_shift_max': 4000, 'random_stretch_min': 0.8, 'random_stretch_max': 1.2, 'random_noise': 0.75, 'random_noise_factor_min': 0.1, 'random_noise_factor_max': 0.15, 'epochs': 10000, 'batch_size': 16, 'augment': False}
Constraining dataset to the max_wave_length
Resulting dataset length: 2071
INFO:tensorflow:Calling model_fn.
Is training? True
logits: Tensor("speech_net/add:0", shape=(?, ?, 1836), dtype=float32)
lengths: Tensor("Cast_1:0", shape=(?,), dtype=int32)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/metrics_impl.py:363: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-15291
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 15291 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
INFO:tensorflow:loss = 140.1194, step = 15291
INFO:tensorflow:global_step/sec: 0.859657
INFO:tensorflow:loss = 122.83241, step = 15391 (116.327 sec)
INFO:tensorflow:global_step/sec: 0.80614
INFO:tensorflow:loss = 134.20001, step = 15491 (124.049 sec)
INFO:tensorflow:global_step/sec: 0.999378
INFO:tensorflow:loss = 117.942604, step = 15591 (100.062 sec)
INFO:tensorflow:global_step/sec: 0.964646
INFO:tensorflow:loss = 117.81976, step = 15691 (103.665 sec)
INFO:tensorflow:Saving checkpoints for 15792 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
{'parallelize': True, 'shuffle': True, 'max_text_length': None, 'min_text_length': None, 'max_wave_length': 320000, 'random_shift_min': -4000, 'random_shift_max': 4000, 'random_stretch_min': 0.8, 'random_stretch_max': 1.2, 'random_noise': 0.75, 'random_noise_factor_min': 0.1, 'random_noise_factor_max': 0.15, 'epochs': 1, 'batch_size': 16, 'augment': False}
Constraining dataset to the max_wave_length
Resulting dataset length: 10
INFO:tensorflow:Calling model_fn.
Is training? False
logits: Tensor("speech_net/add:0", shape=(?, ?, 1836), dtype=float32)
lengths: Tensor("Cast_1:0", shape=(?,), dtype=int32)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-11-25T14:57:37Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-15792
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-11-25-14:57:40
INFO:tensorflow:Saving dict for global step 15792: edit_distance = 0.0, global_step = 15792, loss = 0.0
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 15792: stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-15792
INFO:tensorflow:Loading best metric from event files.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary_iterator.py:68: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
INFO:tensorflow:global_step/sec: 0.807006
INFO:tensorflow:loss = 123.38585, step = 15791 (123.914 sec)
INFO:tensorflow:global_step/sec: 1.02712
INFO:tensorflow:loss = 97.64815, step = 15891 (97.360 sec)
INFO:tensorflow:global_step/sec: 0.907419
INFO:tensorflow:loss = 124.67094, step = 15991 (110.204 sec)
INFO:tensorflow:global_step/sec: 0.971523
INFO:tensorflow:loss = 110.284996, step = 16091 (102.931 sec)
INFO:tensorflow:global_step/sec: 0.872811
INFO:tensorflow:loss = 116.599594, step = 16191 (114.572 sec)
INFO:tensorflow:global_step/sec: 0.931521
INFO:tensorflow:loss = 118.370834, step = 16291 (107.351 sec)
INFO:tensorflow:Saving checkpoints for 16392 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
INFO:tensorflow:Skip the current checkpoint eval due to throttle secs (1800 secs).
INFO:tensorflow:global_step/sec: 0.862026
INFO:tensorflow:loss = 123.44051, step = 16391 (116.004 sec)
INFO:tensorflow:global_step/sec: 0.942657
INFO:tensorflow:loss = 109.44376, step = 16491 (106.084 sec)
INFO:tensorflow:global_step/sec: 0.806579
INFO:tensorflow:loss = 146.56006, step = 16591 (123.981 sec)
INFO:tensorflow:global_step/sec: 0.766367
INFO:tensorflow:loss = 158.31076, step = 16691 (130.485 sec)
INFO:tensorflow:global_step/sec: 0.908245
INFO:tensorflow:loss = 117.88295, step = 16791 (110.102 sec)
INFO:tensorflow:global_step/sec: 0.916414
INFO:tensorflow:loss = 133.88837, step = 16891 (109.122 sec)
INFO:tensorflow:Saving checkpoints for 16986 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
INFO:tensorflow:Skip the current checkpoint eval due to throttle secs (1800 secs).
INFO:tensorflow:global_step/sec: 0.928797
INFO:tensorflow:loss = 118.173836, step = 16991 (107.666 sec)
INFO:tensorflow:global_step/sec: 0.852818
INFO:tensorflow:loss = 137.09431, step = 17091 (117.257 sec)
INFO:tensorflow:global_step/sec: 0.754881
INFO:tensorflow:loss = 154.90347, step = 17191 (132.472 sec)
INFO:tensorflow:global_step/sec: 0.904402
INFO:tensorflow:loss = 118.48282, step = 17291 (110.571 sec)
INFO:tensorflow:global_step/sec: 0.858431
INFO:tensorflow:loss = 128.09567, step = 17391 (116.491 sec)
INFO:tensorflow:Saving checkpoints for 17492 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
{'parallelize': True, 'shuffle': True, 'max_text_length': None, 'min_text_length': None, 'max_wave_length': 320000, 'random_shift_min': -4000, 'random_shift_max': 4000, 'random_stretch_min': 0.8, 'random_stretch_max': 1.2, 'random_noise': 0.75, 'random_noise_factor_min': 0.1, 'random_noise_factor_max': 0.15, 'epochs': 1, 'batch_size': 16, 'augment': False}
Constraining dataset to the max_wave_length
Resulting dataset length: 10
INFO:tensorflow:Calling model_fn.
Is training? False
logits: Tensor("speech_net/add:0", shape=(?, ?, 1836), dtype=float32)
lengths: Tensor("Cast_1:0", shape=(?,), dtype=int32)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-11-25T15:29:58Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-17492
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-11-25-15:30:01
INFO:tensorflow:Saving dict for global step 17492: edit_distance = 0.0, global_step = 17492, loss = 0.0
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 17492: stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-17492
INFO:tensorflow:global_step/sec: 0.784692
INFO:tensorflow:loss = 127.08783, step = 17491 (127.439 sec)
INFO:tensorflow:global_step/sec: 0.940338
INFO:tensorflow:loss = 115.03967, step = 17591 (106.344 sec)
INFO:tensorflow:global_step/sec: 0.933841
INFO:tensorflow:loss = 111.89447, step = 17691 (107.086 sec)
INFO:tensorflow:global_step/sec: 0.817008
INFO:tensorflow:loss = 133.90723, step = 17791 (122.397 sec)
INFO:tensorflow:global_step/sec: 0.89028
INFO:tensorflow:loss = 122.81941, step = 17891 (112.325 sec)
INFO:tensorflow:global_step/sec: 1.09834
INFO:tensorflow:loss = 90.83808, step = 17991 (91.046 sec)
kamilc commented 4 years ago

@Honghe it definitely seems that the audio loading is not being done in parallel (although I can see you have parallelize=True). The sequential load is the bottleneck here.

  1. did you change anything in the input_fn or any other part of the code?
  2. could you try using pdb or print debug messages inside the generator_fn to make sure it really goes into the "parallel" branch of the if statement?
  3. could you run a simple test to see if the multiprocessing works the way we'd expect?

You'd create a pool by hand and spin up e.g. 16 background processes (you could make them compute something meaningless, just to see whether all the CPUs are utilized).

This would look like the following:

from multiprocessing import Pool

def process(i):
  # Busy-loop forever just to load one CPU core; stop with Ctrl-C.
  a = 1
  while True:
    a *= 2
    a %= 4294967296

pool = Pool()  # defaults to one worker per CPU core
inputs = range(1, 16)

pool.map(process, inputs)

If everything works as it should, you should see all your CPUs utilized.

Honghe commented 4 years ago

Multiprocessing is OK. Most of the time is spent at INFO:tensorflow:Saving checkpoints..., e.g. more than one minute, during which one CPU core is at 100%.

INFO:tensorflow:Using config: {'_model_dir': 'stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.5
  allow_growth: true
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8ba46bd6d8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
{'parallelize': True, 'shuffle': True, 'max_text_length': None, 'min_text_length': None, 'max_wave_length': 320000, 'random_shift_min': -4000, 'random_shift_max': 4000, 'random_stretch_min': 0.8, 'random_stretch_max': 1.2, 'random_noise': 0.75, 'random_noise_factor_min': 0.1, 'random_noise_factor_max': 0.15, 'epochs': 10000, 'batch_size': 48, 'augment': True}
Constraining dataset to the max_wave_length
Resulting dataset length: 2071
INFO:tensorflow:Calling model_fn.
Is training? True
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/metrics_impl.py:363: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt-67542
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 67542 into stats/causal_convolutions_False/codename_deep_max_20_seconds/frame_step_640/lower_edge_hertz_0/n_fft_1280/num_mel_bins_160/optimizer_Momentum/sampling_rate_16000/stack_dilation_rates_1_3_9_27/stack_filters_384/stack_kernel_size_7/stacks_6/upper_edge_hertz_8000/model.ckpt.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
f16hari commented 4 years ago

I used the saved model to verify the result, but the output is complete gibberish. This is the code that I used:

import tensorflow as tf
import librosa
import json

import tensorflow.compat.v1 as tf1

loaded = tf.saved_model.load("C:/Users/HARI/Desktop/FinalYearProject/TensorFlowLite/vv/1546646971")
print(list(loaded.signatures.keys()))
infer = loaded.signatures["serving_default"]
print(infer.structured_outputs)
SAMPLING_RATE = 16000
wave, _ = librosa.load("audio.wav", sr=SAMPLING_RATE)
output = infer(audio=tf.constant(wave), length=tf.constant(len(wave)))

print(output)

Can you please attach your final model? Please, this is a request.

Mahdhir commented 3 years ago

Although the saved model was updated, I am still facing the issue of gibberish output.

import tensorflow as tf
import librosa
import json

import tensorflow.compat.v1 as tf1

loaded = tf.saved_model.load("./1546646971")
print(list(loaded.signatures.keys()))
infer = loaded.signatures["serving_default"]
SAMPLING_RATE = 16000
wave, _ = librosa.load("test-me.m4a", sr=SAMPLING_RATE)
print(infer.structured_outputs)
wv = tf.constant(len(wave),shape=(1, ),dtype=tf.int32)
output = infer(audio=tf.constant(wave),length=wv)