alex-berard / seq2seq

Attention-based sequence to sequence learning

Using Alpha code 0.1 for English audio to Russian text translation #12

Closed: Ola131v closed this issue 6 years ago

Ola131v commented 6 years ago

Hi, I am trying to translate audio in English to text in Russian using the Alpha release (0.1) code.

I am working in the experiments/btec_speech folder of the Alpha release code.

I am giving the audio and text aligned chapter-wise (each chapter contains around 1500 Russian words, or around 50000 MFCC values).

I have split the chapters so that 16 chapters are used for train, 2 for dev and 5 for test.

When I generate the MFCC features for each set (i.e. train, dev or test) I concatenate them, so the first 4 bytes of the train MFCC feature file contain 16, those of the dev file contain 2, and those of the test file contain 5.
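The concatenation step looks roughly like this (a simplified sketch; the only assumption about the binary layout is the leading little-endian int32 sequence count described above, so the details may differ from what extract-audio-features.py actually writes):

import struct
import sys

def concat_feature_files(inputs, output):
    # Concatenate binary MFCC feature files, assuming each file starts with an
    # int32 giving the number of sequences it contains (assumed layout).
    total = 0
    bodies = []
    for path in inputs:
        with open(path, 'rb') as f:
            count, = struct.unpack('<i', f.read(4))  # assumed: leading int32 = sequence count
            total += count
            bodies.append(f.read())                  # rest of the file: the sequences themselves
    with open(output, 'wb') as f:
        f.write(struct.pack('<i', total))            # new header: total number of sequences
        for body in bodies:
            f.write(body)

if __name__ == '__main__':
    # e.g. python concat_features.py chapter01.en chapter02.en ... hbfn.train.en
    concat_feature_files(sys.argv[1:-1], sys.argv[-1])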

I am generating the vocab files from the Russian text. Please find attached the prepare.sh file that I am using.

I modified baseline-mono.yaml for training the network.

The modifications that I made in baseline-mono.yaml are:

1) For the encoder: name: en, binary: True (since the input is MFCC values), max_len: False
2) For the decoder: name: ru, binary: False, max_len: False
3) vocab_prefix: vocab

I am pasting the modified baseline-mono.yaml file.

I modified the max_len field in both the encoder and the decoder because the number of words in the Russian text for one chapter, and the number of MFCC coefficients in one frame (one frame of MFCC coefficients corresponds to one line in the Russian text), are much larger than the max_seq_len values already given (25 and 600). The function read_dataset() checks against max_seq_len, so to avoid that check I also changed, in the __init__() of class TranslationModel, 'self.max_len = dict(zip(self.extensions, self.max_input_len + self.max_output_len))' to 'self.max_len = False'.
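For reference, the check I am trying to bypass is essentially a length filter along these lines (an illustrative sketch only, not the actual code of read_dataset()):

def filter_by_length(pairs, max_len):
    # Drop source/target pairs longer than max_len; a max_len of 0 (or False)
    # disables the filter. Illustrative only, not the repository's read_dataset().
    if not max_len:
        return pairs
    return [(src, trg) for src, trg in pairs
            if len(src) <= max_len and len(trg) <= max_len]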

When I try to train the network with these modifications, I get the error below.

11/26 15:56:17 files: experiments/btec_speech/data/hbfn.dev.en experiments/btec_speech/data/hbfn.dev.ru
11/26 15:56:18 size: 2
11/26 15:56:31 starting training
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/__main__.py", line 229, in <module>
    main()
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/__main__.py", line 221, in main
    model.train(sess=sess, **config)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/translation_model.py", line 368, in train
    self.train_step(sess=sess, loss_function=loss_function, use_baseline=use_baseline, **kwargs)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/translation_model.py", line 412, in train_step
    update_baseline=True)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/seq2seq_model.py", line 200, in step
    res = session.run(output_feed, input_feed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1096, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (32, 0) for Tensor 'encoder_en:0', which has shape '(?, ?, 41)'

Could you please tell me whether I am doing this the correct way, or whether I should make any other modifications in order to train the network to translate English audio to Russian text?

Thanks and Regards,

Olga Strizhko

baseline-mono.yaml

label: 'baseline-mono'
description: "mono-speaker baseline on BTEC"

dropout_rate: 0.5
cell_size: 256
attn_size: 256
embedding_size: 256

layers: 2
bidir: True
use_lstm: True
weight_scale: null

data_dir: experiments/btec_speech/data
model_dir: experiments/btec_speech/model
batch_size: 64

train_prefix: hbfn.train   # 'easy' mono-speaker settings
dev_prefix: hbfn.dev

optimizer: 'adam'
learning_rate: 0.001

steps_per_checkpoint: 1000
steps_per_eval: 1000

max_gradient_norm: 5.0
max_steps: 30000
batch_mode: 'standard'
read_ahead: 10
vocab_prefix: vocab

encoders:
  - name: en
    binary: True    # since the input is MFCC values
    max_len: False

decoders:
  - name: ru
    binary: False
    max_len: False

prepare.sh

raw_data_dir=data/raw/btec.en-ru
raw_audio_dir=${raw_data_dir}/speech_en
speech_dir=experiments/btec_speech
data_dir=${speech_dir}/data   # output directory for the processed files (text and audio features)

mkdir -p ${raw_audio_dir} ${data_dir}

scripts/speech/extract-audio-features.py ${raw_audio_dir}/hbfn_wav16_en/train/* --output ${data_dir}/hbfn.train.en
scripts/speech/extract-audio-features.py ${raw_audio_dir}/hbfn_wav16_en/dev/* --output ${data_dir}/hbfn.dev.en
scripts/speech/extract-audio-features.py ${raw_audio_dir}/hbfn_wav16_en/test/* --output ${data_dir}/hbfn.test.en

scripts/prepare-data.py ${data_dir}/hbfn.train ru ${data_dir} --max 0 --lowercase --output vocab --mode vocab
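A quick way to sanity-check the generated feature files (again assuming the leading little-endian int32 count layout mentioned above) is to read the header back and confirm it matches the expected number of chapters:

import struct
import sys

# Print the sequence count stored in the first 4 bytes of a feature file
# (assumed layout: leading little-endian int32).
with open(sys.argv[1], 'rb') as f:
    count, = struct.unpack('<i', f.read(4))
print(count)

For example, running it on ${data_dir}/hbfn.train.en should print 16.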

alex-berard commented 6 years ago

Hello,

The error you're getting is due to your change inside "translation_model.py". You don't need to change this line. Setting "max_len" to 0 inside the config files is enough.

However, I'm sorry to tell you this, but there is no way you'll be able to train the model with full chapters of length 50000. I'm already having trouble because of memory constraints with sequences of length 1500 (the longer the sequence, the more GPU memory seq2seq requires).
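As a rough back-of-the-envelope illustration (approximate numbers only, based on the chapter sizes quoted above, not a measurement of this code): the attention mechanism scores every encoder frame against every output token, so a single chapter-sized example already implies a huge score matrix.

# Rough estimate of the per-example attention score matrix, float32
src_len = 50000        # MFCC frames per chapter (from the numbers above)
trg_len = 1500         # Russian tokens per chapter (from the numbers above)
bytes_per_float = 4    # float32
print(src_len * trg_len * bytes_per_float / 2**20, 'MiB')   # ~286 MiB per example

With batch_size 64 that is on the order of 18 GiB for the attention scores alone, before counting encoder states, activations and gradients.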

Moreover, having so few samples (e.g., 5 samples for test) is not a realistic setting. For example, BTEC (which is a very small corpus by deep learning standards) has 2000/1500/900 samples (for train/dev/test).

You'll need to find a way to split your paragraphs into smaller segments (e.g., sentences).
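For instance, a very naive way to split the Russian side into sentences (just a sketch; the English audio would also need to be segmented at matching boundaries, e.g. with forced alignment, which is the harder part):

import re

def split_sentences(paragraph):
    # Naive split on sentence-final punctuation followed by whitespace;
    # a real pipeline would use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]

with open('chapter.ru', encoding='utf-8') as f:   # hypothetical file name
    for line in f:
        for sentence in split_sentences(line):
            print(sentence)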

Alexandre