alex-berard / seq2seq

Attention-based sequence to sequence learning

Using Alpha code 0.1 for English audio to Russian text translation #12

Closed: Ola131v closed this issue 6 years ago

Ola131v commented 6 years ago

Hi, I am trying to translate audio in English to text in Russian using the Alpha release (0.1) code.

I am working in the experiments/btec_speech folder of the Alpha release code.

I am giving the audio and text aligned chapter-wise (each chapter contains around 1500 Russian words, or around 50000 MFCC values).

I have split the chapters so that 16 chapters are used for train, 2 for dev and 5 for test.

When I generate the MFCC features for each set (i.e. train, dev or test) I concatenate them, so the first 4 bytes of the train MFCC feature file contain 16, those of the dev file contain 2, and those of the test file contain 5.
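The concatenation step looks roughly like this (a simplified sketch; the only assumption about the binary layout is the leading little-endian int32 sequence count described above, so the details may differ from what extract-audio-features.py actually writes):

import struct
import sys

def concat_feature_files(inputs, output):
    # Concatenate binary MFCC feature files, assuming each file starts with an
    # int32 giving the number of sequences it contains (assumed layout).
    total = 0
    bodies = []
    for path in inputs:
        with open(path, 'rb') as f:
            count, = struct.unpack('<i', f.read(4))  # assumed: leading int32 = sequence count
            total += count
            bodies.append(f.read())                  # rest of the file: the sequences themselves
    with open(output, 'wb') as f:
        f.write(struct.pack('<i', total))            # new header: total number of sequences
        for body in bodies:
            f.write(body)

if __name__ == '__main__':
    # e.g. python concat_features.py chapter01.en chapter02.en ... hbfn.train.en
    concat_feature_files(sys.argv[1:-1], sys.argv[-1])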

I am generating the vocab files from the Russian text. Please find attached the prepare.sh file that I am using.

I modified baseline-mono.yaml for training the network.

The modifications that I made in baseline-mono.yaml are:

1) For the encoder: name: en, binary: True (since the input is MFCC values), max_len: False
2) For the decoder: name: ru, binary: False, max_len: False
3) vocab_prefix: vocab

I am pasting the modified baseline-mono.yaml file.

I modified the max_len field in both the encoder and the decoder because the number of words in the Russian text for one chapter, and the number of MFCC coefficients in one frame (one frame of MFCC coefficients corresponds to one line in the Russian text), are much larger than the max_seq_len values already given (25 and 600). The function read_dataset() checks against max_seq_len, so to avoid that check I also changed, in the __init__() of class TranslationModel, 'self.max_len = dict(zip(self.extensions, self.max_input_len + self.max_output_len))' to 'self.max_len = False'.
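For reference, the check I am trying to bypass is essentially a length filter along these lines (an illustrative sketch only, not the actual code of read_dataset()):

def filter_by_length(pairs, max_len):
    # Drop source/target pairs longer than max_len; a max_len of 0 (or False)
    # disables the filter. Illustrative only, not the repository's read_dataset().
    if not max_len:
        return pairs
    return [(src, trg) for src, trg in pairs
            if len(src) <= max_len and len(trg) <= max_len]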

When I try to train the network with these modifications, I get the error below.

11/26 15:56:17 files: experiments/btec_speech/data/hbfn.dev.en experiments/btec_speech/data/hbfn.dev.ru
11/26 15:56:18 size: 2
11/26 15:56:31 starting training
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/__main__.py", line 229, in <module>
    main()
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/__main__.py", line 221, in main
    model.train(sess=sess, **config)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/translation_model.py", line 368, in train
    self.train_step(sess=sess, loss_function=loss_function, use_baseline=use_baseline, **kwargs)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/translation_model.py", line 412, in train_step
    update_baseline=True)
  File "/home/sharvinv/work/speec_proc/seq2seq-0.1/translate/seq2seq_model.py", line 200, in step
    res = session.run(output_feed, input_feed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1096, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (32, 0) for Tensor 'encoder_en:0', which has shape '(?, ?, 41)'

Could you please tell me whether I am doing this the correct way, or whether I should make any other modifications in order to train the network to translate English audio to Russian text?

Thanks and Regards,

Olga Strizhko

baseline-mono.yaml

label: 'baseline-mono'
description: "mono-speaker baseline on BTEC"

dropout_rate: 0.5
cell_size: 256
attn_size: 256
embedding_size: 256

layers: 2
bidir: True
use_lstm: True
weight_scale: null

data_dir: experiments/btec_speech/data
model_dir: experiments/btec_speech/model
batch_size: 64

train_prefix: hbfn.train   # 'easy' mono-speaker settings
dev_prefix: hbfn.dev

optimizer: 'adam'
learning_rate: 0.001

steps_per_checkpoint: 1000
steps_per_eval: 1000

max_gradient_norm: 5.0
max_steps: 30000
batch_mode: 'standard'
read_ahead: 10
vocab_prefix: vocab

encoders:
  - name: en
    binary: True    # since the input is MFCC values
    max_len: False

decoders:
  - name: ru
    binary: False
    max_len: False

prepare.sh

raw_data_dir=data/raw/btec.en-ru
raw_audio_dir=${raw_data_dir}/speech_en
speech_dir=experiments/btec_speech
data_dir=${speech_dir}/data   # output directory for the processed files (text and audio features)

mkdir -p ${raw_audio_dir} ${data_dir}

scripts/speech/extract-audio-features.py ${raw_audio_dir}/hbfn_wav16_en/train/* --output ${data_dir}/hbfn.train.en
scripts/speech/extract-audio-features.py ${raw_audio_dir}/hbfn_wav16_en/dev/* --output ${data_dir}/hbfn.dev.en
scripts/speech/extract-audio-features.py ${raw_audio_dir}/hbfn_wav16_en/test/* --output ${data_dir}/hbfn.test.en

scripts/prepare-data.py ${data_dir}/hbfn.train ru ${data_dir} --max 0 --lowercase --output vocab --mode vocab
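A quick way to sanity-check the generated feature files (again assuming the leading little-endian int32 count layout mentioned above) is to read the header back and confirm it matches the expected number of chapters:

import struct
import sys

# Print the sequence count stored in the first 4 bytes of a feature file
# (assumed layout: leading little-endian int32).
with open(sys.argv[1], 'rb') as f:
    count, = struct.unpack('<i', f.read(4))
print(count)

For example, running it on ${data_dir}/hbfn.train.en should print 16.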

alex-berard commented 6 years ago

Hello,

The error you're getting is due to your change inside "translation_model.py". You don't need to change this line. Setting "max_len" to 0 inside the config files is enough.

However, I'm sorry to tell you this, but there is no way you'll be able to train the model with full chapters of length 50000. I'm already having trouble because of memory constraints with sequences of length 1500 (the longer the sequence, the more GPU memory seq2seq requires).
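As a rough back-of-the-envelope illustration (approximate numbers only, based on the chapter sizes quoted above, not a measurement of this code): the attention mechanism scores every encoder frame against every output token, so a single chapter-sized example already implies a huge score matrix.

# Rough estimate of the per-example attention score matrix, float32
src_len = 50000        # MFCC frames per chapter (from the numbers above)
trg_len = 1500         # Russian tokens per chapter (from the numbers above)
bytes_per_float = 4    # float32
print(src_len * trg_len * bytes_per_float / 2**20, 'MiB')   # ~286 MiB per example

With batch_size 64 that is on the order of 18 GiB for the attention scores alone, before counting encoder states, activations and gradients.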

Moreover, having so few samples (e.g., 5 samples for test) is not a realistic setting. For example, BTEC (which is a very small corpus by deep learning standards) has 2000/1500/900 samples (for train/dev/test).

You'll need to find a way to split your paragraphs into smaller segments (e.g., sentences).
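For instance, a very naive way to split the Russian side into sentences (just a sketch; the English audio would also need to be segmented at matching boundaries, e.g. with forced alignment, which is the harder part):

import re

def split_sentences(paragraph):
    # Naive split on sentence-final punctuation followed by whitespace;
    # a real pipeline would use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]

with open('chapter.ru', encoding='utf-8') as f:   # hypothetical file name
    for line in f:
        for sentence in split_sentences(line):
            print(sentence)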

Alexandre