sauravjoshi opened this issue 4 years ago
I have a similar question for @mohitsshah: I am trying to fine-tune your model on a new dataset, for example Librispeech. However, when I generate the Librispeech data and continue training from your provided weights, the results are completely wrong and don't make sense. I am using the following script to continue training the model:
```bash
DATA_DIR=/media/disk3/Voice2text/t2t_data/
TMP_DIR=/media/disk3/Voice2text/t2t_datagen/
TRAIN_DIR=/media/disk3/Voice2text/t2t_train/librispeech_english/
PROBLEM=at16k_subword

python /media/disk3/Voice2text/env/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py \
  --t2t_usr_dir=/media/disk3/Voice2text/ \
  --data_dir=$DATA_DIR \
  --output_dir=$TRAIN_DIR \
  --model=transformer \
  --worker_gpu_memory_fraction=0.9 \
  --hparams_set=transformer_librispeech_tpu \
  --hparams=max_length=295650,max_input_seq_length=3650,max_target_seq_length=250 \
  --train_steps=7000000 \
  --problem=$PROBLEM \
  --allow_growth=True
```
Could you provide the data generation command you used, applied for example to the Librispeech dataset, including the flags you set?
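For reference, below is the kind of t2t-datagen invocation I would expect for a custom problem registered in a usr dir. This is only a sketch based on my own assumptions (paths, tmp dir, flags), not necessarily what you ran:

```bash
# Sketch only: a typical t2t-datagen call for a custom registered problem.
# The paths below are my assumptions, not the ones you used.
DATA_DIR=/media/disk3/Voice2text/t2t_data/
TMP_DIR=/media/disk3/Voice2text/t2t_datagen/
PROBLEM=at16k_subword

t2t-datagen \
  --t2t_usr_dir=/media/disk3/Voice2text/ \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM
```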
I'm relatively new to t2t and was exploring it for ASR when I came across your work. Amazing work, @mohitsshah, and thanks for the clear explanation of at16k. The results are pretty impressive. I'm planning to extend the model for a domain-specific use case, which will also involve extending the vocab. I would appreciate your help with the following.
The At16kSubword class has the multiprocess_generate property set to True, which means the data is generated across multiple processes. What configuration did you use for this, and given the hours of data, how long did generation take?
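To make sure I'm reading this right, here is my understanding of the multiprocess hooks in tensor2tensor. This is only a minimal sketch with assumed values (class name, base class, worker count), not your actual implementation:

```python
# Sketch of the tensor2tensor hooks I believe are involved when
# multiprocess_generate is True; the class name and numbers are assumptions.
from tensor2tensor.data_generators import speech_recognition
from tensor2tensor.utils import registry


@registry.register_problem
class MySpeechSubword(speech_recognition.SpeechRecognitionProblem):

  @property
  def multiprocess_generate(self):
    # Tells t2t-datagen to fan generation out over several worker processes.
    return True

  @property
  def num_generate_tasks(self):
    # Assumed value: how many parallel generation tasks to launch.
    return 16

  def prepare_to_generate(self, data_dir, tmp_dir):
    # Runs once before the workers start, e.g. to download the corpus or
    # build the subword vocab so every worker sees the same files.
    pass
```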
Also, the core generate_data and generator functions aren't defined in the released code. What data did you build it with? Did you use Librispeech and add your own data on top of it? I would need those two function definitions to keep my additional fine-tuning data in sync with yours. Could you provide them?
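In case it helps clarify what I'm after, this is my mental model of those two functions, based on how the Librispeech problem in tensor2tensor is structured. The corpus iteration, shard counts, and example keys are my assumptions, not your code:

```python
# Sketch of generator()/generate_data() in the style of tensor2tensor's
# Librispeech problem; the corpus loading and shard counts are assumptions.
from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import speech_recognition


def load_utterances(tmp_dir, split):
  # Hypothetical stub: in practice this would walk the corpus directory
  # and return (wav_path, transcript) pairs for the given split.
  return []


class MySpeechSubword(speech_recognition.SpeechRecognitionProblem):

  def generator(self, data_dir, tmp_dir, utterances):
    # `utterances` is assumed to be an iterable of (wav_path, transcript).
    encoders = self.feature_encoders(data_dir)
    audio_encoder = encoders["waveforms"]
    target_encoder = encoders["targets"]
    for wav_path, transcript in utterances:
      yield {
          "waveforms": audio_encoder.encode(wav_path),
          "targets": target_encoder.encode(transcript),
      }

  def generate_data(self, data_dir, tmp_dir, task_id=-1):
    # Shard counts (100 train / 1 dev) are placeholder values.
    train_paths = self.training_filepaths(data_dir, 100, shuffled=False)
    dev_paths = self.dev_filepaths(data_dir, 1, shuffled=False)
    generator_utils.generate_dataset_and_shuffle(
        self.generator(data_dir, tmp_dir, load_utterances(tmp_dir, "train")),
        train_paths,
        self.generator(data_dir, tmp_dir, load_utterances(tmp_dir, "dev")),
        dev_paths)
```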
The approx_vocab_size is set to only 1000? If the goal is to extend the vocab while reusing the existing vocab that feature_encoders() loads, will the new sub-words be added to it, or does a new vocab with the additional sub-words have to be created? As far as I know, the vocab is generated during the data-gen phase.
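My current understanding, sketched below with assumed vocab filenames: the subword mapping is fixed at data-gen time, so reusing the shipped vocab file keeps the old subword ids, while building a new vocab produces a different mapping that no longer matches the pretrained target embeddings. Please correct me if that's wrong:

```python
# Sketch of the two options as I understand them; the vocab filenames are
# assumptions, not the actual files shipped with at16k.
from tensor2tensor.data_generators import text_encoder

# Option 1: reuse the existing vocab file (what feature_encoders() would load),
# keeping the subword ids compatible with the pretrained weights.
existing_vocab = text_encoder.SubwordTextEncoder("vocab.at16k_subword.subwords")

# Option 2: build a new vocab from my domain-specific transcripts; this creates
# a different subword-to-id mapping, so the pretrained target embeddings would
# no longer line up.
def transcript_generator():
  yield "an example transcript from my domain-specific data"

new_vocab = text_encoder.SubwordTextEncoder.build_from_generator(
    transcript_generator(), 1000)  # 1000 ~ approx_vocab_size
new_vocab.store_to_file("vocab.extended.subwords")
```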
Could you provide the data generation command you used, including any additional FLAGS? Could you also share the training command with the FLAGS you used?