Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
mkdir -p data
mkdir -p ${dict_dir}
TARGET_DIR=${MAIN_ROOT}/dataset
mkdir -p ${TARGET_DIR}
# prepare data
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
if [ ! -d "${MAIN_ROOT}/dataset/alphabet_data/alphabet" ]; then
echo "${MAIN_ROOT}/dataset/alphabet_data/alphabet does not exist. Please donwload alphabet data first."
exit
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# format manifest with tokenids, vocab size
for sub in train dev test ; do
{
python3 ${MAIN_ROOT}/utils/format_data.py \
--cmvn_path "data/mean_std.json" \
--unit_type "spm" \
--spm_model_prefix ${bpeprefix} \
--vocab_path="${dict_dir}/vocab.txt" \
--manifest_path="data/manifest.${sub}.raw" \
--output_path="data/manifest.${sub}"
if [ $? -ne 0 ]; then
echo "Formt mnaifest failed. Terminated."
exit 1
fi
}&
done
wait
echo "format data done."
fi
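If this formatting stage succeeds, it is worth peeking at what format_data.py actually wrote. The following is a minimal Python sketch, assuming only that the formatted manifest is JSON-lines (the usual PaddleSpeech manifest layout); it does not assume any particular field names and simply prints whatever the first record contains.

import json

# Sketch: inspect one formatted manifest and the vocab it was built against.
# Paths are the ones used in the script above.
manifest_path = "data/manifest.train"
vocab_path = "data/lang_char/vocab.txt"

with open(vocab_path, encoding="utf8") as f:
    vocab = [line.strip() for line in f if line.strip()]
print(f"vocab size: {len(vocab)}")

with open(manifest_path, encoding="utf8") as f:
    records = [json.loads(line) for line in f if line.strip()]
print(f"{len(records)} utterances in {manifest_path}")
print("fields of first record:", sorted(records[0]))
print("first record:", records[0])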
The data is English letter-spelling audio that I recorded myself; label.txt is structured like this:
abcd a b c d
alter a l t e r
apple a p p l e
azxs a z x s
bacteria b a c t e r i a
bcd b c d
blast b l a s t
breed b r e e d
budget b u d g e t
burst b u r s t
campus c a m p u s
candidate c a n d i d a t e
consume c o n s u m e
dergbnm d e r g b n m
dfhgh d f h g h
dfhji d f h j i
dispose d i s p o s e
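Because each label is just the key word spelled out letter by letter, a quick sanity check on label.txt can rule out malformed lines before suspecting the model. A minimal sketch, assuming label.txt follows exactly the convention shown above (first column is the word, the remaining columns are its lowercase letters):

import string

# Sketch: verify every label.txt line is "<word> <letter> <letter> ..." where the
# letters are exactly the word spelled out in lowercase a-z.
with open("label.txt", encoding="utf8") as f:
    for lineno, line in enumerate(f, 1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, letters = parts[0], parts[1:]
        ok = (list(word) == letters
              and all(c in string.ascii_lowercase for c in word))
        if not ok:
            print(f"line {lineno} looks malformed: {line.rstrip()}")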
Hi, I want to train a U2 model that recognizes spelled-out English letters. Below is my configuration:

chunk_conformer.yaml:
############################################
# Network Architecture
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: conformer
encoder_conf:
    output_size: 256    # dimension of attention
    attention_heads: 4
    linear_units: 2048  # the number of units of position-wise feed forward
    num_blocks: 12      # the number of encoder blocks
    dropout_rate: 0.1   # sublayer output dropout
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d # encoder input type, you can choose conv2d, conv2d6 and conv2d8
    normalize_before: True
    cnn_module_kernel: 15
    use_cnn_module: True
    activation_type: 'swish'
    pos_enc_layer_type: 'rel_pos'
    selfattention_layer_type: 'rel_selfattn'
    causal: true
    use_dynamic_chunk: true
    cnn_module_norm: 'layer_norm' # using nn.LayerNorm makes model converge faster
    use_dynamic_left_chunk: false
# decoder related
decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1   # sublayer output dropout
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1     # label smoothing option
    length_normalized_loss: false
    reverse_weight: 0.0
    init_type: 'kaiming_uniform'
###########################################
# Data
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
###########################################
# Dataloader
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: 'data/lang_char/bpe_bpe_56'
unit_type: 'spm'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0    # Feed samples from shortest to longest; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512  # if input length > maxlen-in, batchsize is automatically reduced
maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced
minibatches: 0  # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 0
subsampling_factor: 1
num_encs: 1
###########################################
# Training
###########################################
n_epoch: 1000
accum_grad: 32
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
    lr: 0.001
    weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
    lr_decay: 1.0
log_interval: 100
checkpoint:
    kbest_n: 50
    latest_n: 5
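Given the Dataloader and Training values above, it is worth working out how many optimizer updates one epoch actually contains, since accum_grad multiplies the effective batch size and warmup_steps is counted in scheduler steps. The sketch below is plain arithmetic; the training-set size is an assumption to replace with the real count, and it assumes the warmuplr scheduler is stepped once per optimizer update.

# Rough arithmetic for the config above; num_train_utts is an ASSUMPTION.
num_train_utts = 1000        # replace with the real size of manifest.train
batch_size = 32              # Dataloader: batch_size
accum_grad = 32              # Training: accum_grad
warmup_steps = 25000         # Training: scheduler_conf.warmup_steps

effective_batch = batch_size * accum_grad
updates_per_epoch = max(1, num_train_utts // effective_batch)
print(f"effective batch size: {effective_batch} utterances per update")
print(f"~{updates_per_epoch} optimizer updates per epoch")
print(f"warmup would span ~{warmup_steps / updates_per_epoch:.0f} epochs")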
data.sh
#!/bin/bash
stage=-1
stop_stage=100
dict_dir=data/lang_char
# bpemode (unigram or bpe)
nbpe=56
bpemode=bpe
bpeprefix="${dict_dir}/bpe_${bpemode}_${nbpe}"
stride_ms=20
window_ms=30
sample_rate=16000
feat_dim=80
source ${MAIN_ROOT}/utils/parse_options.sh
mkdir -p data
mkdir -p ${dict_dir}
TARGET_DIR=${MAIN_ROOT}/dataset
mkdir -p ${TARGET_DIR}
# prepare data
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
if [ ! -d "${MAIN_ROOT}/dataset/alphabet_data/alphabet" ]; then
echo "${MAIN_ROOT}/dataset/alphabet_data/alphabet does not exist. Please download alphabet data first."
exit
fi
# create manifest json file
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# compute mean and stddev for normalizer
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# use train_set build dict
fi
# use new dict format data
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# format manifest with tokenids, vocab size
fi
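For the "use train_set build dict" stage above (nbpe=56, bpemode=bpe), the BPE model is trained on the training transcripts. Below is only a minimal sentencepiece sketch under the assumption that the transcripts have been dumped one per line into data/lang_char/input.bpe (an illustrative file name); the toolkit's own dict-building utilities typically also add special tokens such as <unk>/<sos>/<eos>, which this sketch omits.

import sentencepiece as spm

# Sketch: train a 56-piece BPE model whose prefix matches spm_model_prefix in the config.
spm.SentencePieceTrainer.train(
    input="data/lang_char/input.bpe",           # ASSUMED dump of training transcripts
    model_prefix="data/lang_char/bpe_bpe_56",   # matches spm_model_prefix above
    vocab_size=56,
    model_type="bpe",
    character_coverage=1.0,
)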
The data is English letter-spelling audio that I recorded myself; label.txt has the structure shown in the sample above.
Almost none of the recognition results are correct.
What could be causing this? I tried both the spm and char unit types and got the same result. Is it a problem with my parameter configuration?
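Since both spm and char unit types gave the same result, one hedged check is to look at what the trained BPE model actually produces for one of these letter-by-letter transcripts, using the model path from the config above (assuming the .model file exists at that prefix):

import sentencepiece as spm

# Sketch: tokenize one label.txt transcript with the BPE model used by unit_type 'spm'.
sp = spm.SentencePieceProcessor()
sp.load("data/lang_char/bpe_bpe_56.model")

text = "b a c t e r i a"             # one transcript from label.txt
print(sp.encode_as_pieces(text))     # the pieces the acoustic model is trained on
print(sp.encode_as_ids(text))        # the corresponding sentencepiece ids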