didi / athena

A release version for https://github.com/athena-team/athena
Apache License 2.0

AISHELL avg_acc is zero all the time in decoding step. #29

Closed · TeaPoly closed this 4 years ago

TeaPoly commented 4 years ago

First of all, thanks for your contribution. It's impressive that you not only wrote the paper 'IMPROVING TRANSFORMER-BASED SPEECH RECOGNITION USING UNSUPERVISED PRE-TRAINING' but also shared the code. Thank you very much.

There is a problem when I use the athena project. The "Preparing data", "Pretraining", and "Fine-tuning" steps look fine, and the dev dataset has high accuracy. But when I run athena/decode_main.py on the test dataset in the "Decoding" step, the "avg_acc" is always zero in the log messages, like this:

```
INFO:absl:predictions: tf.Tensor([[4233]], shape=(1, 1), dtype=int64) labels: [[ 424 2477 3491 1238 850 1284 1269]] errs: 7 avg_acc: 0.0000 sec/iter: 0.3383
```
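(For context, an accuracy of this form is typically derived from the edit distance between the hypothesis and the reference. The sketch below only illustrates that relationship using the numbers from the log line above; it is not the actual athena metric code.)

```python
# Rough illustration (not the actual athena implementation) of how an
# edit-distance-based accuracy like "avg_acc" can end up at zero when the
# hypothesis has a single token and the reference has seven.

def edit_distance(hyp, ref):
    """Levenshtein distance between two token sequences."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion (extra hyp token)
                                   d[j - 1] + 1,      # insertion (missing ref token)
                                   prev + (h != r))   # substitution (0 if equal)
    return d[len(ref)]

hyp = [4233]                                     # predictions from the log
ref = [424, 2477, 3491, 1238, 850, 1284, 1269]   # labels from the log
errs = edit_distance(hyp, ref)                   # 7 edits, as in the log
avg_acc = max(0.0, 1.0 - errs / len(ref))        # 0.0, as in the log
print(errs, avg_acc)
```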

My script is modified from the "hkust" example; here it is:

```bash
# coding=utf-8
# Copyright (C) ATHENA AUTHORS
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

if [ "athena" != $(basename "$PWD") ]; then
    echo "You should run this script in athena directory!!"
    exit 1
fi

source tools/env.sh

stage=3
stop_stage=3

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    echo "Preparing data"
    python examples/asr/aishell/local/prepare_data.py /nfs/project/datasets/opensource_data/aishell
    mkdir -p examples/asr/aishell/data
    cp /nfs/project/datasets/opensource_data/aishell/{train,dev}.csv examples/asr/aishell/data/

    # cal cmvn
    cat examples/asr/aishell/data/train.csv > examples/asr/aishell/data/all.csv
    tail -n +2 examples/asr/aishell/data/dev.csv >> examples/asr/aishell/data/all.csv
    python athena/cmvn_main.py examples/asr/aishell/mpc.json examples/asr/aishell/data/all.csv
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # pretrain stage
    echo "Pretraining"
    # we recommend training with multi-gpu, for single gpu, run "python athena/main.py examples/asr/aishell/mpc.json" instead
    horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/aishell/mpc.json
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # finetuning stage
    echo "Fine-tuning"
    # we recommend training with multi-gpu, for single gpu, run "python athena/main.py examples/asr/aishell/mtl_transformer.json" instead
    horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/aishell/mtl_transformer.json
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # decoding stage
    echo "Decoding"
    # prepare language model
    tail -n +2 examples/asr/aishell/data/train.csv | cut -f 3 > examples/asr/aishell/data/text
    python examples/asr/aishell/local/segment_word.py examples/asr/aishell/data/vocab \
       examples/asr/aishell/data/text > examples/asr/aishell/data/text.seg
    tools/kenlm/build/bin/lmplz -o 4 < examples/asr/aishell/data/text.seg > examples/asr/aishell/data/lm.bin

    python athena/decode_main.py examples/asr/aishell/mtl_transformer.json
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    echo "training rnnlm"
    tail -n +2 examples/asr/aishell/data/train.csv | awk '{print $3"\t"$3}' > examples/asr/aishell/data/train.trans.csv
    tail -n +2 examples/asr/aishell/data/dev.csv | awk '{print $3"\t"$3}' > examples/asr/aishell/data/dev.trans.csv
    python athena/main.py examples/asr/aishell/rnnlm.json
fi
```

I'm not sure where the problem is.

Some-random commented 4 years ago

It does seem like a problem in the beam search stage. @cookingbear, please look into this.

TeaPoly commented 4 years ago

Here is the log: avg_acc_zero.log

cookingbear commented 4 years ago

can you share your config?

TeaPoly commented 4 years ago

> can you share your config?

mtl_transformer.zip

cookingbear commented 4 years ago

> can you share your config?
>
> mtl_transformer.zip

Have you tested the dev dataset in the decoding step, and does it show the same problem as the test dataset?

TeaPoly commented 4 years ago

> Have you tested the dev dataset in the decoding step, and does it show the same problem as the test dataset?

I have tested the dev dataset; the results in the decoding step look fine.

TeaPoly commented 4 years ago

Could it be an overfitting or CMVN issue?

cookingbear commented 4 years ago

> Could it be an overfitting or CMVN issue?

The CMVN of the test dataset is missing. If you want to run the script from the beginning, add the line below under the "# cal cmvn" line:

tail -n +2 examples/asr/aishell/data/test.csv >> examples/asr/aishell/data/all.csv

If you only want to run decode_main.py, you can instead remove examples/asr/aishell/data/cmvn; the system will automatically create the CMVN for the decoding dataset.
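For reference, with this fix the CMVN block in stage 0 of the script above would look roughly like the following. This also assumes test.csv has been copied into examples/asr/aishell/data/ alongside train.csv and dev.csv:

```bash
    # copy test.csv as well (assumed), so its utterances can feed the CMVN statistics
    cp /nfs/project/datasets/opensource_data/aishell/{train,dev,test}.csv examples/asr/aishell/data/

    # cal cmvn
    cat examples/asr/aishell/data/train.csv > examples/asr/aishell/data/all.csv
    tail -n +2 examples/asr/aishell/data/dev.csv >> examples/asr/aishell/data/all.csv
    tail -n +2 examples/asr/aishell/data/test.csv >> examples/asr/aishell/data/all.csv
    python athena/cmvn_main.py examples/asr/aishell/mpc.json examples/asr/aishell/data/all.csv
```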

TeaPoly commented 4 years ago

Got it. Thanks so much.