aalto-speech / subword-kaldi

Properly handle position-dependent phones in a subword lexicon FST
MIT License

How to generate data/subword_dict #5

Open chitralekhabhat opened 2 years ago

chitralekhabhat commented 2 years ago

Hi,

I am trying to use subword units with the Kaldi LibriSpeech recipe. I put the code snippet from the README into stage 3 of the LibriSpeech recipe:

if [ $stage -le 3 ]; then

   local/prepare_dict.sh --stage 3 --nj 30 --cmd "$train_cmd" data/local/lm data/local/lm data/subword_dict

   utils/prepare_lang.sh --phone-symbol-table data/lang/phones.txt --num-extra-phone-disambig-syms $extra data/subword_dict   "<UNK>" data/subword_lang/local data/subword_lang

   subdir=data/subword_lang
   tmpdir=data/subword_lang/local

   local/make_lfst_wb.py $(tail -n$extra $subdir/phones/disambig.txt) < $tmpdir/lexiconp_disambig.txt | fstcompile  --isymbols=$subdir/phones.txt --osymbols=$subdir/words.txt --keep_isymbols=false --keep_osymbols=false | fstaddselfloops  $dir/phones/wdisambig_phones.int $subdir/phones/wdisambig_words.int | fstarcsort --sort_type=olabel > $subdir/L_disambig.fst
fi

Please let me know whether I need to prepare data/subword_dict separately or whether this is correct. Currently I get the following error:

FATAL: FstCompiler: Symbol "<w>" is not mapped to any integer arc olabel, symbol table = data/subword_lang/words.txt, source = standard input, line = 1
ERROR: FstHeader::Read: Bad FST header: -
ERROR (fstaddselfloops[5.5.971~1-07043]:ReadFstKaldi():kaldi-fst-io.cc:35) Reading FST: error reading FST header from standard input

[ Stack-Trace: ]
/home/chitralekha/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb42) [0x7f81e210b742]
fstaddselfloops(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x557e2ee630cf]
/home/chitralekha/kaldi/src/lib/libkaldi-fstext.so(fst::ReadFstKaldi(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x1ba) [0x7f81e25685db]
fstaddselfloops(main+0x123) [0x557e2ee62afd]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f81e1794bf7]
fstaddselfloops(_start+0x2a) [0x557e2ee628fa]

kaldi::KaldiFatalErrorERROR: FstHeader::Read: Bad FST header: standard input
Traceback (most recent call last):
  File "local/make_lfst_wb.py", line 65, in <module>
    print_word(word, phones, False, True, 3, 0)
  File "local/make_lfst_wb.py", line 40, in print_word
    print("{}\t{}\t{}\t{}".format(cur_state,next_state,phones[0],word))
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='ANSI_X3.4-1968'>
BrokenPipeError: [Errno 32] Broken pipe

Gastron commented 2 years ago

Looks like the error arises from not having a pronunciation for <w> in your lexicon.
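
One way to confirm this is to check whether the symbol named in the fstcompile error, <w>, appears in the first column of the lexicon and of the generated word symbol table. The following is only a minimal sketch: the helper names are hypothetical, and the path data/subword_dict/lexicon.txt is an assumption (the dict directory may contain lexiconp.txt instead). data/subword_lang/words.txt is the symbol table named in the error message.

```python
import os


def first_column_symbols(lines):
    """Collect the first whitespace-separated field of each non-empty line."""
    symbols = set()
    for line in lines:
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols


def missing_symbols(lines, required=("<w>",)):
    """Return the required symbols that never occur in the first column."""
    present = first_column_symbols(lines)
    return [symbol for symbol in required if symbol not in present]


if __name__ == "__main__":
    # Assumed locations; adjust them to your own setup.
    paths = (
        "data/subword_dict/lexicon.txt",
        "data/subword_lang/words.txt",
    )
    for path in paths:
        if not os.path.exists(path):
            print(f"{path}: not found, skipping")
            continue
        with open(path, encoding="utf-8") as handle:
            missing = missing_symbols(handle)
        if missing:
            print(f"{path}: missing {missing}")
        else:
            print(f"{path}: all required symbols present")
```

If <w> is missing from the lexicon, it will most likely be missing from words.txt as well, since prepare_lang.sh builds that table from the words in the lexicon. In that case fstcompile cannot map the output label, and the later fstaddselfloops and BrokenPipeError messages are consequences of the pipeline receiving no FST.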

TruthLoveLife commented 2 years ago

Hi, did you resolve your issue? I am getting the same error and do not know how to solve it. My call to utils/prepare_lang.sh does not use --phone-symbol-table:

   utils/prepare_lang.sh --num-extra-phone-disambig-syms $extra data/subword_dict "" data/subword_lang/local data/subword_lang || exit 1;
