AdolfVonKleist / Phonetisaurus

Phonetisaurus G2P
BSD 3-Clause "New" or "Revised" License

segmentation fault with cmu openfst g2p models #18

Closed: petronny closed this issue 7 years ago

petronny commented 7 years ago

Hi, I'm using the latest phonetisaurus-g2pfst (branch 1.6.1), compiled with OpenFst 1.6.4 and gcc/g++ 7.1.1, and I want to use the CMU g2p model.

The output when using the original model.fst is empty (see #5), so I first recompiled it from the text version:

$ fstcompile --isymbols=model.input.syms --osymbols=model.output.syms < model.fst.txt > model.fst

But I get a segmentation fault here:

$ phonetisaurus-g2pfst --model=model.fst --word=test --v=3
GitRevision: 0f844f
INFO: FstImpl::ReadHeader: source: model.fst, fst_type: vector, arc_type: standard, version: 2, flags: 0
[1]    16061 segmentation fault (core dumped)  phonetisaurus-g2pfst --model=model.fst --word=test --v=3

Please help

AdolfVonKleist commented 7 years ago

You need to retrain the model. Those models were created with an older version of OpenFst and are not compatible with anything beyond 1.3.5, if I recall correctly.

You could try using fstprint and fstsymbols to dump and then recompile the models, but I think it would make more sense to just retrain against the latest version of CMUdict, following the quickstart examples in the README.md file.
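For reference, a rough sketch of that dump-and-recompile route, assuming you still have an older OpenFst build around to read the original model; the flags below are standard OpenFst options, but I have not verified this against the CMU models:

# With an old OpenFst (1.3.x) on the PATH: dump the model to text and save its
# embedded symbol tables.
$ fstprint --save_isymbols=model.isyms --save_osymbols=model.osyms \
    model.fst > model.fst.txt

# With the current OpenFst: recompile, keeping the symbol tables embedded in
# the binary model. My guess is that the segfault above comes from the model
# having been recompiled without embedded symbol tables, but I have not
# confirmed that.
$ fstcompile --isymbols=model.isyms --osymbols=model.osyms \
    --keep_isymbols --keep_osymbols model.fst.txt > model.fst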

It shouldn't take more than 20 minutes to retrain and recompile everything from scratch. I'll try to do the same this weekend so there are compatible, downloadable example models for the current build.

AdolfVonKleist commented 7 years ago

I've trained a new example model using the latest version from master and the latest version of the cmudict. It is available in the downloads repository:

or you can grab it directly:

The training process for this example was exactly that described in the README.md file, namely:

$ wget https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict
# Clean it up a bit and reformat:
$ cat cmudict.dict \
  | perl -pe 's/\([0-9]+\)//; 
              s/\s+/ /g; s/^\s+//; 
              s/\s+$//; @_ = split (/\s+/); 
              $w = shift (@_); 
              $_ = $w."\t".join (" ", @_)."\n";' \
  > cmudict.formatted.dict
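To illustrate what the one-liner does (with a made-up variant entry, not necessarily verbatim from cmudict): it drops variant markers like "(2)", collapses whitespace, and tab-separates the headword from its pronunciation.

#   before:  test(2) T EH1 S T
#   after:   test<TAB>T EH1 S T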

$ phonetisaurus_train --lexicon cmudict.formatted.dict --seq2_del
INFO::2017-07-09 16:35:31:  Checking command configuration...
INFO::2017-07-09 16:35:31:  Checking lexicon for reserved characters: '}', '|', '_'...
INFO::2017-07-09 16:35:31:  Aligning lexicon...
INFO::2017-07-09 16:37:44:  Training joint ngram model...
INFO::2017-07-09 16:37:46:  Converting ARPA format joint n-gram model to WFST format...
INFO::2017-07-09 16:37:59:  G2P training succeeded: train/model.fst
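Once training finishes you can sanity-check the new model against the word from the original report; the exact output formatting may differ between versions, so this is just the command:

$ phonetisaurus-g2pfst --model=train/model.fst --word=test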

Note that you can probably build a better model than this, especially if you take a bit more care with tidying up the cmudict, but this one should be at least as good as the older example, and it is compatible with the current version of the g2p code.
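For example, if you wanted to restrict training to plain alphabetic headwords (purely illustrative; whether you want to drop such entries depends on your application), something along these lines would do:

# Keep only entries whose headword is lowercase letters, apostrophes or hyphens
# (\x27 is just the apostrophe, to avoid shell quoting issues).
$ perl -ne 'print if /^[a-z\x27-]+\t/;' cmudict.formatted.dict \
  > cmudict.clean.dict

$ phonetisaurus_train --lexicon cmudict.clean.dict --seq2_del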