AdolfVonKleist / Phonetisaurus

Phonetisaurus G2P
BSD 3-Clause "New" or "Revised" License

Alignment outputs are not as expected #32

Open xanguera opened 7 years ago

xanguera commented 7 years ago

Hi, I am using phonetisaurus to align a grapheme input to its phonetic transcription. For this I use the phonetisaurus-align tool with alignment models trained on CMUDict. In a few cases I see that the output does not match the input. For example:
Input to the aligner: OVERAWE OW1 V ER0 AA2
Output from the aligner: O}OW1 V}V E} R} A} E}

I had to work around it by counting the phonemes and graphemes in the input and in the output and doing something else if they do not match, but I was wondering whether phonetisaurus could raise an error or warning in these cases. Currently it exits normally, without any sign that an issue occurred.
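A check of this kind can be sketched roughly as follows. This is only an illustration, not the script actually used: the function names are hypothetical, and the separators ("}" between graphemes and phonemes, "|" inside a chunk, "_" for a skip) are assumed to match the aligner output shown above.

```python
def parse_alignment(aligned, pair_sep="}", chunk_sep="|", skip="_"):
    """Split one aligner output line into grapheme and phoneme lists."""
    graphemes, phonemes = [], []
    for token in aligned.split():
        g, _, p = token.partition(pair_sep)
        # Drop empty strings and skip markers on both sides
        graphemes.extend(s for s in g.split(chunk_sep) if s and s != skip)
        phonemes.extend(s for s in p.split(chunk_sep) if s and s != skip)
    return graphemes, phonemes


def alignment_matches(word, pron, aligned):
    """True if the alignment reproduces both the word and the pronunciation."""
    graphemes, phonemes = parse_alignment(aligned)
    return "".join(graphemes) == word and phonemes == pron.split()


# The failing case from this report: the W and two phonemes are missing
print(alignment_matches("OVERAWE", "OW1 V ER0 AA2", "O}OW1 V}V E} R} A} E}"))
# prints False
```

Comparing the reconstructed sequences, rather than only their lengths, also catches cases where the counts happen to agree but the symbols differ.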

Thanks!

AdolfVonKleist commented 7 years ago

This looks quite strange. I’ll take a look and get back to you.

Can you let me know what version or revision you are working with?

Thanks! Joe

AdolfVonKleist commented 7 years ago

Can you also share the version of the cmudict that you are using, or a link to the revision in their corresponding repo?

I cannot find the example word you shared in any recent revision I have handy. In theory this should not be possible: the aligner builds a lattice for each entry, and the provided example does not look like the result of a valid path terminating in a valid final state. It looks like part of the pronunciation may have been truncated during read, maybe space/tab related?
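To illustrate the point about valid paths: in a lattice of this kind, every complete path has to consume all graphemes and all phonemes before it reaches the final state. The toy below counts such paths for a many-to-many alignment. It is a simplified sketch, not the WFST code in Phonetisaurus; `count_alignments`, the chunk limits, and the rule that only a single grapheme may map to a skip are assumptions chosen for illustration.

```python
from functools import lru_cache


def count_alignments(graphemes, phonemes, max_g=2, max_p=2, allow_skip=True):
    """Count monotone paths that consume every grapheme and every phoneme."""
    G, P = len(graphemes), len(phonemes)

    @lru_cache(maxsize=None)
    def paths(i, j):
        # Final state: both sequences fully consumed
        if i == G and j == P:
            return 1
        total = 0
        for dg in range(1, max_g + 1):
            if i + dg > G:
                break
            # A single grapheme may align to a skip (no phoneme consumed)
            if allow_skip and dg == 1:
                total += paths(i + dg, j)
            for dp in range(1, max_p + 1):
                if j + dp > P:
                    break
                total += paths(i + dg, j + dp)
        return total

    return paths(0, 0)


# A}_ B}X, A}X B}_ and A|B}X are the only complete paths
print(count_alignments("AB", ["X"]))  # prints 3
# One grapheme cannot cover three phonemes with chunks of at most two
print(count_alignments("A", ["X", "Y", "Z"]))  # prints 0
```

Under these assumptions, an output such as `O}OW1 V}V E} R} A} E}`, which drops a grapheme and two phonemes, does not correspond to any complete path.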

I tried to reproduce similar behavior with the latest version of the aligner in master, and the latest version of the cmudict:

$ wget https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict
$ cat cmudict.dict   | perl -pe 's/\([0-9]+\)//;
              s/\s+/ /g; s/^\s+//;
              s/\s+$//; @_ = split (/\s+/);
              $w = shift (@_);
              $_ = $w."\t".join (" ", @_)."\n";'   > cmudict.formatted.dict
$ phonetisaurus-train --lexicon cmudict.formatted.dict --seq2_del
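
For readers who do not use Perl, the formatting step above can be approximated in Python. This is a sketch that assumes the same input layout (a word, an optional variant marker such as `(2)`, then whitespace-separated phonemes); `format_entry` is a hypothetical helper name and the sample line is made up for illustration.

```python
import re


def format_entry(line):
    """Return 'word<TAB>phoneme phoneme ...' for one dictionary line."""
    # Remove the first variant marker, e.g. "word(2)" -> "word"
    line = re.sub(r"\([0-9]+\)", "", line, count=1)
    # split() with no argument collapses runs of spaces and tabs
    tokens = line.split()
    if not tokens:
        return ""
    return tokens[0] + "\t" + " ".join(tokens[1:])


# Illustrative input line, not taken from the dictionary
print(repr(format_entry("word(2)   w  er1 d\n")))
# prints 'word\tw er1 d'
```

The single tab between word and pronunciation matters here, because the comparison script that follows splits each lexicon entry on a tab.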

I wrote the following script which I think performs the comparison you described:

#!/usr/bin/env python3
import re, sys
from collections import defaultdict

def ProcessAligned (corpusfile, lexicon) :
    # Rebuild each (word, pronunciation) pair from the aligned corpus and
    # print any pair that is missing from the reference lexicon.
    with open (corpusfile, "r", encoding="utf8") as ifp :
        for line in ifp :
            graphs = []; phones = []
            tokens = re.split (r"\s+", line.strip ())
            for token in tokens :
                # Each token is "graphemes}phonemes"; "|" joins multi-symbol chunks
                g,p = re.split (r"\}", token)
                graphs.extend (re.split (r"\|", g))
                phones.extend (re.split (r"\|", p))
            # "_" marks a skip and is not part of the word or pronunciation
            word = "".join ([g for g in graphs if not g == "_"])
            pron = " ".join ([p for p in phones if not p == "_"])
            prons = lexicon [word]
            if not pron in prons :
                print ("{0}\t{1}".format (word, pron))
    return

def LoadLexicon (lexiconfile) :
    # Map each word to the list of its pronunciations (tab-separated input)
    lexicon = defaultdict (list)
    with open (lexiconfile, "r", encoding="utf8") as ifp :
        for entry in ifp :
            word, pron = re.split (r"\t", entry.strip ())
            lexicon [word].append (pron)

    return lexicon

if __name__ == "__main__" :
    # Usage: proc.py <lexicon> <aligned corpus>
    lexicon = LoadLexicon (sys.argv [1])
    ProcessAligned (sys.argv [2], lexicon)

when I run it against the reference lexicon and resulting aligned corpus:

$ python proc.py ../cmudict.formatted.dict model.corpus
$

all pronunciations from the original are found. This again makes me think that it may be an issue related to spaces in the lexicon being read in. Lemme know!

xanguera commented 7 years ago

I am using git revision 195f31-dirty. The word OVERAWE is not in CMUDict; I computed its transcription using Phonetisaurus' G2P model trained on CMUDict, and then tried to align graphemes to phonemes, unsuccessfully, as you can see.

thanks!

AdolfVonKleist commented 7 years ago

Ah OK, I did not quite understand at first. Can you use the python bindings or the script interface directly? This will actually return the original alignment from the decoding step, and will also retain the arc weights from the joint sequence LM, including backoff epsilon arcs.

The python bindings/script interface return the following result in my case:

$ ./script/phoneticize.py --model /tmp/experiment/train/model.fst --word overawe
0.00    OW1 V ER0 AA1
-------
o:OW1:5.37
v:V:0.84
e|r:ER0:0.06
<eps>:<eps>:1.85
<eps>:<eps>:0.49
a:AA1:5.51
<eps>:<eps>:0.29
w:_:4.21
<eps>:<eps>:0.29
e:_:2.63
<eps>:<eps>:1.06
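
If aligner-style output is needed from a listing like the one above, it could be post-processed roughly as follows. This is a sketch that assumes each arc line has the form `input:output:weight`, as the listing suggests, and that `<eps>:<eps>` lines are the backoff arcs; `parse_arcs` and `to_aligner_format` are hypothetical helper names, and only the arc lines (not the header or the dashed separator) are passed in.

```python
def parse_arcs(lines, eps="<eps>"):
    """Parse 'input:output:weight' lines, dropping pure epsilon arcs."""
    arcs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so the weight and output label come off first
        graphemes, phoneme, weight = line.rsplit(":", 2)
        if graphemes == eps and phoneme == eps:
            continue
        arcs.append((graphemes, phoneme, float(weight)))
    return arcs


def to_aligner_format(arcs, pair_sep="}"):
    """Join arcs as 'graphemes}phoneme' tokens separated by spaces."""
    return " ".join(g + pair_sep + p for g, p, _ in arcs)


output = """o:OW1:5.37
v:V:0.84
e|r:ER0:0.06
<eps>:<eps>:1.85
<eps>:<eps>:0.49
a:AA1:5.51
<eps>:<eps>:0.29
w:_:4.21
<eps>:<eps>:0.29
e:_:2.63
<eps>:<eps>:1.06"""

print(to_aligner_format(parse_arcs(output.splitlines())))
# prints: o}OW1 v}V e|r}ER0 a}AA1 w}_ e}_
```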

xanguera commented 7 years ago

Hi, I am not interested in getting transcriptions for some of the words, as some are entered manually by the user, although I do need alignments for all of them. Why would the python wrapper behave differently from the executable? In any case, I wrote a simple script to detect alignment issues, and right now I am discarding the affected entries so that they do not break my pipeline.

Thanks

X.
