Kyubyong / g2p

g2p: English Grapheme To Phoneme Conversion
Apache License 2.0

For I'm, it's, I'll, you're, I've, I'd #8

Closed: begeekmyfriend closed this issue 4 years ago

begeekmyfriend commented 5 years ago
>>> g2p('It\'s')
['IH1', 'T', ' ', 'EH1', 'S'] # Should be ['IH1', 'T', ' ', 'S']
>>> g2p('I\'m')
['AY1', ' ', 'AH0', 'M'] # Should be ['AY1', ' ', 'M']
Suffix   Phoneme(s)
's       S, Z
'll      L
've      V
'd       D
're      R
't       T
'm       M
begeekmyfriend commented 4 years ago

Sorry, I got something wrong. I hope it did not bother you too much...

begeekmyfriend commented 4 years ago

But wait, there are still problems in it.

It wasn't a joke, said Severson,
IH1T WAA1ZEH1NTAY1 AH0 JHOW1K , SEH1D SEH1VER0SAH0N ,
They say/ 'yin yang'%.
DHEY1 SEY1 YIH1N YAE1NG .
I'm a man.
AY1AH0M AH0 MAE1N .
But hey%, thanks for bein/' in my corner%.
BAH1T HHEY1 , THAE1NGKS FAO1R BIY1N IH0N MAY1 KAO1RNER0 .
You'll get it.
YUW1EH1L GEH1T IH1T .
I'd like to write to you.
AY1DIY1 LAY1K TUW1 RAY1T TUW1 YUW1 .
It's OK.
IH1TEH1S OW1KEY1 .
I've got it.
AY1VIY1 GAA1T IH1T .

In the examples above, wasn't, It's, I've and I'd are still wrong...

Kyubyong commented 4 years ago

You're right. I've corrected it by changing the word tokenizer from nltk.word_tokenize to TweetTokenizer. Try again. Thanks!

begeekmyfriend commented 4 years ago

I'm glad to see it works correctly now. Sorry for my late response! So kind of you!

begeekmyfriend commented 4 years ago

Hi, another tiny problem. The new TweetTokenizer cannot distinguish sentence punctuation from the periods inside an abbreviation, as shown below. The original tokenizer seems to handle this case correctly.

>>> from g2p_en import G2p
>>> g2p = G2p()
>>> ''.join(g2p('8 p.m.'))
'EY1T PIY1 . EH1M .'
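
One possible way to keep contractions whole while also treating dotted abbreviations as single tokens is a small regex tokenizer. The sketch below uses only the standard library and is an illustration, not the tokenizer g2p_en uses. TOKEN_RE and tokenize are hypothetical names.

```python
import re

# Alternatives are tried left to right, so abbreviations win over plain words.
TOKEN_RE = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}          # dotted abbreviations such as p.m. or U.S.
    | [A-Za-z]+(?:'[A-Za-z]+)?  # words, keeping contractions like It's whole
    | \d+                       # digit runs
    | [^\w\s]                   # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, skipping whitespace."""
    return TOKEN_RE.findall(text)


print(tokenize("8 p.m."))    # ['8', 'p.m.']
print(tokenize("It's OK."))  # ["It's", 'OK', '.']
```

With this pattern, a sentence that ends in an abbreviation keeps its final period inside the abbreviation token, so sentence-final punctuation would need separate handling.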