bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0

preserve_punctuation inserting spaces around the punctuation #97

Closed by jncasey 2 years ago

jncasey commented 2 years ago

Describe the bug

It seems that, regardless of backend, the preserve_punctuation option pads the punctuation with spaces when it's being reinserted. I'm not sure if this is the intended behavior or a bug, but for my purposes, I'd at least like the option to get the phonemes from my text with the punctuation aligned exactly as it was in the original.

The strip option at first seemed like a solution to this problem as it removes these extra spaces, but it also removes the final phoneme separator of each word, which I want to preserve.

Phonemizer version

phonemizer-3.0
available backends: espeak-ng-1.50, festival-2.5.0, segments-2.2.0
uninstalled backends: espeak-mbrola

System

Ubuntu 21.04, Python 3.9

To reproduce

Starting with the following as lyrics.txt:

Yes, and how many times must the cannonballs fly
Before they're forever banned?

Running phonemize --preserve-punctuation lyrics.txt returns

jɛs , ænd haʊ mɛni taɪmz mʌst ðə kænənbɔːlz flaɪ 
bᵻfoːɹ ðeɪɚ fɚɹɛvɚ bænd ?

As I mentioned in the summary, using --strip removes those extra spaces and returns a correct-looking result

phonemize --preserve-punctuation --strip lyrics.txt

jɛs, ænd haʊ mɛni taɪmz mʌst ðə kænənbɔːlz flaɪ
bᵻfoːɹ ðeɪɚ fɚɹɛvɚ bænd?

But --strip doesn't work for me, since I want to use a phoneme separator and keep every separator so that I can tokenize the phonemes for a language model (e.g. "Yes" > "j|ɛ|s|" > ["j|", "ɛ|", "s|"]).

phonemize -p "|" --preserve-punctuation --strip lyrics.txt

j|ɛ|s, æ|n|d h|aʊ m|ɛ|n|i t|aɪ|m|z m|ʌ|s|t ð|ə k|æ|n|ə|n|b|ɔː|l|z f|l|aɪ
b|ᵻ|f|oːɹ ð|eɪ|ɚ f|ɚ|ɹ|ɛ|v|ɚ b|æ|n|d?
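
To make the tokenization concern concrete, here is a minimal sketch of the kind of tokenizer described above. It is only an illustration: the function name tokenize_phones is hypothetical, it is not part of phonemizer, and it assumes a single-character phone separator. With every separator kept, each phone becomes its own token; on the --strip output, the last phone of a word fuses with the punctuation that follows it.

```python
import re
from typing import List


def tokenize_phones(phonemized: str, sep: str = "|") -> List[str]:
    """Split a phonemized string into tokens (hypothetical helper).

    Each phone keeps its trailing separator; whitespace runs and any
    leftover text (e.g. punctuation) become tokens of their own.
    Assumes `sep` is a single character.
    """
    esc = re.escape(sep)
    # 1) phone followed by the separator, 2) whitespace, 3) anything else
    pattern = rf"[^{esc}\s]+{esc}|\s+|[^{esc}\s]+"
    return re.findall(pattern, phonemized)


# Separators kept: punctuation is a clean, separate token.
print(tokenize_phones("j|ɛ|s|, æ|n|d|"))
# Output of --strip: the final phone of "Yes" merges with the comma ("s,").
print(tokenize_phones("j|ɛ|s, æ|n|d"))
```

The second call shows why the stripped output is awkward for a fixed phoneme vocabulary: "s," and "d" are not the same tokens as "s|" and "d|".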

Expected behavior

This is the result I'm looking for:

phonemize -p "|" --preserve-punctuation [--strip?] lyrics.txt (possibly using some additional flags, if the current behavior is a good default and my use case is an outlier)

j|ɛ|s|, æ|n|d| h|aʊ| m|ɛ|n|i| t|aɪ|m|z| m|ʌ|s|t| ð|ə| k|æ|n|ə|n|b|ɔː|l|z| f|l|aɪ| 
b|ᵻ|f|oːɹ| ð|eɪ|ɚ| f|ɚ|ɹ|ɛ|v|ɚ| b|æ|n|d|?

Additional context

In my local copy of phonemizer, I tried a quick hack to the _restore_current method of the Punctuation class that gives me my desired result: it simply adds text[0] = text[0].rstrip() at the top of the method. This is probably a very bad idea that breaks other things, but in my very limited testing it returns the results I'm after.
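
As a rough illustration of why that one-line rstrip changes the output, here is a toy sketch. It is not phonemizer's actual _restore_current: the function restore_punctuation, its arguments, and the chunk/mark layout are all hypothetical, chosen only to show the effect of stripping trailing whitespace from a chunk before the punctuation mark is re-inserted after it.

```python
def restore_punctuation(chunks, marks, rstrip_before_mark=False):
    """Interleave phonemized chunks with punctuation marks (toy model).

    chunks: phonemized text pieces, each ending with a word separator (space).
    marks:  punctuation strings to re-insert after the matching chunk.
    rstrip_before_mark: if True, drop trailing whitespace from a chunk
        before its mark is appended (the idea behind the rstrip hack).
    """
    out = []
    for i, chunk in enumerate(chunks):
        has_mark = i < len(marks)
        if rstrip_before_mark and has_mark:
            chunk = chunk.rstrip()
        out.append(chunk)
        if has_mark:
            out.append(marks[i])
    return "".join(out)


chunks = ["j|ɛ|s| ", "æ|n|d| "]
marks = [", "]
print(repr(restore_punctuation(chunks, marks)))                           # space before the comma
print(repr(restore_punctuation(chunks, marks, rstrip_before_mark=True)))  # comma attached to "s|"
```

Note that only whitespace is removed, so the final phone separator of each word survives, unlike with --strip.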

jncasey commented 2 years ago

I see from the tests that the added spaces are the expected (and, I assume, desired) behavior.

So maybe the behavior I'm after could be exposed with an additional parameter? I'm not sure what it should be called, though – I had assumed that's what --strip was for until I realized it was also for removing the final phoneme/word separators.

For additional context, for part of my project I'm training some phoneme-to-grapheme language models, and I want to be able to roundtrip the punctuation through phonemize and my p2g model to get a pretty close approximation of the source text.

mmmaat commented 2 years ago

Hi Jesse, first, I appreciate your example with Dylan lyrics ;) Thanks for your rstrip patch. It works, and I'm committing it to master.

jncasey commented 2 years ago

Oh, great! I'm glad that simple tweak worked (and generates acceptable output for the default use case)

Also related to punctuation, how would you feel about users being able to define the punctuation regex directly, instead of simply listing the characters they want to filter? I'm thinking more along the lines of defining punctuation by what it's not (e.g. [^a-zA-ZÀ-ÖØ-öø-ÿ0-9'] to capture everything that's not a numeral, a Latin letter, or an apostrophe). Or would that be too much?
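
To show what "punctuation by negation" could look like, here is a small standalone sketch using Python's re module. It does not use any phonemizer option (no such option is assumed to exist); the helper split_punctuation is hypothetical, and a + quantifier is added to the character class so that consecutive non-word characters are grouped into one mark. Note that the class as written also matches whitespace.

```python
import re

# Anything that is NOT a Latin letter, a digit, or an apostrophe.
NOT_WORD_CHAR = re.compile(r"[^a-zA-ZÀ-ÖØ-öø-ÿ0-9']+")


def split_punctuation(text):
    """Return (words, marks) for `text` (hypothetical helper).

    marks: every run of characters matched by the negated class,
           which includes spaces as well as punctuation.
    words: the remaining non-empty pieces between those runs.
    """
    marks = NOT_WORD_CHAR.findall(text)
    words = [w for w in NOT_WORD_CHAR.split(text) if w]
    return words, marks


words, marks = split_punctuation("Before they're forever banned?")
print(words)
print(marks)
```

Because the apostrophe is excluded from the class, "they're" stays a single word, while the spaces and the final question mark are all captured as marks.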

mmmaat commented 2 years ago

Actually this is related to issue #87. Can you please open a new issue with your feature request? Thanks.

jncasey commented 2 years ago

Sure thing!