bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0
1.18k stars 165 forks source link

Reworking --preserve_punctuation and --strip #106

Closed jncasey closed 2 years ago

jncasey commented 2 years ago

Is your feature request related to a problem? Please describe. It seems the preserve/restore methods of the Punctuation class make the assumption in a few places that the word separator is always ' '. It happens when the punctuation regex matches \s* and the resulting matched spaces are returned verbatim in the output. I was also guilty of it when I proposed the patch in #97, which triggered the bug reported in #104.

Also, the restore methods do not add word separators to lines that end with punctuation, which was not immediately visible when the word separator is a space, but seems should be the consistent behavior unless --strip is specified.

echo 'hello!' | phonemize -p '|' -w '_' returns h|ə|l|oʊ|_

echo 'hello!' | phonemize -p '|' -w '_' --preserve-punctuation returns h|ə|l|oʊ|_!

but should return h|ə|l|oʊ|!_

Describe the solution you'd like I think the Punctuation methods need to know about the separator and strip values passed to phonemize. I assume as parameters to the methods themselves rather than to the Punctuation constructor, based on how other values are being treated.

Describe alternatives you've considered I may have most of a solution roughly coded in my local version, but first I wanted to check with the maintainers if this is actually a problem and not just expected behavior.

jncasey commented 2 years ago

@hadware Now that #103 is merged, I'd be happy to open a new PR with my fix to this, which would also fix #104.

But first I want to make sure that you agree that I've identified a problem, and with my interpretation of the desired behavior here. If so, it'd mean that I'll update a few of the tests to match this understanding of where separators are supposed to go.

I can also fix #108, either in the same PR or separately – whichever you'd prefer.

hadware commented 2 years ago

Hey!

I think fixing #103 , #104 and #108 with a single PR sound like a good idea :) Go ahead and start a PR (even if it's not entirely done, just prepend the title by "[WIP]", this way I can follow what you're doing).

Thanks a lot for your work!

jncasey commented 2 years ago

Great! I'll build the changes I've been playing with on top of the latest master and upload a WIP in the next day or so.

And, to confirm, you agree that my example above (making sure lines consistently end with word separator regardless of punctuation preservation) is the correct behavior? Some of the test suite is checking for the current behavior, which would be changing, so I'll have to update those tests.

And, additionally: all space characters in the source text should be converted to the word separator. Right now, if a space is captured by the punctuation-preserving regex, it gets included verbatim in the output. I'm thinking that only literal ' ' should be converted, not '\s' in general, or else newlines would get funky.

jncasey commented 2 years ago

Closed by #119