bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0
1.18k stars 165 forks source link

Request: more flexibility around punctuation definitions #99

Closed jncasey closed 2 years ago

jncasey commented 2 years ago

Is your feature request related to a problem? Please describe. I'd like more flexibility in defining punctuation, ideally by having access directly to the regex.

Specifically, instead of defining the characters to be counted as punctuation, I think it'd be more useful to me to define which characters are words to be phonemized, and treat everything else as punctuation.

Describe the solution you'd like Something as broad as [^\p{L}\p{M}0-9'] could work as a default, which from what I understand would capture everything that's not a number, unicode letter or its diacritics.

That may be overly broad, though, because I've run into trouble with espeak and characters from Cyrillic and Korean sets already, and I'd imagine characters from other less-supported languages could also be problematic.

Describe alternatives you've considered In my local copy of phonemizer, I've played with hard-coding the punctuation regex like so:

marks = "[^a-zA-ZÀ-ÖØ-öø-ÿ0-9',.$@&+\\-=/\\\]"
self._marks_re = re.compile(fr'(\s*{marks}+\s*|\s*(?<!\d),\s*|\s*(?<!\d)\.(?!\d)\s*)+')

which captures everything that's not a latin character or a set of marks that the backends can pronounce, like "æt" for "@".

The back half of this is also an attempt to handle the problem raised in #87, though I haven't tested it much, and there may be some cases where it breaks.

jncasey commented 2 years ago

@hadware Thinking about this request again. How about adding an option to define a punctuation regex directly? It would override the default punctuation marks, and probably throw an exception if a user tries to define both custom punctuation and a punctuation regex? This is probably the easiest path to get the features I need, without changing the default behavior people have come to expect.

I can try working something up if that feature makes sense to you.

mmmaat commented 2 years ago

Hi @jncasey and thank you for your implication in this project!

Actually, you can already provide punctuation marks as parameter, both from CLI and API as a str that is fed to the Punctuation class and compiled as a regex.

If you prefer to specify directly a precompiled regex instead of a string, it would be possible, for instance, by adding a isinstance check in Punctuation.__init__ (here). But, in that case, you'll lose the Punctuation.mark attribute. I don't think it is really used in the code, so we can imagine raising an exception when accessing a mark attribute from a Punctuation instance built from a regex...

hadware commented 2 years ago

I think I agree with @mmmaat 's suggestion. I'd rather have the signature of Punctuation.__init__ be something like

def __init__(self, marks: Optional[str] = None, pattern: Optional[Union[str, re.Pattern]] = None):
   ...

to make things more obvious maybe? (and if both args are None, it'd still default to DEFAULT_MARKS)

jncasey commented 2 years ago

Thanks, @mmmaat!

I'm aware that it's possible to pass punctuation marks as a string, but in my case, I'm looking to treat all characters that aren't alphanumeric (or accented alpha) as punctuation. That would include the standard marks on the keyboard, but also emoji and lots of other unicode characters. So it's not really practical to pass a list of every possibility.

Which is why I'd like to pass the punctuation regex directly as a parameter.

I like your thought of an additional isinstance check allowing for marks to be a precompiled regex. That would work perfectly in my use case, and require minimal changes everywhere else. But with that approach, it wouldn't be possible to expose the functionality to CLI version, right? There'd be no way to know if the --punctuation-marks parameter was intended as a regex or not.

Does that matter?

jncasey commented 2 years ago

@hadware Oops, we responded at the same time.

I can work off of that signature. And the corresponding new phonemize() param would be punctuation_pattern and the CLI param would be --punctuation-pattern?

hadware commented 2 years ago

Yes! That would be perfect!

hadware commented 2 years ago

(And, again, you can start right away with a [WIP] PR if you feel like working on this now)

PS: you're the best :rocket:

mmmaat commented 2 years ago

Ok @jncasey my idea is incompatible with the CLI.

Why not to add a boolean flag --punctuation-marks-is-regex defaulting to False from the CLI? The name is pretty long, OK, but the punctuation_pattern name is very ambiguous to me and it's confusing to have both marks and pattern for the same functionality.

What do you think?

jncasey commented 2 years ago

In that case, the flag would be checked in main.py, and if it's true, simply args.punctuation_marks = re.compile(args.punctuation_marks)? That's pretty clean.

But ultimately, it seems like a stylistic choice. Happy to implement it either way – I'll follow the lead of whatever you guys collectively decide.