bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0
1.18k stars 165 forks source link

Adding preserve_empty_lines option #103

Closed jncasey closed 2 years ago

jncasey commented 2 years ago

Adding the feature I requested in #95.

Not stripping out the empty lines was causing problems with the festival backend and preserve_punctuation, so I took the approach of stripping out the empty lines pre-phonemization and then reinserting them afterward.

I want to flag that I changed the conversion of the input text to a list from a generator to list comprehension here since I needed to run through the list a second time to preserve the empty lines, and I thought this made the code easier to read than making another generator. I'm assuming that this won't make a big difference on performance given how I think phonemizer is generally used.

jncasey commented 2 years ago

This is going to need some more work.

I guess all of my testing was with njobs=1. With parallel jobs, it seems the flattener doesn't the empty lines (at least with espeak). I'll look into fixing it.

jncasey commented 2 years ago

It turns out this was an existing bug in the library. With njobs>1 and an empty input (or one that gets preprocessed to empty), _flatten in the espeak backend would throw an IndexError.

Even if you decide not to merge this new feature, it might still be a good idea to add this fix (or equivalent)

codecov[bot] commented 2 years ago

Codecov Report

Merging #103 (1447a85) into master (8cbdb5a) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #103   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1139      1152   +13     
=========================================
+ Hits          1139      1152   +13     
Impacted Files Coverage Δ
phonemizer/__init__.py 100.00% <100.00%> (ø)
phonemizer/backend/espeak/espeak.py 100.00% <100.00%> (ø)
phonemizer/main.py 100.00% <100.00%> (ø)
phonemizer/phonemize.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8cbdb5a...1447a85. Read the comment docs.

hadware commented 2 years ago

Alright, this is almost ready for merge. Could you try and add a test or two in order to make codecov happy again, and then we're go :)

PS: don't forget to pull, i've merged the master onto your branch to fix some conflicts.

jncasey commented 2 years ago

Okay, great! I'll have a look at this soon.

jncasey commented 2 years ago

These extra tests should hit those missed lines.

[Sorry, I don't have access to the festival backend on my local computer at the moment, but I can't imagine any reason why they wouldn't pass.]

hadware commented 2 years ago

Alright, we're go!

I'll merge that and (if I have the right permissions) upload it to pipy tomorrow! Great work, thanks for that @jncasey !

hadware commented 2 years ago

BTW: you might want to close the issues related to this PR.