EveryVoiceTTS / EveryVoice

The EveryVoice TTS Toolkit - Text To Speech for your language
https://docs.everyvoice.ca

fix: add whitespace collapsing and text stripping by default #518

Closed roedoejet closed 2 months ago

roedoejet commented 2 months ago

PR Goal?

The wizard applied only the cleaners the user chose, so whitespace collapsing was skipped unless it was explicitly selected. This PR makes whitespace collapsing and text stripping apply by default.
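For readers unfamiliar with the two cleaning steps, here is a minimal sketch of what "collapse whitespace, then strip" typically looks like. The helper names (`collapse_whitespace`, `strip_text`, `apply_default_cleaners`) are assumptions for illustration and may not match the functions this PR actually touches.

```python
import re

# Hypothetical helpers for illustration; not taken from the EveryVoice code base.
_WHITESPACE_RE = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines, NBSP) with one space."""
    return _WHITESPACE_RE.sub(" ", text)


def strip_text(text: str) -> str:
    """Remove leading and trailing whitespace."""
    return text.strip()


def apply_default_cleaners(text: str) -> str:
    """Apply both default steps: collapse first, then strip."""
    return strip_text(collapse_whitespace(text))


cleaned = apply_default_cleaners("  hello \t\n world\u00a0 ")
print(repr(cleaned))  # 'hello world'
```

Collapsing before stripping means any mix of whitespace at the edges is reduced to a single space first, which `strip()` then removes.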

Fixes?

Feedback sought?

sanity, code check

Priority?

medium

Tests added?

How to test?

Confidence?

medium-high

Version change?

no

Related PRs?

#515 and #516

semanticdiff-com[bot] commented 2 months ago

Review changes with SemanticDiff.

Analyzed 6 of 7 files.

Overall, the semantic diff is 6% smaller than the GitHub diff.

| Filename | Status |
| --- | --- |
| :heavy_check_mark: everyvoice/wizard/dataset.py | 2.26% smaller |
| :heavy_check_mark: everyvoice/utils/__init__.py | 0.0% smaller |
| :heavy_check_mark: everyvoice/tests/test_text.py | Analyzed |
| :heavy_check_mark: everyvoice/tests/test_wizard.py | 2.92% smaller |
| :grey_question: everyvoice/tests/data/unit-test-case1.psv | Unsupported file format |
| :heavy_check_mark: everyvoice/model/e2e/config/__init__.py | 12.5% smaller |
| :heavy_check_mark: everyvoice/config/text_config.py | 92.33% smaller |

github-actions[bot] commented 2 months ago

CLI load time: 0:00.23
Pull Request HEAD: 30fde0c53efb793567ab7b5af7d6f51b1e46c976
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 74.54%. Comparing base (9004aad) to head (73a03ec).

Additional details and impacted files

```diff
@@                  Coverage Diff                  @@
##           dev.ap/text-fixes     #518      +/-   ##
=====================================================
+ Coverage              74.50%   74.54%   +0.04%
=====================================================
  Files                     45       45
  Lines                   3016     3021       +5
  Branches                 485      487       +2
=====================================================
+ Hits                    2247     2252       +5
  Misses                   676      676
  Partials                  93       93
```


roedoejet commented 2 months ago

> Is it expected behaviour that, if no text normalization is applied, the character list retains duplicate empty symbols? For example, creating a project from /sgile/data/MohawkCorpus/am_corpus without text normalization results in
>
> moh_characters: ['', '', (, ), '0', '1', '3', '8', a, c, d, e, g, h, i, k, l, m,
>     n, o, p, r, s, t, u, v, w, x, y, z, '', à, á, è, é, ì, í, ò, ó]
>
> where you can find three instances of ''.
>
> I also noticed that changes from #515 are highlighted as changes of this PR as well.

Great find, @wiitt. No, this is not intended behaviour. An unnecessary `if response:` condition was preventing whitespace collapsing from being applied. I've fixed this and added a test to catch it in the future.
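The shape of this kind of bug can be sketched as follows. This is an illustration under stated assumptions, not the actual wizard code: the function names are hypothetical, the guard is modelled as a truthiness check on the list of chosen cleaners, and the blank entries in the character list are assumed to be distinct whitespace characters (the thread does not say which characters they are).

```python
import re
from typing import Callable, Iterable, List

Cleaner = Callable[[str], str]


def collapse_and_strip(text: str) -> str:
    """Default cleaning step: collapse whitespace runs, then strip the ends."""
    return re.sub(r"\s+", " ", text).strip()


def cleaners_with_guard(chosen: List[Cleaner]) -> List[Cleaner]:
    """Buggy shape (hypothetical): defaults are added only if something was chosen."""
    cleaners: List[Cleaner] = []
    if chosen:  # an empty selection is falsy, so the defaults are skipped too
        cleaners = [collapse_and_strip, *chosen]
    return cleaners


def cleaners_without_guard(chosen: List[Cleaner]) -> List[Cleaner]:
    """Fixed shape (hypothetical): defaults always run, whatever was chosen."""
    return [collapse_and_strip, *chosen]


def character_set(lines: Iterable[str], cleaners: List[Cleaner]) -> List[str]:
    """Collect the sorted set of symbols left after cleaning each line."""
    symbols = set()
    for line in lines:
        for cleaner in cleaners:
            line = cleaner(line)
        symbols.update(line)
    return sorted(symbols)


lines = ["a\tb ", "a\u00a0b", " a b"]  # tab, no-break space, plain space
buggy = character_set(lines, cleaners_with_guard([]))
fixed = character_set(lines, cleaners_without_guard([]))
print(buggy)  # tab, space, and NBSP all survive as separate symbols
print(fixed)  # a single space plus the letters
```

With no cleaners selected, the guarded version leaves three different whitespace characters in the symbol set, which would all look blank when printed. The unguarded version reduces them to a single space.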