EveryVoiceTTS / EveryVoice

The EveryVoice TTS Toolkit - Text To Speech for your language

https://docs.everyvoice.ca

fix: add whitespace collapsing and text stripping by default #518

Closed roedoejet closed 2 months ago

roedoejet commented 2 months ago

PR Goal?

The wizard applied only the chosen cleaners and never collapsed whitespace. This PR makes whitespace collapsing and text stripping apply by default.
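As a rough illustration of the intended behaviour, here is a minimal sketch of "collapse and strip by default, regardless of the chosen cleaners". The names `collapse_whitespace` and `default_normalize` are hypothetical and are not taken from the EveryVoice code base; the actual implementation in this PR may be structured differently.

```python
import re

# Hypothetical helpers for illustration only; not the EveryVoice API.
_WHITESPACE_RE = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines, NBSP) with one space."""
    return _WHITESPACE_RE.sub(" ", text)


def default_normalize(text: str, cleaners=()) -> str:
    """Apply the user-chosen cleaners, then always collapse whitespace and strip."""
    for cleaner in cleaners:
        text = cleaner(text)
    return collapse_whitespace(text).strip()


# Collapsing and stripping happen even when no cleaners are chosen.
print(repr(default_normalize("  hello \t\n world  ")))  # 'hello world'
print(repr(default_normalize("  HeLLo   World ", cleaners=[str.lower])))  # 'hello world'
```

The point of the sketch is that the default steps run unconditionally after the optional cleaners, so an empty cleaner list still yields collapsed, stripped text.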

Fixes?

Feedback sought?

sanity, code check

Priority?

medium

Tests added?

✅

How to test?

Confidence?

medium-high

Version change?

Related PRs?

#515 and #516

semanticdiff-com[bot] commented 2 months ago

Review changes with SemanticDiff.

Analyzed 6 of 7 files.

Overall, the semantic diff is 6% smaller than the GitHub diff.

| | Filename | Status |
|---|---|---|
| :heavy_check_mark: | everyvoice/wizard/dataset.py | 2.26% smaller |
| :heavy_check_mark: | everyvoice/utils/__init__.py | 0.0% smaller |
| :heavy_check_mark: | everyvoice/tests/test_text.py | Analyzed |
| :heavy_check_mark: | everyvoice/tests/test_wizard.py | 2.92% smaller |
| :grey_question: | everyvoice/tests/data/unit-test-case1.psv | Unsupported file format |
| :heavy_check_mark: | everyvoice/model/e2e/config/__init__.py | 12.5% smaller |
| :heavy_check_mark: | everyvoice/config/text_config.py | 92.33% smaller |

github-actions[bot] commented 2 months ago

CLI load time: 0:00.23
Pull Request HEAD: 30fde0c53efb793567ab7b5af7d6f51b1e46c976
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 74.54%. Comparing base (9004aad) to head (73a03ec).

Additional details and impacted files

```diff
@@                  Coverage Diff                  @@
##           dev.ap/text-fixes     #518      +/-   ##
=====================================================
+ Coverage              74.50%   74.54%   +0.04%
=====================================================
  Files                     45       45
  Lines                   3016     3021       +5
  Branches                 485      487       +2
=====================================================
+ Hits                    2247     2252       +5
  Misses                   676      676
  Partials                  93       93
```

roedoejet commented 2 months ago

Is it expected behaviour that, when no text normalization is applied, the character list contains duplicate empty symbols? For example, creating a project from /sgile/data/MohawkCorpus/am_corpus without text normalization results in:
moh_characters: ['', '', (, ), '0', '1', '3', '8', a, c, d, e, g, h, i, k, l, m,
    n, o, p, r, s, t, u, v, w, x, y, z, '', à, á, è, é, ì, í, ò, ó]
which contains three instances of ''.

I also noticed that the changes from #515 are shown as changes in this PR as well.

Great find, @wiitt. No, this is not intended behaviour. An unnecessary `if response:` condition was preventing the whitespace collapsing from being applied. I've fixed this and added a test to catch it in the future.
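One possible reading of the report above, which the thread does not confirm, is that several distinct whitespace characters ended up in the symbol list and all look empty when printed. The sketch below shows how that can happen when a character inventory is built from text that was never collapsed. The helper `character_inventory` and the sample lines are hypothetical and are not taken from EveryVoice or the Mohawk corpus.

```python
import re


def character_inventory(lines, collapse=False):
    """Return the sorted set of characters seen in `lines` (hypothetical helper)."""
    chars = set()
    for line in lines:
        if collapse:
            # Collapse runs of whitespace to one space, then strip the ends.
            line = re.sub(r"\s+", " ", line).strip()
        chars.update(line)
    return sorted(chars)


# Made-up sample containing a space, a tab, a no-break space and a newline.
sample = ["ó a\tb", "a\u00a0b\n"]

raw = character_inventory(sample)
clean = character_inventory(sample, collapse=True)

# Without collapsing, each whitespace variant becomes its own symbol.
print([c for c in raw if c.isspace()])
# With collapsing, only the plain space remains.
print([c for c in clean if c.isspace()])
```

A test along these lines, asserting that the inventory holds at most one whitespace symbol, would flag the regression if the collapsing step were skipped again.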