NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

Swedish TN #12

Closed jimregan closed 1 year ago

jimregan commented 1 year ago

Signed-off-by: Jim O'Regan <joregan@kth.se>

What does this PR do?

Text normalization (TN) for Swedish, for Språkbanken Tal.

I forgot to sign off a couple of commits, and my attempt to fix that made things worse by touching commits that shouldn't have been touched, so this is a fresh start.

ekmb commented 1 year ago

Hi @jimregan, thank you for your contribution! Is the PR ready for review?

jimregan commented 1 year ago

I hope so. I've just built the docker file to test it with sparrowhawk, which was probably the main outstanding item.

jimregan commented 1 year ago

Just for clarification:

My aim with this pull request is to get passable TN working. My use case is the audio-based TN, so I won't know what still needs to be done until I've passed my corpus through it, but I expect the remaining work to be mostly limited to the non-deterministic cases. (Time in particular is missing the colloquial way of telling the time, but I have better documentation for Hungarian, so I'll tackle that first.)

I have some pieces for ITN, but completing that will probably be on my own time.

jimregan commented 1 year ago

Ok, so mostly ready, aside from three errors in Sparrowhawk:

ASSERT:2:a januari, 2022 expected:<andra januari tjugotjugotvå> but was:<>
ASSERT:2:a januari, 2022 f.Kr. expected:<andra januari tjugotjugotvå före Kristus> but was:<>
ASSERT:2022 f.Kr. expected:<tjugotjugotvå före Kristus> but was:<>

I can't reproduce these in Python:

>>> from nemo_text_processing.text_normalization.sv.taggers.cardinal import CardinalFst
>>> from nemo_text_processing.text_normalization.sv.taggers.ordinal import OrdinalFst
>>> from nemo_text_processing.text_normalization.sv.taggers.date import DateFst
>>> from nemo_text_processing.text_normalization.sv.verbalizers.date import DateFst as vDate
>>> import pynini
>>> # Build the taggers; the name `ordinal` avoids shadowing the builtin `ord`
>>> card = CardinalFst()
>>> ordinal = OrdinalFst(card, True)
>>> date = DateFst(card, ordinal, True)
>>> vdate = vDate()
>>> # Compose tagger and verbalizer into a single transducer
>>> tvdate = pynini.compose(date.fst, vdate.fst)
>>> ("2:a januari, 2022" @ tvdate).string()
'andra januari tjugotjugotvå'
>>> ("2:a januari, 2022 f.Kr." @ tvdate).string()
'andra januari tjugotjugotvå före Kristus'
>>> ("2022 f.Kr." @ tvdate).string()
'tjugotjugotvå före Kristus'
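
The three Sparrowhawk failures can be lined up against the Python session above with a short script. This is only a sketch: it assumes the `ASSERT:` lines follow the pattern `ASSERT:<input> expected:<...> but was:<...>` seen in the log, and the names `Failure`, `parse_assert`, `SPARROWHAWK_LOG` and `PYTHON_OUTPUT` are hypothetical helpers, not part of NeMo or Sparrowhawk. All strings are copied from this thread.

```python
import re
from typing import NamedTuple

# Assumed layout of one failure line, inferred from the log in this thread.
ASSERT_RE = re.compile(
    r"^ASSERT:(?P<text>.*) expected:<(?P<expected>.*)> but was:<(?P<actual>.*)>$"
)


class Failure(NamedTuple):
    text: str      # written input
    expected: str  # expected spoken form
    actual: str    # what Sparrowhawk returned


def parse_assert(line: str) -> Failure:
    """Split one ASSERT line into input, expected and actual output."""
    match = ASSERT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an ASSERT line: {line!r}")
    return Failure(match["text"], match["expected"], match["actual"])


SPARROWHAWK_LOG = [
    "ASSERT:2:a januari, 2022 expected:<andra januari tjugotjugotvå> but was:<>",
    "ASSERT:2:a januari, 2022 f.Kr. expected:<andra januari tjugotjugotvå före Kristus> but was:<>",
    "ASSERT:2022 f.Kr. expected:<tjugotjugotvå före Kristus> but was:<>",
]

# Results quoted from the Python session, keyed by written input.
PYTHON_OUTPUT = {
    "2:a januari, 2022": "andra januari tjugotjugotvå",
    "2:a januari, 2022 f.Kr.": "andra januari tjugotjugotvå före Kristus",
    "2022 f.Kr.": "tjugotjugotvå före Kristus",
}

failures = [parse_assert(line) for line in SPARROWHAWK_LOG]
for failure in failures:
    python_ok = PYTHON_OUTPUT.get(failure.text) == failure.expected
    print(f"{failure.text!r}: python matches expected={python_ok}, "
          f"sparrowhawk output empty={failure.actual == ''}")
```

For all three inputs the Python output matches the expected string, while Sparrowhawk produced an empty result.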

jimregan commented 1 year ago

Ok, ready for review now

jimregan commented 1 year ago

@ekmb do you want the PYNINI_AVAILABLE stuff removed from the tests here, as in #17 ?

ekmb commented 1 year ago

Yes, please.

ekmb commented 1 year ago

Some tests are failing for me:

============================================================== short test summary info ==============================================================
FAILED sv/test_cardinal.py::TestCardinal::test_norm_27_147451 - AssertionError: assert 'hundrafyrtiosjutusen fyrahundrafemtioett' in {'ett hundra fyrti sju ett tusen fyrahundra femtiett', 'ett hundra fyrti sj...
FAILED sv/test_cardinal.py::TestCardinal::test_norm_28_1056173 - AssertionError: assert 'miljon femtiosextusen hundrasjuttiotre' in {'en miljon femti sex ett tusen etthundra sjutti tre', 'en miljon femti sex e...
FAILED sv/test_cardinal.py::TestCardinal::test_norm_30_1593072961 - AssertionError: assert 'miljard femhundranittiotre miljoner sjuttiotvåtusen niohundrasextioett' in {'en miljard fem hundra nitti tre miljoner sj...
FAILED sv/test_cardinal.py::TestCardinal::test_norm_31_100593072961 - AssertionError: assert 'hundra miljarder femhundranittiotre miljoner sjuttiotvåtusen niohundrasextioett' in {'ett hundra miljarder fem hundra ni...
FAILED sv/test_cardinal.py::TestCardinal::test_norm_32_950593072961 - AssertionError: assert 'niohundrafemtio miljarder femhundranittiotre miljoner sjuttiotvåtusen niohundrasextioett' in {'nio hundra femti miljarde...
FAILED sv/test_money.py::TestMoney::test_norm_33__18_925_000 - AssertionError: assert 'arton miljoner niohundratjugofemtusen dollar' in {'aderton miljoner nio hundra tjugi fem ett tusen dollar', 'aderton mil...
FAILED sv/test_telephone.py::TestTelephone::test_norm_2_tfn_08_789_52_25 - AssertionError: assert 'telefon noll åtta, sjuhundraåttionio femtiotvå tjugofem' in {'telefon noll åtta, sju hundraåttinio fem två två fem', 'te...
FAILED sv/test_telephone.py::TestTelephone::test_norm_4__46_8_790_7559 - AssertionError: assert 'plus fyrtiosex åtta, sjuhundranittio sjuttiofem femtionio' in {'plus fyrti sex åtta, sjuhundranitti sjuttiofem fem nio',...
FAILED sv/test_telephone.py::TestTelephone::test_norm_5_IP_adressen_r_193_1_89_60 - AssertionError: assert 'IP-adressen är hundranittiotre punkt ett punkt åttionio punkt sextio' in {'IP-adressen är ett hundranittio tre punkt ett...
FAILED sv/test_telephone.py::TestTelephone::test_norm_6_08_790_7559_ankn_32 - AssertionError: assert 'noll åtta, sjuhundranittio sjuttiofem femtionio anknytning trettiotvå' in {'noll åtta, sju hundra nitti sjuttiofem femti...
========================================================== 10 failed, 238 passed in 17.56s ==========================================================

jimregan commented 1 year ago

This is really odd, because (at least with the cardinals) they work in deterministic mode.
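
The failure messages have the form `assert <expected> in {...}`, i.e. the expected string is checked for membership in a set of candidate verbalisations. Below is a minimal sketch of how such a miss might be diagnosed. It uses only the one candidate for `test_norm_27_147451` that is fully visible in the truncated log, so the real set is larger; `squash` and `diagnose` are hypothetical helper names, not NeMo functions.

```python
import difflib


def squash(text: str) -> str:
    """Remove all whitespace so that compound-word spacing is ignored."""
    return "".join(text.split())


def diagnose(expected: str, candidates: set) -> dict:
    """Report whether `expected` is in `candidates`, and if not, how close it is."""
    exact = expected in candidates
    # True when some candidate differs from `expected` only in whitespace.
    spacing_only = (not exact) and any(
        squash(candidate) == squash(expected) for candidate in candidates
    )
    # cutoff=0.0 always returns the most similar candidate, however distant.
    closest = difflib.get_close_matches(expected, sorted(candidates), n=1, cutoff=0.0)
    return {
        "exact": exact,
        "spacing_only": spacing_only,
        "closest": closest[0] if closest else None,
    }


# Expected string and the single fully visible candidate, copied from the log.
expected = "hundrafyrtiosjutusen fyrahundrafemtioett"
visible_candidates = {"ett hundra fyrti sju ett tusen fyrahundra femtiett"}

report = diagnose(expected, visible_candidates)
print(report)
```

For this case the miss is not only a matter of spacing: the visible candidate uses forms such as `fyrti`, `ett tusen` and `femtiett` where the test expects `fyrtio`, `tusen` and `femtioett`.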

jimregan commented 1 year ago

I'm removing the audio-based normalisation for now, until I have time to look at it. Regular normalisation should work fine.

jimregan commented 1 year ago

Thank you @jimregan!! Great work!

Thanks!