NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0
258 stars 84 forks source link

Hungarian TN ✅ #9

Closed jimregan closed 1 year ago

jimregan commented 1 year ago

What does this PR do ?

Adds TN for Hungarian

Collection: [Note which collection this PR will affect]

Changelog

Usage

# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

PR Type:

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed. Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

yzhang123 commented 1 year ago

@jimregan , thanks for the PR. Do you know when the PR is ready for review? I see you wanted to add some more classes like time, measure and money. Once you are ready I will ask a Hungarian native speaker review the data and test case files

jimregan commented 1 year ago

@jimregan , thanks for the PR. Do you know when the PR is ready for review? I see you wanted to add some more classes like time, measure and money. Once you are ready I will ask a Hungarian native speaker review the data and test case files

Swedish TN was a work project, so I had more time to work on it. I ought to have some time this evening and this weekend -- I'd like to get it finished quite soon, but I can't make any guarantees. If you'd like, I'll see how far I get by Monday, drop out anything that's left unfinished, and test what's left. There's not much left to do, IIRC, but I wanted to get the more informal way of reading times done for future use (ITN/audio-based TN), and to copy for Swedish ITN (which does something very similar).

Hungarian is standardised quite extensively, so for most of the modules I never really had too many questions, but I do have access to native speakers here. One more can't hurt, though.

There are a couple of decisions that are kind of debatable: for one, in the case of money, I set it to read the amount as a decimal if there is a case ending at the end of the currency chunk, rather than try to transfer it to the currency minor word. It wouldn't be too hard to do that, but the colleague I asked read it as a decimal, so that's what I went with.

jimregan commented 1 year ago

'Ready for review' is not 100% accurate, as more tests are needed, and there's an issue with things not quite working the same in sparrowhawk, but insofar as being more-or-less ready to have an extra pair of eyeballs, it's close enough

yzhang123 commented 1 year ago

'Ready for review' is not 100% accurate, as more tests are needed, and there's an issue with things not quite working the same in sparrowhawk, but insofar as being more-or-less ready to have an extra pair of eyeballs, it's close enough

could you elaborate what does not work with sparrowhawk?

jimregan commented 1 year ago

'Ready for review' is not 100% accurate, as more tests are needed, and there's an issue with things not quite working the same in sparrowhawk, but insofar as being more-or-less ready to have an extra pair of eyeballs, it's close enough

could you elaborate what does not work with sparrowhawk?

ASSERT:100 000 000 000 000 000 000 002 expected:<száztrilliárd-kettő> but was:<száztrilliárd------kettő>

I have a piece that normally replaces repeated hyphens with a single hyphen, but this isn't happening with sparrowhawk.

yzhang123 commented 1 year ago

@jimregan we have reviewed the data, and it looks good. Please rerequest review once this PR is ready to be merged. thanks!

jimregan commented 1 year ago

@jimregan we have reviewed the data, and it looks good. Please rerequest review once this PR is ready to be merged. thanks!

There are a couple of small things: the kestrel difference I mentioned earlier, and one small issue with serialisation. I think there might be an easy with the second, but the first has me baffled. If I can find a moment, I'll see if I can do something about it, but I don't expect to have time until the start of march.

yzhang123 commented 1 year ago

@jimregan we have reviewed the data, and it looks good. Please rerequest review once this PR is ready to be merged. thanks!

There are a couple of small things: the kestrel difference I mentioned earlier, and one small issue with serialisation. I think there might be an easy with the second, but the first has me baffled. If I can find a moment, I'll see if I can do something about it, but I don't expect to have time until the start of march.

cool, thanks! no rush.

jimregan commented 1 year ago

Still the same issue with sparrowhawk:

ASSERT:100 000 000 000 000 000 000 002 expected:<száztrilliárd-kettő> but was:<száztrilliárd------kettő>

yzhang123 commented 1 year ago

Still the same issue with sparrowhawk:

ASSERT:100 000 000 000 000 000 000 002 expected:<száztrilliárd-kettő> but was:<száztrilliárd------kettő>

@anand-nv could you please help?

jimregan commented 1 year ago

Still the same issue with sparrowhawk:

ASSERT:100 000 000 000 000 000 000 002 expected:<száztrilliárd-kettő> but was:<száztrilliárd------kettő>

@anand-nv could you please help?

There's more going on here: the tests pass when run individually, but not when the whole directory is run.

E.g., pytest --cpu tests/nemo_text_processing/hu gives:

E AssertionError: assert 'nullatrilliá...armincegyedik' == 'harmincegyedik' E - harmincegyedik E + nullatrilliárd-harmincegyedik

but with pytest --cpu tests.nemo_text_processing.hu.test_ordinal it passes.

jimregan commented 1 year ago

Still the same issue with sparrowhawk:

ASSERT:100 000 000 000 000 000 000 002 expected:<száztrilliárd-kettő> but was:<száztrilliárd------kettő>

@anand-nv could you please help?

This one's fixed, I just have to fix the ones I managed to introduce along the way :)

jimregan commented 1 year ago
self = <tests.nemo_text_processing.hu.test_money.TestMoney object at 0x7f48cd070520>, test_input = '2,32 GBP-t'
expected = 'kettő egész harminckettő század angol fontot'

    @parameterized.expand(parse_test_case_file('hu/data_text_normalization/test_cases_money.txt'))
    @pytest.mark.run_only_on('CPU')
    @pytest.mark.unit
    def test_norm(self, test_input, expected):
        pred = self.normalizer_hu.normalize(test_input, verbose=False)
>       assert pred == expected
E       AssertionError: assert 'nullatrillió... angol fontot' == 'kettő egész ... angol fontot'
E         - kettő egész harminckettő század angol fontot
E         + nullatrillió-kettő egész nullatrillió-harminckettő század angol fontot
E         ? +++++++++++++            +++++++++++++

The weird thing seems to depend on casing:

>>> from nemo_text_processing.text_normalization.normalize import Normalizer
>>> lc_normalizer = Normalizer(input_case='lower_cased', lang='hu', overwrite_cache=False)
>>> normalizer = Normalizer(input_case='cased', lang='hu', overwrite_cache=False)
>>> normalizer.normalize("2,32 GBP-t")
'nullatrillió-kettő egész nullatrillió-harminckettő század angol fontot'
>>> lc_normalizer.normalize("2,32 GBP-t")
'kettő egész harminckettő század angol fontot'
jimregan commented 1 year ago

The weird thing seems to depend on casing:

Rather, the first instance is fine, subsequent instances have something weird happening:

Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nemo_text_processing.text_normalization.normalize import Normalizer
>>> tmp = []
>>> for _ in range(0, 6):
...     tmp.append(Normalizer(input_case='lower_cased', lang='hu'))
... 
>>> for norm in tmp:
...     norm.normalize("2,32 GBP-t")
... 
'kettő egész harminckettő század angol fontot'
'nullatrillió-kettő egész nullatrillió-harminckettő század angol fontot'
'nullatrillió-kettő egész nullatrillió-harminckettő század angol fontot'
'nullatrillió-kettő egész nullatrillió-harminckettő század angol fontot'
'nullatrillió-kettő egész nullatrillió-harminckettő század angol fontot'
'nullatrillió-kettő egész nullatrillió-harminckettő század angol fontot'

@yzhang123 any ideas why this might be happening?

jimregan commented 1 year ago

The weird thing seems to depend on casing:

Rather, the first instance is fine, subsequent instances have something weird happening:

cardinal is being built differently second time around:

>>> tmp[0].tagger.cardinal.fst.num_states()
2253
>>> tmp[1].tagger.cardinal.fst.num_states()
3335

Anything that doesn't depend on cardinal is identical.

jimregan commented 1 year ago

Pytest:

=============================================== 199 passed, 3 warnings in 283.31s (0:04:43) ===============================================

Sparrowhawk:

Step 15/15 : RUN echo "DONE"
 ---> Using cache
 ---> 81110575ab3c
Successfully built 81110575ab3c
Successfully tagged sparrowhawk:latest
-v /home/joregan/NeMo-text-processing/tools/text_processing_deployment/docker/../hu/classify:/workspace/sparrowhawk/documentation/grammars/en_toy/classify -v /home/joregan/NeMo-text-processing/tools/text_processing_deployment/docker/../hu/verbalize:/workspace/sparrowhawk/documentation/grammars/en_toy/verbalize
testTNCardinal
testTNDate
testTNDecimal
testTNElectronic
testTNFraction
testTNMoney
testTNOrdinal
testTNTelephone
testTNTime
testTNMeasure
testTNWhitelist
testTNWord

Ran 12 tests.

OK

@yzhang123 this is done

jimregan commented 1 year ago

Ok, slight issue (telephone tests weren't being run, didn't work with Sparrowhawk) resolved, now it's ready.

jimregan commented 1 year ago

@yzhang123 @ekmb 👋

yzhang123 commented 1 year ago

thanks!

yzhang123 commented 1 year ago

@anand-nv for viz

jimregan commented 1 year ago

@Laszlo-Weber sorry about that, Chrome froze and apparently the mouse cursor landed there in the interim

yzhang123 commented 1 year ago

@jimregan how did you fix the sparrowhawk issues?

jimregan commented 1 year ago

@jimregan how did you fix the sparrowhawk issues?

I think it was this: https://github.com/NVIDIA/NeMo-text-processing/pull/9/commits/71cbfe7004d3f990a15e2b5ea43bc22e1a565ae1 -- I moved the '|=' with '000' and then everything worked, but I'd changed everything I could think of by that stage, so I can't say for certain it was that