NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

English text normalization MoneyFst conflict with SerialFst and small weight does not take effect #126

Closed trunglebka closed 7 months ago

trunglebka commented 7 months ago

Rule conflicting between MoneyFst and SerialFst tagger

Steps/Code to reproduce bug

Command:

python nemo_text_processing/text_normalization/normalize.py --verbose --text 'Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is $5, each bottle of peanut butter is $3'

Output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is dollar three

Expected behavior

Expected output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is three dollar


Additional information: I found a conflict between the MoneyFst and SerialFst taggers. Both taggers return the same weight (2404.29785, computed with `pynini.shortestdistance(tagged_lattice, delta=10**-8)[-1]`) at https://github.com/NVIDIA/NeMo-text-processing/blob/5dd753a8807b3b3bd9aea954776b71bd73fdb870/nemo_text_processing/text_normalization/normalize.py#L337, due to this code: https://github.com/NVIDIA/NeMo-text-processing/blob/5dd753a8807b3b3bd9aea954776b71bd73fdb870/nemo_text_processing/text_normalization/en/taggers/tokenize_and_classify.py#L163-L176
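(For readers less familiar with what that call computes: `shortestdistance` returns, for each state of the tagged lattice, the minimum cumulative arc weight from the start state in the tropical semiring, and indexing `[-1]` takes the final state. A minimal pure-Python sketch — not NeMo or pynini code, the graph below is hypothetical — of the same idea:)

```python
# Illustrative sketch of a tropical-semiring shortest-distance computation,
# i.e. the minimum cumulative arc weight from the start state to every state.
# This is NOT NeMo/pynini code; the tiny graph is made up for illustration.
import heapq

def shortest_distance(num_states, arcs, start=0):
    """arcs: list of (src, dst, weight). Returns min cost from start to each state."""
    adj = [[] for _ in range(num_states)]
    for src, dst, w in arcs:
        adj[src].append((dst, w))
    dist = [float("inf")] * num_states
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist[s]:
            continue  # stale heap entry
        for dst, w in adj[s]:
            if d + w < dist[dst]:
                dist[dst] = d + w
                heapq.heappush(heap, (dist[dst], dst))
    return dist

# Two competing paths from state 0 to state 2, loosely analogous to a money
# tagging vs. a serial tagging of the same span:
arcs = [(0, 1, 1.5), (1, 2, 0.5), (0, 2, 2.25)]
print(shortest_distance(3, arcs)[-1])  # 2.0 — the cheaper two-arc path wins
```

When two complete taggings end up with exactly equal cumulative weight, as in this issue, the "shortest" path is effectively a tie and the chosen tagging depends on implementation details rather than the intended weighting.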

I think serial_graph's weight should be higher than money_graph's, but it is not. To confirm, I disabled MoneyFst so that the weight comes from SerialFst (and changed its olabel to make sure the best path goes through SerialFst). For this text, here are the shortest distances against the corresponding SerialFst weights set in ClassifyFst.classify:

serial weight   shortest distance
1.1000          2404.29785
1.1001          2404.29785
1.1002          2404.2981
1.1003          2404.29858
1.1004          2404.29858
1.1005          2404.29883
1.1006          2404.29883
1.1007          2404.29907
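(A plausible explanation for the repeated distances, assuming OpenFst stores tropical weights as IEEE-754 single-precision floats: at a magnitude of ~2404, adjacent float32 values are 2^-12 ≈ 0.000244 apart, so a 0.0001 bump in a single arc weight can be absorbed entirely by rounding when it is added into the accumulated sum. A stdlib-only sketch:)

```python
# Sketch of float32 quantization near 2404, the distance reported above.
# Assumption: OpenFst tropical weights are IEEE-754 singles, so sums are
# rounded to a grid with spacing 2**-12 ~= 0.000244 at this magnitude.
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

base = 2404.29785               # distance reported in the issue
assert f32(base + 0.0001) == f32(base)  # below half an ulp: change is invisible
assert f32(base + 0.0002) != f32(base)  # crosses a grid point: change is visible
```

This would explain why consecutive serial weights (1.1000 and 1.1001, 1.1003 and 1.1004, ...) map to identical distances, and why the distances in the table advance in steps of roughly 0.00024.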

English is not my native language, so please forgive me if there is any ambiguity.

ekmb commented 7 months ago

@anand-nv, @tbartley94

anand-nv commented 7 months ago

https://github.com/NVIDIA/NeMo-text-processing/pull/128

trunglebka commented 7 months ago

Increasing the weight of serial_graph fixes this error, but why do different weights produce the same shortest distance? Can you help me with that question? Even more interestingly, if one or two 'unrelated' leading sentences are removed, the output is correct.

tbartley94 commented 7 months ago

@trunglebka

What do you mean by same distance? Same cumulative weighting or same number of arcs to traverse?

Yeah, FSTs can be a bit idiosyncratic over the really long distances. The Pynini backend uses a few quick heuristics when choosing an optimal path so some edge cases will pop up here and there. Weighting is really just a fine-tuning mechanism to limit the space of results.

trunglebka commented 7 months ago

@tbartley94

What do you mean by same distance

I mean cumulative weighting (in the context of shortestdistance)

FSTs can be a bit idiosyncratic

Thanks for the information. I tried `fstshortestdistance [--reverse] <tag-lattice-from-nemo.fst>` and it gives different cumulative weights too: 2404.29785 vs 2404.29761 (--reverse) ☠️
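(A forward/reverse discrepancy of this size is consistent with float rounding rather than a different path being chosen: the two directions accumulate the same arc weights in opposite orders, and floating-point addition is not associative. A minimal illustration, again assuming float32 tropical weights; the specific numbers are made up:)

```python
# Sketch: summing the same arc weights in a different order can give a
# different float32 total, without changing which path is shortest.
# The weights below are hypothetical, chosen near the magnitude in this issue.
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Three arc weights: 2404.0, 0.0001, 0.0001, accumulated in opposite orders.
forward = f32(f32(2404.0 + 0.0001) + 0.0001)  # tiny weights absorbed one at a time
reverse = f32(2404.0 + f32(0.0001 + 0.0001))  # tiny weights combine first, then survive
assert forward != reverse
```

So a difference in the last few significant digits between the forward and reverse totals is expected noise, not necessarily a bug.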

tbartley94 commented 7 months ago

Thanks for the information. I tried `fstshortestdistance [--reverse] <tag-lattice-from-nemo.fst>` and it gives different cumulative weights too: 2404.29785 vs 2404.29761 (--reverse) ☠️

So the shortest distance call isn't necessarily as straightforward as you'd imagine. We basically wrap a composite of some of Pynini's rewrite module and projections of the lattice. (It's partially because it eases readability in some places, partially to allow non-deterministic outputs for downstream ASR.) Different flags will mess with this composition and change some of the weights. Since the weights are more of a heuristic, this doesn't affect deterministic outputs, but it will make the actual quantities vary a tad. Unless you're trying to set up probabilistic weighting (e.g. LM probabilities), it's not going to be 'that' important. For troubleshooting, you're better off just experimenting with the final weights in tokenize_and_classify.

If you're really interested, I'd consult the Pynini documentation (https://www.openfst.org/twiki/bin/view/GRM/Pynini) or look through the internal Python documentation for rewrite. (You're still free to ping us with questions; it's just that the documentation is a pretty thorough starting place for more of the nuts and bolts.)