NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

English text normalization MoneyFst conflict with SerialFst and small weight does not take effect #126

Closed trunglebka closed 7 months ago

trunglebka commented 7 months ago

Rule conflicting between MoneyFst and SerialFst tagger

Steps/Code to reproduce bug

Command:

python nemo_text_processing/text_normalization/normalize.py --verbose --text 'Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is $5, each bottle of peanut butter is $3'

Output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is dollar three

Expected behavior

Expected output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is three dollar


Additional information: I found a conflict between the MoneyFst and SerialFst taggers. Both taggers return the same weight (2404.29785, computed with `pynini.shortestdistance(tagged_lattice, delta=10**-8)[-1]`) at https://github.com/NVIDIA/NeMo-text-processing/blob/5dd753a8807b3b3bd9aea954776b71bd73fdb870/nemo_text_processing/text_normalization/normalize.py#L337, due to this code: https://github.com/NVIDIA/NeMo-text-processing/blob/5dd753a8807b3b3bd9aea954776b71bd73fdb870/nemo_text_processing/text_normalization/en/taggers/tokenize_and_classify.py#L163-L176
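(For readers less familiar with what that call computes: `shortestdistance` returns, for each state of the tagged lattice, the minimum cumulative arc weight from the start state in the tropical semiring, and indexing `[-1]` takes the final state. A minimal pure-Python sketch — not NeMo or pynini code, the graph below is hypothetical — of the same idea:)

```python
# Illustrative sketch of a tropical-semiring shortest-distance computation,
# i.e. the minimum cumulative arc weight from the start state to every state.
# This is NOT NeMo/pynini code; the tiny graph is made up for illustration.
import heapq

def shortest_distance(num_states, arcs, start=0):
    """arcs: list of (src, dst, weight). Returns min cost from start to each state."""
    adj = [[] for _ in range(num_states)]
    for src, dst, w in arcs:
        adj[src].append((dst, w))
    dist = [float("inf")] * num_states
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist[s]:
            continue  # stale heap entry
        for dst, w in adj[s]:
            if d + w < dist[dst]:
                dist[dst] = d + w
                heapq.heappush(heap, (dist[dst], dst))
    return dist

# Two competing paths from state 0 to state 2, loosely analogous to a money
# tagging vs. a serial tagging of the same span:
arcs = [(0, 1, 1.5), (1, 2, 0.5), (0, 2, 2.25)]
print(shortest_distance(3, arcs)[-1])  # 2.0 — the cheaper two-arc path wins
```

When two complete taggings end up with exactly equal cumulative weight, as in this issue, the "shortest" path is effectively a tie and the chosen tagging depends on implementation details rather than the intended weighting.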

I think serial_graph's weight should be higher than money_graph's, but it is not. To confirm, I disabled MoneyFst so that the weight comes from SerialFst (and changed its olabel to make sure the best path goes through SerialFst). For this text, here are the shortest distances against the corresponding SerialFst weights set in ClassifyFst.classify:

serial weight   shortest distance
1.1000          2404.29785
1.1001          2404.29785
1.1002          2404.2981
1.1003          2404.29858
1.1004          2404.29858
1.1005          2404.29883
1.1006          2404.29883
1.1007          2404.29907
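(A plausible explanation for the repeated distances, assuming OpenFst stores tropical weights as IEEE-754 single-precision floats: at a magnitude of ~2404, adjacent float32 values are 2^-12 ≈ 0.000244 apart, so a 0.0001 bump in a single arc weight can be absorbed entirely by rounding when it is added into the accumulated sum. A stdlib-only sketch:)

```python
# Sketch of float32 quantization near 2404, the distance reported above.
# Assumption: OpenFst tropical weights are IEEE-754 singles, so sums are
# rounded to a grid with spacing 2**-12 ~= 0.000244 at this magnitude.
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

base = 2404.29785               # distance reported in the issue
assert f32(base + 0.0001) == f32(base)  # below half an ulp: change is invisible
assert f32(base + 0.0002) != f32(base)  # crosses a grid point: change is visible
```

This would explain why consecutive serial weights (1.1000 and 1.1001, 1.1003 and 1.1004, ...) map to identical distances, and why the distances in the table advance in steps of roughly 0.00024.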

English is not my native language, so please forgive me if there is any ambiguity.

ekmb commented 7 months ago

@anand-nv, @tbartley94

anand-nv commented 7 months ago

https://github.com/NVIDIA/NeMo-text-processing/pull/128

trunglebka commented 7 months ago

Increasing the weight of serial_graph fixes this error, but why do different weights produce the same shortest distance? Can you help me with that question? Even more interestingly, if one or two 'unrelated' leading sentences are removed, the output is correct.

tbartley94 commented 7 months ago

@trunglebka

What do you mean by same distance? Same cumulative weighting or same number of arcs to traverse?

Yeah, FSTs can be a bit idiosyncratic over the really long distances. The Pynini backend uses a few quick heuristics when choosing an optimal path so some edge cases will pop up here and there. Weighting is really just a fine-tuning mechanism to limit the space of results.

trunglebka commented 7 months ago

@tbartley94

What do you mean by same distance

I mean cumulative weighting (in the context of shortestdistance)

FSTs can be a bit idiosyncratic

Thanks for the information. I tried `fstshortestdistance [--reverse] <tag-lattice-from-nemo.fst>` and it gives different cumulative weights too: 2404.29785 vs 2404.29761 (--reverse) ☠️
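(A forward/reverse discrepancy of this size is consistent with float rounding rather than a different path being chosen: the two directions accumulate the same arc weights in opposite orders, and floating-point addition is not associative. A minimal illustration, again assuming float32 tropical weights; the specific numbers are made up:)

```python
# Sketch: summing the same arc weights in a different order can give a
# different float32 total, without changing which path is shortest.
# The weights below are hypothetical, chosen near the magnitude in this issue.
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Three arc weights: 2404.0, 0.0001, 0.0001, accumulated in opposite orders.
forward = f32(f32(2404.0 + 0.0001) + 0.0001)  # tiny weights absorbed one at a time
reverse = f32(2404.0 + f32(0.0001 + 0.0001))  # tiny weights combine first, then survive
assert forward != reverse
```

So a difference in the last few significant digits between the forward and reverse totals is expected noise, not necessarily a bug.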

tbartley94 commented 7 months ago

Thanks for the information. I tried `fstshortestdistance [--reverse] <tag-lattice-from-nemo.fst>` and it gives different cumulative weights too: 2404.29785 vs 2404.29761 (--reverse) ☠️

So the shortest distance call isn't necessarily as straightforward as you'd imagine. We basically wrap a composite of some of Pynini's rewrite module and projections of the lattice. (It's partially because it eases readability in some places, partially to allow non-deterministic outputs for downstream ASR.) Different flags will mess with this composition and change some of the weights. Since the weights are more of a heuristic, this doesn't affect deterministic outputs, but it will make the actual quantities vary a tad. Unless you're trying to set up probabilistic weighting (e.g. LM probabilities), it's not going to be 'that' important. For troubleshooting, you're better off just experimenting with the final weights in tokenize_and_classify.

If you're really interested, I'd consult the Pynini documentation (https://www.openfst.org/twiki/bin/view/GRM/Pynini) or look through the internal Python documentation for rewrite. (You're still free to ping us with questions; it's just that the documentation is a pretty thorough starting place for more of the nuts and bolts.)