Closed trunglebka closed 7 months ago
@anand-nv, @tbartley94
Increasing the weight of serial_graph fixes this error, but why does it produce the same distance with different weights? Can you help me with that question? Even more interestingly, if one or two 'unrelated' sentences at the beginning are removed, the output is correct.
@trunglebka
What do you mean by same distance? Same cumulative weighting or same number of arcs to traverse?
Yeah, FSTs can be a bit idiosyncratic over the really long distances. The Pynini backend uses a few quick heuristics when choosing an optimal path so some edge cases will pop up here and there. Weighting is really just a fine-tuning mechanism to limit the space of results.
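The tie-breaking role of weights described above can be sketched in plain Python. This is only an illustration of the tropical-semiring idea (path weight is the sum of its arc weights, and the best path is the one with the minimum total). It is not NeMo or Pynini code, and the labels and numbers below are hypothetical.

```python
def best_paths(candidates):
    """Return (minimum total weight, labels of every path that reaches it).

    `candidates` maps a hypothetical tagger label to the arc weights
    along its path. Tropical semiring: add along a path, take the min
    across competing paths.
    """
    totals = {label: sum(arcs) for label, arcs in candidates.items()}
    best = min(totals.values())
    return best, sorted(label for label, total in totals.items() if total == best)


# Hypothetical weights: both taggers cost the same, so the winner is
# decided by whatever tie-breaking the shortest-path search happens to use.
tied = {"money": [1.1, 100.0], "serial": [1.1, 100.0]}

# Nudging one tagger's weight upward removes the ambiguity.
biased = {"money": [1.1, 100.0], "serial": [1.1001, 100.0]}

print(best_paths(tied)[1])    # both labels are optimal
print(best_paths(biased)[1])  # only one label is optimal
```

When two taggers tie, which one wins can depend on details unrelated to the token itself, which would be consistent with the output changing when earlier sentences are removed.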
@tbartley94
> What do you mean by same distance

I mean cumulative weighting (in the context of shortestdistance)

> FSTs can be a bit idiosyncratic
Thanks for the information. I tried
fstshortestdistance [--reverse] <tag-lattice-from-nemo.fst>
and it gives different cumulative weights too: 2404.29785 vs 2404.29761 (with --reverse) ☠️
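One factor that may contribute to a gap of this size, separate from anything NeMo does: OpenFst's standard tropical weight is a single-precision float, and the two values reported above are adjacent float32 numbers. The sketch below uses only the Python standard library to check that. It also shows, with hypothetical arc weights, that adding the same weights in opposite order can land on neighbouring float32 values. It is an assumption that summation order is what differs between the forward and reverse runs here.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ulp32(x):
    """Gap between f32(x) and the next larger float32 (x positive, finite)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0] - f32(x)


def sum_f32(weights):
    """Accumulate left to right, rounding to float32 after every addition."""
    acc = 0.0
    for w in weights:
        acc = f32(acc + f32(w))
    return acc


forward_total = f32(2404.29785)
reverse_total = f32(2404.29761)
print(ulp32(forward_total))            # float32 spacing near 2404
print(forward_total - reverse_total)   # same size: the totals differ by one step

# Hypothetical arc weights, summed in both directions.
arcs = [2404.0, 0.0001, 0.0001, 0.0001]
print(sum_f32(arcs), sum_f32(reversed(arcs)))
```

Near 2404 the float32 spacing is 2**-12 (about 0.000244), so a difference in the fourth decimal place is the smallest difference that can be represented at all.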
So the shortest distance call isn't necessarily as straightforward as you'd imagine. We basically do a wrapper around a composite of some of Pynini's rewrite module and projections of the lattice. (It's partially because it eases readability in some places, partially to allow non-deterministic outputs for downstream ASR.) Different flags will mess with this composition and change some of the weights. Since the weights are more of a heuristic, it doesn't mess with deterministic outputs but will make the actual quantities vary a tad. Unless you're trying to set up probabilistic weighting (e.g. LM probabilities), it's not going to be 'that' important. For troubleshooting you're better off just messing around with the final weights in tokenize_and_classify.
If really interested, I'd consult the Pynini documentation (https://www.openfst.org/twiki/bin/view/GRM/Pynini) or look through the internal Python documentation for rewrite. (You're still free to ping us with questions, just that documentation is a pretty thorough starting place for more of the nuts and bolts.)
Rule conflicting between MoneyFst and SerialFst tagger
Steps/Code to reproduce bug
Command:
Output:
Expected behavior
Expected output:
Environment overview
Environment details
Additional information
I found that there is a conflict between the MoneyFst and SerialFst taggers. Both taggers return the same weight, 2404.29785, computed using pynini.shortestdistance(tagged_lattice, delta=10**-8)[-1]:
https://github.com/NVIDIA/NeMo-text-processing/blob/5dd753a8807b3b3bd9aea954776b71bd73fdb870/nemo_text_processing/text_normalization/normalize.py#L337
Because of this code: https://github.com/NVIDIA/NeMo-text-processing/blob/5dd753a8807b3b3bd9aea954776b71bd73fdb870/nemo_text_processing/text_normalization/en/taggers/tokenize_and_classify.py#L163-L176 I think that serial_graph's weight should be higher than money_graph's, but it is not. So I disabled MoneyFst to get the weight from SerialFst (and changed its olabel to ensure that the weight comes from the best path containing SerialFst) for this text. Here is the weight with the corresponding SerialFst weight in ClassifyFst.classify:
English is not my native language, so please forgive me if there is any ambiguity.