Open AMR-KELEG opened 5 years ago
I think functions like atof and strtod should just work with inf as a string. Inf is not the most useful weight though, given that inf+x is inf for all x. I think at least openfst just decides to bounce when it sees an inf arc (considering it a non-arc); hfst also prints +? in xerox mode as the analysis with weight inf, and so on.
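A minimal sketch of both points, assuming a C99/C++11-conforming standard library (where strtod accepts "inf" and "infinity" in any letter case). The helper names parse_weight and extend_path are made up for illustration and are not lttoolbox functions.

```cpp
#include <cmath>
#include <cstdlib>

// Parse a weight with the standard C library. On a conforming
// implementation strtod accepts "inf" / "infinity" with an optional sign.
double parse_weight(const char* text) {
    return std::strtod(text, nullptr);
}

// Extending a path in the tropical semiring is plain addition, so an
// infinite arc weight absorbs any finite weight added to it.
double extend_path(double path_weight, double arc_weight) {
    return path_weight + arc_weight;
}
```

For example, extend_path(parse_weight("inf"), 3.5) is still infinite, which is why such an arc behaves like a non-arc when searching for the lowest-weight path.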
For OOVs it's good enough to have a reasonably high non-inf number; for more advanced implementations one can calculate probability estimates with methods like https://en.wikipedia.org/wiki/Additive_smoothing, https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing and so forth.
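As a rough illustration of the additive-smoothing option, here is a sketch of add-one (Laplace) weights. The SmoothedWeights struct and the choice to reserve exactly one extra slot for all unseen tokens are assumptions made for this example, not something taken from lttoolbox.

```cpp
#include <cmath>
#include <map>
#include <string>

// Add-one (Laplace) smoothed weight: -log((count + 1) / (total + V + 1)),
// where V is the number of distinct seen tokens and the extra 1 reserves
// a single slot for every unseen (OOV) token. This layout is an assumption.
struct SmoothedWeights {
    std::map<std::string, long> counts;
    long total = 0;

    void observe(const std::string& token) {
        ++counts[token];
        ++total;
    }

    double weight(const std::string& token) const {
        auto it = counts.find(token);
        long count = (it == counts.end()) ? 0 : it->second;
        double denom = static_cast<double>(total)
                     + static_cast<double>(counts.size()) + 1.0;
        return -std::log(static_cast<double>(count + 1) / denom);
    }
};
```

An unseen token has count 0, so it gets the smallest smoothed probability and therefore the largest weight, while that weight stays finite.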
Well, using laplacian smoothing will solve the problem while ensuring that OOV tokens get the highest -log(P) value.
OTOH, lt-print seems to not be showing inf weights as shown above. I am convinced now that an edge with an infinite weight isn't that useful in most fsts.
> Well, using laplacian smoothing will solve the problem while ensuring that OOV tokens get the highest -log(P) value.
Yes that should be good.
> OTOH, lt-print seems to not be showing inf weights as shown above. I am convinced now that an edge with an infinite weight isn't that useful in most fsts.
Yeah, so infinite weights in the tropical semiring are mainly good for theoretical constructions like graph completion (where every state must have a transition for every symbol). You could check the code to see where the inf parsing/printing/handling goes awry, since theoretically it should be possible to support it, but it's not a high priority at all.
I believe the issue here is not with lt-comp but with the way floating point numbers are written in the current file format, since the functions used in compression.cc to disassemble doubles are unspecified when applied to inf (https://en.cppreference.com/w/cpp/numeric/math/frexp).
We can reserve 0xFFFFFFFF 0xFFFFFFFF as inf. But is -inf meaningful?
I think the tropical semiring weight structures we use are only well defined in R+ including positive infinity. They may kind of work with negative values, and I guess one could interpret a path with negative infinity as an unconditional top suggestion...
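To make the role of the two infinities concrete, here is a small sketch of the tropical semiring operations on doubles. The function names are illustrative only.

```cpp
#include <algorithm>
#include <limits>

constexpr double kInf = std::numeric_limits<double>::infinity();

// Tropical "plus": pick the better (lower-weight) of two alternative paths.
double tropical_plus(double a, double b) { return std::min(a, b); }

// Tropical "times": concatenate two arcs by adding their weights.
double tropical_times(double a, double b) { return a + b; }
```

With these definitions, +inf is the identity for tropical_plus and absorbs under tropical_times, so it acts as "no path". A -inf weight wins every tropical_plus comparison, which matches the "top suggestion" reading, but adding +inf and -inf as doubles gives NaN, so mixing the two is not well behaved.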
Implemented by reserving 0xFFFFFFFF 0xFFFFFFFF as inf and 0xFFFFFFFF 0xFFFFFFFE as -inf.
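The reserved-pair idea can be sketched as follows. This is not the actual compression.cc code: the two-word (mantissa, exponent) layout, the 2^30 scale factor, and the names encode_weight and decode_weight are all assumptions for illustration. Only the two reserved bit patterns come from the comment above.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <utility>

using Words = std::pair<std::uint32_t, std::uint32_t>;  // (mantissa, exponent)

constexpr std::uint32_t kMarker = 0xFFFFFFFFu;  // first word of both sentinels
constexpr std::uint32_t kPosInf = 0xFFFFFFFFu;  // second word for +inf
constexpr std::uint32_t kNegInf = 0xFFFFFFFEu;  // second word for -inf
constexpr double kScale = 1073741824.0;         // 2^30, assumed precision

// Encode a finite double via frexp, or emit a reserved pair for +/-inf.
// NaN is not handled in this sketch.
Words encode_weight(double value) {
    if (std::isinf(value)) {
        return {kMarker, value > 0 ? kPosInf : kNegInf};
    }
    int exp = 0;
    double fraction = std::frexp(value, &exp);  // 0, or |fraction| in [0.5, 1)
    // |fraction * 2^30| is 0 or at least 2^29, so the mantissa is never -1
    // and a finite value can never produce 0xFFFFFFFF as its first word.
    auto mantissa = static_cast<std::int32_t>(fraction * kScale);
    return {static_cast<std::uint32_t>(mantissa),
            static_cast<std::uint32_t>(exp)};
}

// Decode, checking for the reserved pairs before calling ldexp.
double decode_weight(const Words& w) {
    constexpr double inf = std::numeric_limits<double>::infinity();
    if (w.first == kMarker) {
        return w.second == kNegInf ? -inf : inf;
    }
    // Reinterpreting as signed assumes two's complement (guaranteed in C++20).
    auto mantissa = static_cast<std::int32_t>(w.first);
    auto exp = static_cast<std::int32_t>(w.second);
    return std::ldexp(mantissa / kScale, exp);
}
```

Checking for infinity before frexp avoids relying on the unspecified exponent that frexp stores for infinite input. Finite values keep only about 30 bits of mantissa in this layout.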
ICU u_sscanf() only supports all-upper INF and -INF, and will print all-upper. So first quirk was adding a special case parse for lower-case inf and -inf.
See if that breaks anything.
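The special-case parse described above might look roughly like this. parse_weight_text is a hypothetical helper, and std::stod is only a stand-in for the ICU u_sscanf() call (std::stod happens to accept inf itself, so the early checks here are purely illustrative).

```cpp
#include <limits>
#include <string>

// Hypothetical helper: handle lower-case "inf" / "-inf" up front, then hand
// everything else to a general numeric parser (std::stod as a stand-in).
double parse_weight_text(const std::string& text) {
    constexpr double inf = std::numeric_limits<double>::infinity();
    if (text == "inf") return inf;
    if (text == "-inf") return -inf;
    return std::stod(text);  // throws std::invalid_argument on non-numbers
}
```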
We need to implement a way to represent infinite weights. The current outcome is strange!