apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Support infinite weights in lt-comp/lt-proc #62

Open AMR-KELEG opened 5 years ago

AMR-KELEG commented 5 years ago

We need a way to represent infinite weights. The current outcome is strange: the arc given weight inf below is printed by lt-print with weight -2.000000.

$ cat sample.att
0       1       a       b       2
1       2       b       c       1
1       2       c       d       inf
2       0

$ lt-comp lr sample.att sa.bin
main@standard 3 3

$ lt-print sa.bin
0       1       a       b       1.000000
1       2       b       c       2.000000
1       2       c       d       -2.000000
2       0.000000

flammie commented 5 years ago

I think functions like atof and strtod should just work with inf as a string. Inf is not the most useful weight though, given that inf+x is inf for all x. I think at least openfst just decides to bounce when it sees an inf arc (considering it a non-arc); hfst also prints +? in xerox mode as an analysis with weight inf, and so on.
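A minimal sketch of the point about strtod, assuming a standard C++11 library: std::strtod accepts "inf" in any case, and adding any finite weight to an infinite one stays infinite. The helper names parse_weight and path_weight are hypothetical and are not part of lttoolbox.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// Parse a weight field with std::strtod, which accepts "inf"/"infinity"
// (case-insensitive, optional sign) under C99/C++11 rules.
// parse_weight is a hypothetical helper, not part of lttoolbox.
double parse_weight(const std::string& field) {
  char* end = nullptr;
  double w = std::strtod(field.c_str(), &end);
  // If nothing was consumed, the field was not a number; fall back to 0.
  if (end == field.c_str()) {
    return 0.0;
  }
  return w;
}

// In the tropical semiring a path weight is the sum of its arc weights,
// so a single infinite arc makes the whole path infinite.
double path_weight(double a, double b) {
  return a + b;
}
```

So parsing the text is unlikely to be where the value gets lost; the sum along any path through an inf arc is inf, which is why such an arc behaves like a non-arc.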

For OOVs it's good enough to have a reasonably high non-inf number. For more advanced implementations one can calculate probability estimates with methods like https://en.wikipedia.org/wiki/Additive_smoothing, https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing and so forth.

AMR-KELEG commented 5 years ago

Well, using Laplace smoothing will solve the problem while ensuring that OOV tokens get the highest -log(P) value.
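To make that concrete, here is a rough sketch of add-one smoothing over token counts. The function smoothed_weight and the choice to reserve exactly one vocabulary slot for OOV tokens are assumptions for illustration, not something lttoolbox provides.

```cpp
#include <cmath>
#include <map>
#include <string>

// Add-one (Laplace) smoothed weight: -log((count + 1) / (total + vocab)).
// `vocab` counts the distinct seen tokens plus one slot for OOV tokens
// (an assumption made for this sketch).
double smoothed_weight(const std::map<std::string, int>& counts,
                       const std::string& token) {
  int total = 0;
  for (const auto& kv : counts) {
    total += kv.second;
  }
  const double vocab = static_cast<double>(counts.size()) + 1.0;
  auto it = counts.find(token);
  const int c = (it == counts.end()) ? 0 : it->second;
  return -std::log((c + 1.0) / (total + vocab));
}
```

An unseen token has count 0, so it gets the smallest probability and therefore the largest weight, but that weight stays finite.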

OTOH, lt-print seems to not be showing inf weights as shown above. I am convinced now that an edge with an infinite weight isn't that useful in most fsts.

flammie commented 5 years ago

> Well, using Laplace smoothing will solve the problem while ensuring that OOV tokens get the highest -log(P) value.

Yes that should be good.

> OTOH, lt-print seems to not be showing inf weights as shown above. I am convinced now that an edge with an infinite weight isn't that useful in most fsts.

Yeah, so infinite weights in the tropical semiring are mainly good for theoretical constructions like graph completion (where every state must have a transition for every symbol). You could check the code to see where the inf parsing/printing/handling goes awry, since theoretically it should be possible to support it, but it's not a high priority at all.

mr-martian commented 2 years ago

I believe the issue here is not with lt-comp but with the way floating point numbers are written in the current file format: the functions used in compression.cc to disassemble doubles give unspecified results when applied to inf (https://en.cppreference.com/w/cpp/numeric/math/frexp).
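A small sketch of why frexp is a problem here, assuming only the standard library: for a finite input the mantissa and exponent can be recombined with ldexp, but for inf the returned mantissa is inf itself and the exponent is unspecified. The Parts struct and split_weight are hypothetical names, not the code in compression.cc.

```cpp
#include <cmath>

// Split a double into mantissa and exponent with std::frexp.
// For +/-inf, frexp returns the infinity itself and leaves the exponent
// unspecified, so the pair cannot be trusted to round-trip.
struct Parts {
  double mantissa;
  int exponent;
  bool finite;  // false when the exponent must not be relied on
};

Parts split_weight(double w) {
  Parts p{0.0, 0, std::isfinite(w)};
  p.mantissa = std::frexp(w, &p.exponent);
  return p;
}
```

Any writer built on this decomposition would need to check the finite flag before serializing the two parts.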

TinoDidriksen commented 2 years ago

We can reserve 0xFFFFFFFF 0xFFFFFFFF as inf. But is -inf meaningful?

flammie commented 2 years ago

I think the tropical semiring weight structures we use are only well defined in R+ including positive infinity. They may kind of work with negative values, and I guess one could interpret a path with negative infinity as an unconditional top suggestion...

TinoDidriksen commented 2 years ago

Implemented by reserving 0xFFFFFFFF 0xFFFFFFFF as inf and 0xFFFFFFFF 0xFFFFFFFE as -inf.
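The idea of reserving word pairs can be sketched as follows. This is not the actual lttoolbox code: the two-word (mantissa, exponent) layout, the 2^30 scale factor, and the names Words, encode_weight and decode_weight are all assumptions made for illustration. Only the two reserved pairs come from the comment above.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <utility>

// Hypothetical two-word encoding: (scaled mantissa, exponent), with two
// reserved pairs standing in for +inf and -inf as described in the thread.
using Words = std::pair<std::uint32_t, std::uint32_t>;

const Words kPosInf{0xFFFFFFFFu, 0xFFFFFFFFu};
const Words kNegInf{0xFFFFFFFFu, 0xFFFFFFFEu};

Words encode_weight(double w) {
  if (std::isinf(w)) {
    return w > 0 ? kPosInf : kNegInf;  // never hand inf to frexp
  }
  int exp = 0;
  double frac = std::frexp(w, &exp);  // |frac| in [0.5, 1), or 0
  // Scale factor 2^30 is an assumption chosen so the result fits in int32.
  // NaN is not handled in this sketch.
  auto mant = static_cast<std::int32_t>(frac * 1073741824.0);
  return {static_cast<std::uint32_t>(mant), static_cast<std::uint32_t>(exp)};
}

double decode_weight(const Words& w) {
  if (w == kPosInf) return std::numeric_limits<double>::infinity();
  if (w == kNegInf) return -std::numeric_limits<double>::infinity();
  auto mant = static_cast<std::int32_t>(w.first);
  auto exp = static_cast<std::int32_t>(w.second);
  return std::ldexp(static_cast<double>(mant) / 1073741824.0, exp);
}
```

Under this layout the reserved pairs cannot collide with a finite value: a nonzero finite weight always yields a scaled mantissa of magnitude at least 2^29, so a first word of 0xFFFFFFFF (that is, -1) never arises from frexp.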

ICU u_sscanf() only supports all-upper INF and -INF, and will print all-upper. So the first quirk was adding a special-case parse for lower-case inf and -inf.
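A special-case parse of that kind might look roughly like the sketch below. It works on std::string and does not call ICU at all; try_parse_inf is a hypothetical helper, not the function that was actually added.

```cpp
#include <algorithm>
#include <cctype>
#include <limits>
#include <string>

// Returns true and sets `out` when `field` spells inf or -inf in any case.
// try_parse_inf is a hypothetical helper; it does not call ICU.
bool try_parse_inf(std::string field, double& out) {
  std::transform(field.begin(), field.end(), field.begin(),
                 [](unsigned char c) {
                   return static_cast<char>(std::toupper(c));
                 });
  const double inf = std::numeric_limits<double>::infinity();
  if (field == "INF") {
    out = inf;
    return true;
  }
  if (field == "-INF") {
    out = -inf;
    return true;
  }
  return false;  // caller falls through to the normal numeric parser
}
```

The caller would try this first and only hand the field to the regular number parser when it returns false.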

See if that breaks anything.