Closed asfimport closed 12 years ago
Dawid Weiss (@dweiss) (migrated from JIRA)
An empty (but compiling and consistent) take at the FST/FSA API.
Dawid Weiss (@dweiss) (migrated from JIRA)
This is my take at the revamped FST API. My changes are mostly aiming at having a bit clearer code (especially wrt. to loops), but also detach the "algebra" of a transition's output from the actual output. This should allow us to create an output algebra that would work directly on mutable integers, for example (to save on autoboxing). I also just like the way it reads after the changes:
FST<Integer> fst = FSTBuilder.fst(FST.ArcLabel.BYTE2, PositiveInt.class)
.add("abc", 10)
.add("abc, 5)
.add("def", 0, 3), 2)
.build();
or a loop over all arcs of a state:
Arc<Integer> arc = fst.getRoot();
for (Arc<Integer> tmp = arc.copy(); tmp.hasNext(); tmp.next()) {
int label = tmp.getLabel(); // transition label here.
Integer output = tmp.getOutput();
}
I definitely didn't consider all the use cases that FSTs are used for currently (in particular the "stop" bit indicating non-accepted input sequences that are also dead ends), but I think these could be integrated... I think :)
Arcs now also store the pointer to the FST object, which may seem like an overhead, but I doubt it really will be (it's a single pointer and we buffer arcs whenever we can; a larger waste is having an object on each arc's output, even if it can be a primitive type or reused buffer).
Michael McCandless (@mikemccand) (migrated from JIRA)
This new FST API looks sweet! Nice work :)
So with this we no longer need static Util methods right? (Since each arc can .follow a sequence of inputs).
I like OutputAlgebra ... better matches what this class actually does, and if this means we can not create a new Object for every arc transition that would be great (this makes FST lookups costly now).
I don't know if this is possible, but, one thing I don't like about the current API is that the BYTE1/2/4 is an enum and not parameterized into the Builder/FST. Ie, Builder/FST should really take the input type as a type param too, since really an FST acts like a SortedMap<K,V>. But I fear this could get scary-hairy w/ the required generics...
Dawid Weiss (@dweiss) (migrated from JIRA)
Thanks Mike. I agree it'd be nice to have a flexible label type as well, but I have no idea how to make it efficient (and code-clean) yet. You could do a similar thing as with the outputs (using either a boxed type if you don't care about performance that much or a mutable wrapper if you do care about GC), but how this would affect the API I have no idea right now. There is also the lexicographic order that one would need to consider (a comparator would need to be passed as part of the construction process and then for traversals). It'll get complicated.
I was also thinking of just dropping support for BYTE1/2 and leaving fixed int labels... This would bloat byte-labeled automata a little bit (if they're ASCII they'd v-code into a single byte anyway), but would strip down the ugliness of BYTE1/2/4... All methods accepting BytesRef and CharSequence would still be there, translated on the fly, but the representation of labels would always be an int.
One more question: can you give me traversal use cases you're using FSTs for now? I'll try to implement them and see how the new API works out in practice. I looked at the FSTEnum and it has next(), seekCeil() and seekFloor().
I'm also a bit terrified by the about of changes this would introduce if we decided to switch the APIs (tests, scattered use cases...). Don't know if I'll have the time to update this all.
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks Mike. I agree it'd be nice to have a flexible label type as well, but I have no idea how to make it efficient (and code-clean) yet. You could do a similar thing as with the outputs (using either a boxed type if you don't care about performance that much or a mutable wrapper if you do care about GC), but how this would affect the API I have no idea right now. There is also the lexicographic order that one would need to consider (a comparator would need to be passed as part of the construction process and then for traversals). It'll get complicated.
Yeah this was my fear :)
I was also thinking of just dropping support for BYTE1/2 and leaving fixed int labels... This would bloat byte-labeled automata a little bit (if they're ASCII they'd v-code into a single byte anyway), but would strip down the ugliness of BYTE1/2/4... All methods accepting BytesRef and CharSequence would still be there, translated on the fly, but the representation of labels would always be an int.
Hmm, that makes me nervous – this could be a non-negligible increase in FST size for the non-ascii case I think?
One more question: can you give me traversal use cases you're using FSTs for now? I'll try to implement them and see how the new API works out in practice. I looked at the FSTEnum and it has next(), seekCeil() and seekFloor().
I think SimpleText codec is a good example? Also VariableGapTermsIndexReader, and MemoryCodec? Each of these use the BytesRefFSTEnum, I believe.
I'm also a bit terrified by the about of changes this would introduce if we decided to switch the APIs (tests, scattered use cases...). Don't know if I'll have the time to update this all.
I think it's still fairly contained at this point? (Ie the number of tests that directly use the FST APIs).
Dawid Weiss (@dweiss) (migrated from JIRA)
this could be a non-negligible increase in FST size for the non-ascii case I think?
I don't know. If the non-ASCII is encoded as UTF8 for the BytesRef, then storing full unicode points on transitions shouldn't really account for much more (in fact it may create fewer states/ transitions because multibyte UTF8 sequences will require multiple transitions)? This we would need to check, of course. And I assume input sequences ARE text, which in general may not be the case... I think I'll leave BYTE1/BYTE4 an option for now and see if I can improve on it once I have a working test suite.
I think SimpleText codec is a good example? Also VariableGapTermsIndexReader, and MemoryCodec? Each of these use the BytesRefFSTEnum, I believe.
I wasn't clear – I can find the places where they're used, but I wanted to clarify the nature of stored keys and values (are they UTF8 text, utf16, unicode, random bytes)? I can go through the code, but you're probably a faster source of information on this one. Robert, if you're reading this – anything you envision could be stored as transition labels?
Michael McCandless (@mikemccand) (migrated from JIRA)
this could be a non-negligible increase in FST size for the non-ascii case I think?
I don't know. If the non-ASCII is encoded as UTF8 for the BytesRef, then storing full unicode points on transitions shouldn't really account for much more (in fact it may create fewer states/ transitions because multibyte UTF8 sequences will require multiple transitions)? This we would need to check, of course. And I assume input sequences ARE text, which in general may not be the case... I think I'll leave BYTE1/BYTE4 an option for now and see if I can improve on it once I have a working test suite.
Ahh, yes I agree it'd be a more interesting comparison if you use UTF32 instead of UTF8.
The case I was worried about is if you must use UTF8 (ie because TermsEnum speaks only BytesRef), then writing those bytes as a vInt instead of a fixed byte is a penalty to non-ascii.
I think SimpleText codec is a good example? Also VariableGapTermsIndexReader, and MemoryCodec? Each of these use the BytesRefFSTEnum, I believe.
I wasn't clear – I can find the places where they're used, but I wanted to clarify the nature of stored keys and values (are they UTF8 text, utf16, unicode, random bytes)? I can go through the code, but you're probably a faster source of information on this one. Robert, if you're reading this – anything you envision could be stored as transition labels?
Ahh... I think all uses have BytesRef (UTF8 encoded term) as the key, and various things as the values.
I don't think we've used FST during analysis yet but we should try; then I suspect we'd use UTF16 labels?
Dawid Weiss (@dweiss) (migrated from JIRA)
I think I know how to compare storing byte[] of UTF8 as compared to vint-encoded codepoints in UTF32 – I'll encode the wikipedia terms list in both ways and we will see what comes out. Theoretically they should be very, very similar (and full unicode codepoints should generate fewer arcs) because UTF8 uses an encoding scheme with similar overhead to vint encoding... os if something is a single-byte sequence in UTF8, will remain single byte vint. Double-byte UTF8 character will remaing double-byte vint (last double byte codepoint is 0x7ff=2047, whereas the last double byte vint is 2^14=16384. And so on. So for text, vint-encoded UTF32 should be more compact than UTF8... The gain is of course when your "labels" are not text, but arbitrary bytes – then byte[] representation would be nicer.
Dawid Weiss (@dweiss) (migrated from JIRA)
I encoded wikipedia termslist in UTF32 (int4) and UTF8 (int1). Interesting results:
271,461,850 utf32.fst
Arcs: 64.485.082
Nodes: 36.624.613
270,137,939 utf8.fst
Arcs: 66.478.193
Nodes: 38.687.637
So... the files are pretty much the same size... UTF32 is slighly bigger, but (as predicted) it has fewer arcs and fewer nodes. I checked and ALL input UTF8 strings are the same or longer than vint-coded UTF32 sequences... So how come UTF32 automaton is larger? I have no clue – I assume it may be something with the size of v-coded pointers... but I have no clue. In any case, the size gain from using int1 to encode UTF8 is minimal over just using full unicode codepoints and v-coded int4. Performance-wise it may be a hit (because one would need to convert UTF8/UTF16 to full unicode codepoints), but size-wise it seems to be relatively the same.
Dawid Weiss (@dweiss) (migrated from JIRA)
Oh, a wild guess: with int4 more nodes will be expanded into bsearch arrays (fixed size arcs). This may account for the observed size difference. And it may matter for traversals too (because int4 nodes will have a higher fanout, especially at root and first levels... something to consider).
Dawid Weiss (@dweiss) (migrated from JIRA)
I confirmed the above, out of sheer curiosity, by compiling without expanded nodes (just linear packing).
261,820,296 utf32-noexp.fst
271,461,850 utf32.fst
263,415,558 utf8-noexp.fst
270,137,939 utf8.fst
Michael McCandless (@mikemccand) (migrated from JIRA)
OK, these results make sense! UTF32 (vInt labels) is more compact than UTF8, if you disable array'd arcs. These wiki terms are from the en export right? So the differences are due to the smallish number of random terms that are not English... it should be more extreme if we used non-English content.
I wonder how lookup time would compare... I think UTF32 should be faster?
And yes for truly binary terms (eg collated fields, and maybe eventually numeric fields but not yet because they still avoid the 8th bit I think) I think we want to keep BYTE1.
We need some good use cases of FSTs during analysis... there we are free to make the alphabet non-byte (vs the index, where terms are a BytesRef).
Uwe Schindler (@uschindler) (migrated from JIRA)
Just one question: Is sort order for suppl chars in UTF32 compatible to UTF8 or will there be more problems? As we have a special case for UTF16 (surrogates dance), so what happens with a third "encoding"?
Dawid Weiss (@dweiss) (migrated from JIRA)
UTF32 is basically codepoint representation, so there are no surrogates (as in UTF16) and there is no special encoding of higher codepoints (as in UTF8). I don't know what sort order is used inside Lucene (is it UTF8 byte-to-byte values or decoded codepoints?). If it is codepoint order then no problem – this should be preserved.
I'll stick to BYTE1/BYTE4 inputs then for now and I'll try to push this patch forward in my spare time.
Uwe Schindler (@uschindler) (migrated from JIRA)
The term index is sorted by utf8 bytes natively. So a FST build on term index must assume that order, because the terms must be presorted for Builder. Lucene internally only works on byte[] and uses per default this order. Also most queries rely on it.
Dawid Weiss (@dweiss) (migrated from JIRA)
This is fine – "Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.", http://en.wikipedia.org/wiki/UTF-8.
Indeed, when you look at how UTF8 encodes multibyte codepoints the codepoint order will be preserved.
Dawid Weiss (@dweiss) (migrated from JIRA)
Moving to 3.4, if at all :)
David Smiley (@dsmiley) (migrated from JIRA)
I'm looking forward to these FST API improvements. It's a bit obtuse for something that is basically a SortedMap.
Dawid Weiss (@dweiss) (migrated from JIRA)
I haven't had the time to work on that, sorry. I'll publish what I have on github, if you want to give it a spin. Essentially I wanted to tackle the following issues:
Only once this is implemented it makes sense to incorporate add-ons Mike worked on (during-the-construction pruning, different arc storage formats). So, even if it's essentially a sortedmap, there is a fair bit of complexity involved if you want to make it efficient :)
Robert Muir (@rmuir) (migrated from JIRA)
3.6 pruning: can we push this out to 4.0?
Dawid Weiss (@dweiss) (migrated from JIRA)
Absolutely. This is far-future. I wouldn't even schedule it for 4.0 - I simply won't have the time for rewriting all the code that Mike and you created on top of the FST class :)
Dawid Weiss (@dweiss) (migrated from JIRA)
Won't have the time for this in any near future.
The current API is still marked
@experimental
, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these – I'll post a patch once I have a working proof of concept.Migrated from LUCENE-3206 by Dawid Weiss (@dweiss), 1 vote, resolved Apr 20 2012 Attachments: LUCENE-3206.patch Linked issues:
4755