Cleanup suggester API [LUCENE-3807]

asfimport commented 12 years ago

Currently the suggester api and especially TermFreqIterator don't play that nice with BytesRef and other paradigms we use in lucene, further the java iterator pattern isn't that useful when it gets to work with TermsEnum, BytesRef etc. We should try to clean up this api step by step moving over to BytesRef including the Lookup class and its interface...

Migrated from LUCENE-3807 by Simon Willnauer (@s1monw), resolved Mar 07 2012 Attachments: LUCENE-3807_3x.patch, LUCENE-3807.patch (versions: 9)

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a first patch extracting a BytesRefIterator as a common interface directly shared with TermsEnum as a first step. The semantics are taken from TermsEnum directly so nothing changes for TermsEnum and its consumers. The simple next method makes the most of the TermFreqIters trivial.

comments welcome...

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

+1.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I like the patch, but only one thing (its fine to commit it as-is though, we can solve this on another issue, i just couldnt help but notice)

I don't think we should have the BufferedTermFreqIteratorWrapper/etc and the SortedTermFreqIterator marker interface needs to be fixed.

Here are the problems:

Marker interface SortedTermFreqIterator doesn't tell you if its UTF-8 or UTF-16 order. Its implemented by two classes: SortedTermFreqIteratorWrapper, which sorts in UTF-16 order, and HighFrequencyDictionary, which returns terms from the index (so UTF-8 order). The problem is that classes that rely upon sorted order like JaSpell/TST are likely broken already. Fortunately FST/WFST always do their own sort.
Buffering in RAM is not ideal. Instead I think all of these classes should be using our Sort anyway which can spill to disk.

For now could we put the BytesRefList in the suggest package since its only used there? we might not need it after we clean up this sorting stuff in some future issue.

Also I don't think we should factor out the BytesRefIterator. I seriously think its a bad idea to tie our core index Terms enumeration API with the spellcheck API at this time, it would make it hard to change in the future if we need, especially with spellcheck being... needing work :)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Again i just wanted to mention, i think its fine to commit as-is.

Consider my comments thinking-out-loud.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I committed the patch to trunk and moved the BytesRefList to the suggest package (pkg private) for now. I keep this issue open as I have more iterations / patches. I will backport before I close this issue ie. batch port to 3.x all commits related to this issue. commit revision is 1291418

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think we should go with a simpler solution for the Buffering/SortedIterator... this has caused a few recent test fails.

(latest: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12476/)

If we are going to replace it anyway with the Sort that spills to disk, cant we just use a simple treemap?

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

just drop it I will replace that stuff with on disk sort tomorrow anyway

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think we should go with a simpler solution for the Buffering/SortedIterator... this has caused a few recent test fails.

those failures have been fixed..

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

next patch...

cut over Lookup API to use CharSequence instead of string. this allows us to use CharsRef and saves object creation.
removed Sorted marker interface. BytesRefIterator has a getComparator() that can be null to indicate that there is no specified sort order. That way we can compare the Comparator<BytesRef> instance to see if the sort order matches what the impl expects.
added a IOException to Dictionary#getWordsIterator() removing tons of try/catch throw RTE
renamed freq to weight
weight (ie freq) now returns a long instead of a float discouraging floating point number as weights

next steps would be cleaning up the in-memory sorting stuff and use the sorter which goes to disk to do the actual sorting internally if needed

comments welcome

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree with removing IOException from getComparator.

I agree also with moving weight to a long... though currently the e.g. solr integrations take floats as input, so this really needs to be listed as a backwards break since it will directly affect users.

I agree with removing the sorted marker interface: its not useful since you don't know the order.

However, I don't think we should add the charsref methods... I think Bytes/Ints/CharsRef should have parallel apis and someone can just call unicodeutil: in general these are reference classes not stringbuffers and we shouldn't encourage abuse via sugar apis. I already have an issue open for fixing, cleaning up, and making those APIs consistent.

I don't think we should add a generics parameter V to Lookup, especially if LookupResult itself is still wired to float. I do think suggesters should be able to return additional data but this needs more thought: its necessary to actually get the additional data to them.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I found another bug (added test and committed fix in r1291826). This would cause an exception in solr if someone called build but the field was empty.

I think before proceeding with refactoring, we really need to beef up tests. I'll help with this.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

new patch cleaning up the api a little more according to roberts comments. I had some leftover CharsRef uses in there and I removed the generics on Lookup. This patch also removes all the additional methods on CharsRef and adds some randomization for CharSequences..

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think before proceeding with refactoring, we really need to beef up tests.

I will help too while I think the patch is ready... we can go with iterations?

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

so this really needs to be listed as a backwards break since it will directly affect users.

I will note this in the CHANGES once we are done

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I like this latest patch!

I didnt test it, but i think its fine to move forward.

I just mainly want to address the issues of whole suggesters that are not tested at all in this module (#4886) before we did too much more.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Sorry for being late, work. I like the patch. Comments:

+    public Comparator<BytesRef> getComparator() {
+      return null;
+    }

This shows up in a number of places. I have mixed feelings about certain types having a comparator and others not having it, but it's minor.

BufferingTermFreqIteratorWrapper is a nuisance (buffers in memory). It would be nicer to have a sort on disk if something doesn't support sorted iteration order.

I also wonder float ~~> long = 4 -> 8 bytes... would this count as an incompatible API change (because what used to work for a given amount of RAM won't work anymore -~~ BufferingTermFreqIteratorWrapper again)?

+      if (l1 < l2) {
+        aStop = l1;
+      } else {
+        aStop = l2;
+      }

if I remember correctly Math.min/max are intrinsics, so you can afford to be explicit ;)

Why not a specific covariant here?

-  public Float get(String key) {
+  public Object get(CharSequence key) {

This doesn't seem necessary (lookup accepts a CharSequence?).

`@@` -199,7 +199,7 `@@` public class LookupBenchmarkTest extends LuceneTestCase {
         public Integer call() throws Exception {
           int v = 0;
           for (String term : input) {
-            v += lookup.lookup(term, onlyMorePopular, num).size();
+            v += lookup.lookup(new CharsRef(term), onlyMorePopular, num).size();

I like the rest, including the CharSequenceish evilness of bytesToCharSequence :)

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

This shows up in a number of places. I have mixed feelings about certain types having a comparator and others not having it, but it's minor.

well this an indicator if we know something about the order or not. if you get null there is not order specified...

BufferingTermFreqIteratorWrapper is a nuisance (buffers in memory). It would be nicer to have a sort on disk if something doesn't support sorted iteration order.

this is the ultimate goal... see my comment above (``` next steps would be cleaning up the in-memory sorting stuff and use the sorter which goes to disk to do the actual sorting internally if needed



> I also wonder float -&gt; long = 4 -&gt; 8 bytes... would this count as an incompatible API change (because what used to work for a given amount of RAM won't work anymore – BufferingTermFreqIteratorWrapper again)?

see above

> if I remember correctly Math.min/max are intrinsics, so you can afford to be explicit 

I will upload a new patch - we use this in BytesRefComp too, I think its safe to fix

> This doesn't seem necessary (lookup accepts a CharSequence?).

right - leftover from an old iteration

> Why not a specific covariant here?

that would be Float get(String) or Float get(CharSequence) or... ;)

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

updated patch with davids suggestions

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

following up on this, here is another patch. I cleaned up the APIs even further removing get / add from Lookup which is not really used. TST and Jaspell still support add and get is still implemented on all others but its not part of the interface. The main thing that buggs me with add is that its inconsistent with build since it allows arbitrary output while build uses a Number. Once we need this we can add it back to the interface.

TST and Jaspell are now optimized to not necessarily create new Float objects all the time but share instances if the incoming weight allows it. I also create Number instances based on the range which might safe some more memory while minor IMO.

LookupResult now also returns a long instead of a float to be consistent with the TermFreqIterator.

SortedTermFreqIteratorWrapper is now based on the On-Disk sort. I extended the sorter impl a little to work with Comparator<BytesRef> and fixed the TODO by using BytesRefList with ByteBlockPool internally. WFSTCompletionLookup now also uses the SortedTFIteratorWrapper to remove duplicated code.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

more cleanups... I made BytesRefSorter using BytesRefList (we should maybe rename to bytesrefarray) and in turn use BytesRefIterator instead of the stupid java iterator. So all those classes should be consistent now.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I looked at your patch briefly, Simon. Random notes:

I'd change currentElement into lastElement or something like that. Otherwise it looks odd to me in the code, as in:

   throw new IndexOutOfBoundsException("index " + pos
       + " must be less than the size: " + currentElement);

typo in "orderdEntries".
I'm very likely paranoid but I'd stick to just one class for storing these:
```
protected Number weightAsNumber(long weight) {
```
Since these are objects the memory gain will most likely be obscured by object alignments and object overhead itself and the downside is that you're using an interface with all call sites that will very likely become megamorphic (so no chances to inline anything). I don't know if it's worth the effort.

I didn't have time to think much about changes to the functional logic; I don't think there were any (and if there were, they should be covered by tests?).

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

merged with trunk.

I added a changes.txt to contrib and fixed the issues david raised.

I think this is ready. once this is in I will backport this to 3x and close the issue

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

i did a quick review: looks good to me

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

committed to trunk.. I think we are ready to backport. I will do that in the next days

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Somewhere along the line we got load(File), load(InputStream), store(File), store(InputStream)... can we improve this?

I think its confusing when writing a new suggester to have all these hooks.

load(File) takes a directory name, so a suggester is free to use multiple files (I am working on one that might use 2 files). so then what is load(InputStream) ?!

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I added methods that used streams. They stored the underlying FST. I personally didn't like the "must point to a directory" approach.

The File(dir) methods got inherited from the original Lookup interface I think. Those saving/reading from streams were not in the interface so they were specific to a concrete implementation. Don't know how it looks now.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Right I'm not concerned about concrete impls. Concrete can do whatever it wants :)

In the actual lookup interface, load/store from stream was added underneath this issue:

http://svn.apache.org/viewvc/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/suggest/Lookup.java?r1=1236012&r2=1291418&diff_format=h

I'm not opposed to this change: but if we are going to require that implementations only use a single file, then we should remove the File ones and just let caller deal with the inputstream... does that make sense?

Currently its impossible to tell which one will be called!

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

If you say you already have an implementation that could use more than one file then it'd be silly to enforce a single file format. I agree too many options is not good, in particular it's painful for potential implementors.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well I switched to a single file already, so ignore that implementation :)

But my concern is mostly about API confusion for implementing suggesters: if we require a single file then lets just have InputStream. Otherwise, lets remove InputStream from Lookup, (of course a specific concrete Impl that only uses one file is still free to provide that)

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

+1 for streams. They're a lot more flexible than Files.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

the main reason why I moved this to the interface was that I needed to do some extra processing in the stream ie. calc a checksum and I wanted to write stuff directly to HDFS etc. I think we should go for streams or we pass in a lucene directory while I think unless really needed we should stick with a stream.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a patch that removes the file based store / load methods. Once thing I don't like about this is that we use one an the same file name in solr to store stuff. We could possibly fix this and make it bw compatible by pulling a file name from the factory for instance but even that is kind of flaky. Maybe somebody has a better idea.

Yet I think one of the biggest issues here is that we don't really have a header on the actual implementation. Ie you could simply load a FST from a FST suggester into WFST but the results would be bogus. I think we should add real headers to the files to fail early and give good error messages.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a patch against 3.x

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a new patch removing the file based store / load methods and adding a fileName method to solrs factories to maintain compatibility. I will commit this soon.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

backported committed patches to 3.x in rev 1297946

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I committed the last patch to trunk and backported to 3x.

thanks guys

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Simon! hard work but I expect there is more.

Hopefully this module will grow into something much bigger...

asfimport commented 9 years ago

Arcadius Ahouansou (@arcadius) (migrated from JIRA)

Hi @s1monw

Sorry for coming late...

weight (ie freq) now returns a long instead of a float discouraging floating point number as weights

There are many cases like with the AnalyzingInfixSuggester where the weight is pre-computed outside of Lucene before ingestion.

Having a double (floating point) makes sense as it would help avoid all sort of rounding/data-loss.

apache / lucene

Cleanup suggester API [LUCENE-3807] #4880