apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

LSH Filter [LUCENE-6968] #8025

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

I'm planning to implement LSH, which supports queries like this:

Find similar documents that have a similarity score of 0.8 or higher with a given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.

For example, given the following corpus:

1. Solr is an open source search engine based on Lucene
2. Solr is an open source enterprise search engine based on Lucene
3. Solr is an popular open source enterprise search engine based on Lucene
4. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java

We want to find documents that have a Jaccard similarity of 0.6 or higher with this doc:

Solr is an open source search engine

It will return only docs 1, 2, and 3 (MoreLikeThis would also return doc 4).
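For illustration only (a quick sketch, not part of the proposal itself), the word-level Jaccard arithmetic behind this, assuming simple whitespace tokenisation and lower-casing:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardExample {

  // Word-level Jaccard similarity of two texts: |A ∩ B| / |A ∪ B|.
  static double jaccard(String a, String b) {
    Set<String> setA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
    Set<String> setB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
    Set<String> intersection = new HashSet<>(setA);
    intersection.retainAll(setB);
    Set<String> union = new HashSet<>(setA);
    union.addAll(setB);
    return (double) intersection.size() / union.size();
  }

  public static void main(String[] args) {
    String query = "Solr is an open source search engine";
    // Doc 1 shares all 7 query words: 7 / 10 = 0.70, above a 0.6 threshold.
    System.out.println(jaccard(query, "Solr is an open source search engine based on Lucene"));
    // Doc 4 shares only "is", "search" and "engine": 3 / 18 ≈ 0.17, so it is
    // filtered out even though MoreLikeThis would still consider it related.
    System.out.println(jaccard(query, "Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java"));
  }
}
```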


Migrated from LUCENE-6968 by Cao Manh Dat (@CaoManhDat), resolved Jun 14 2016 Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, LUCENE-6968.6.patch, LUCENE-6968.patch (versions: 3) Linked issues:

asfimport commented 8 years ago

Cao Manh Dat (@CaoManhDat) (migrated from JIRA)

Initial patch. Can find similar documents based on Jaccard similarity.

asfimport commented 8 years ago

Cao Manh Dat (@CaoManhDat) (migrated from JIRA)

This new patch includes:

asfimport commented 8 years ago

Joel Bernstein (@joel-bernstein) (migrated from JIRA)

@CaoManhDat, interesting ticket!

Can you describe some of the potential uses for LSH? For example, will this be helpful in approximating K nearest neighbors?

asfimport commented 8 years ago

Cao Manh Dat (@CaoManhDat) (migrated from JIRA)

Yes, it's kind of like finding the K nearest neighbors, but there's a difference here:

In both cases, the results will be ranked by distance to the center doc.

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

Since this has also been mentioned in SOLR-7739, maybe the LSH filter can be used to create a Lucene nearest-neighbour Classifier that uses such a filter (instead of the existing k-nearest-neighbour one).

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

I was having a second look at this patch: why is LSHSimilarity returning 1 in all its methods? Is that intended, or does it mean it still needs to be implemented?

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

@CaoManhDat, I'd like to commit this patch shortly. My only concern is around the similarity implementation: having all methods return 1 doesn't seem correct to me, or am I missing something?

asfimport commented 8 years ago

Joel Bernstein (@joel-bernstein) (migrated from JIRA)

Hi, I believe there will be some work forthcoming from Alfresco on this ticket. We may want to hold off until those patches go up. Hopefully I will have an update on this soon.

@tteofili, if you feel strongly though that this is ready to commit, then go for it. We can always make adjustments after the committal as well.

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

Hi

It would be quite common to use min hashing after shingling. At this point the number of possible word combinations vs the size of the hash is important. With shingles of 5 words drawn from a 100,000-word vocabulary, that is 10^25 combinations. Some naive processing of the ~500k Enron emails (splitting on whitespace, case folding and 5-word shingles) gives ~52M combinations. So a long (64-bit) hash, with ~1.8e19 possible values, would be better. I have not yet looked at a larger corpus.

The LSH query is neat. However, the banding logic can leave the last band uneven: in the patch I think the last band would be dropped unless bands * rows-per-band = the number of hashes (e.g. 100 hashes split into bands of 7 rows leaves a final band of only 2).

The underlying state of the source filter may also be lost (if using shingling).

I do not believe the similarity is required at all. I think you can get Jaccard distance using constant score queries and disabling coordination on the boolean query.

I went for 128-bit hashes (a 32-bit hash identifier + a 96-bit hash) with a bit more flexibility, allowing a minimum set of hash values to be kept per hash. There is clearly some trade-off between speed of hashing and over-representing short documents. The minimum set may be a solution to this. I think there is some interesting research there.

I will add my patch, inspired by the original, and apologise in advance for the mixed formatting.

asfimport commented 8 years ago

Cao Manh Dat (@CaoManhDat) (migrated from JIRA)

What a wonderful patch! The code is optimized; I'm sure the index will be much smaller!

But the patch keeps several of the lowest values for each position; does that affect the formula

Pr(h(s1) = h(s2)) = Jaccard(s1, s2)

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

The argument here says it is pretty much the same.

https://en.wikipedia.org/wiki/MinHash

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the number of hashes may depend on the start document.

Let's start with 5-word shingles, one hash function, and keeping the minimum 100 hash values. For a five-word document we get one hash. For a 100-word doc where all the shingles/words are the same we get one hash. If all the shingles are different we get 96 hashes.

If instead we have 100 different hash functions and keep the lowest value of each, all the above cases end up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes you really got and then do the banding. Comparing a small document/snippet (where we get 10 hashes in the fingerprint) with a much larger document (where we get 100 hashes) is an interesting case to consider. Starting with the small document there are fewer bits to match in the generated query. With 100 hashes from the small document I think you end up in roughly the same place, except for small snippets: any given band is more likely to have the same shingle hashed different ways.

With a 100-hash fingerprint, sampling is great for 100 words but not so great for 100,000 words. With a minimum set we have the option to generate a fingerprint related to the document length and other features.

There is also an argument for a winnowing approach.

asfimport commented 8 years ago

Cao Manh Dat (@CaoManhDat) (migrated from JIRA)

Thanks for the link. I totally agree that keeping some of the lowest values for a single hash function would be better.

But the wiki doc points out that the estimator formulation for the "variant with a single hash function" is not the same as the estimator formulation for the "variant with many hash functions", so the generated query must be different for each case.

For example, in the case where we use a single hash function and keep some of the lowest values:
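For illustration, a minimal sketch (not from any of the attached patches) of that single-hash estimator as the Wikipedia article describes it: keep the k smallest values of one hash function per set, and estimate Jaccard(A, B) as |MinK(MinK(A) ∪ MinK(B)) ∩ MinK(A) ∩ MinK(B)| / k:

```java
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

public class SingleHashMinK {

  // Keep the k smallest hash values of a set (natural ordering, for simplicity).
  static NavigableSet<Long> minK(Set<Long> hashes, int k) {
    TreeSet<Long> sorted = new TreeSet<>(hashes);
    while (sorted.size() > k) {
      sorted.pollLast();
    }
    return sorted;
  }

  // Estimate Jaccard(A, B) from the two k-sized sketches of a single hash function.
  static double estimateJaccard(Set<Long> hashesA, Set<Long> hashesB, int k) {
    NavigableSet<Long> minA = minK(hashesA, k);
    NavigableSet<Long> minB = minK(hashesB, k);
    TreeSet<Long> merged = new TreeSet<>(minA);
    merged.addAll(minB);
    NavigableSet<Long> minUnion = minK(merged, k); // == MinK(A ∪ B)
    TreeSet<Long> shared = new TreeSet<>(minUnion);
    shared.retainAll(minA);
    shared.retainAll(minB);
    return (double) shared.size() / k;
  }
}
```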

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

This comes down to "what is a good estimate of |A ∪ B|?" and whether we need it for the use case.

For query, the main use case is finding documents like one source document. So we are comparing doc A with all other documents. What we need is a measure that is fair for A -> B and A -> C; we probably do not care about B -> C. If we take the fingerprint from A and just OR the hashes together into a big query, we have a consistent measure of similarity of A with any other document. This particular measure is biased: for a start, Sim(A, B) is not equal to Sim(B, A). But for this use case that may not matter. This measure captures both inclusion and duplication, which may be a good thing, and it is pretty intuitive what it means. It is |A ∩ B| / |A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of |A ∪ B| to normalise our measure/estimate of |A ∩ B|. This could be (|A| + |B| - |A ∩ B|), or some similar estimate. We could use the size of the fingerprint sets, or we could keep the full ordered set of hashes and have extra statistics like the total number of hashes and the total number of unique hashes.
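A hypothetical worked example (the numbers are invented purely for illustration): suppose fingerprint A has 100 hashes, fingerprint B has 80, and 40 of them are shared. Then:

```latex
\mathrm{Sim}(A \to B) = \frac{|A \cap B|}{|A|} = \frac{40}{100} = 0.40, \quad
\mathrm{Sim}(B \to A) = \frac{|A \cap B|}{|B|} = \frac{40}{80} = 0.50, \quad
\mathrm{Jaccard}(A, B) \approx \frac{|A \cap B|}{|A| + |B| - |A \cap B|} = \frac{40}{140} \approx 0.29
```

So the one-sided measure is asymmetric, while the normalised Jaccard estimate is the same in both directions.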

For two short documents, where there are fewer fingerprint hashes than the maximum, we have the full sets. For two larger docs we have an estimate based on the min hash sets. You can argue that "min of many hashes" is a random sample with replacement and "min set of one hash" is a random sample without replacement, if your hash is good. If the sample is small compared with the set of all hashes, the arguments converge.

If we were doing arbitrary comparisons between any pair of documents then we would have to use an unbiased estimator. Finding candidate pairs, moving onto clustering, ...

asfimport commented 8 years ago

Cao Manh Dat (@CaoManhDat) (migrated from JIRA)

Thanks, that makes sense!

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

attaching a slightly modified version of the last patch:

I've noticed, though, that the filter doesn't perfectly align the end offset attribute (it can end up beyond the input length); in fact, if I run all the tests, the TestFactories one fails with the following:

Suite: org.apache.lucene.analysis.core.TestFactories
   [junit4]   2> TEST FAIL: useCharFilter=true text='uuzfmo'
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestFactories -Dtests.method=test -Dtests.seed=9CF9D39BDAB31A80 -Dtests.slow=true -Dtests.locale=sv -Dtests.timezone=Asia/Choibalsan -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 13.5s J3 | TestFactories.test <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: endOffset must be <= finalOffset: got endOffset=7 vs finalOffset=6
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([9CF9D39BDAB31A80:14ADEC41744F7778]:0)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:211)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:300)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:304)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:828)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:627)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:525)
   [junit4]    >    at org.apache.lucene.analysis.core.TestFactories.doTestTokenFilter(TestFactories.java:105)
   [junit4]    >    at org.apache.lucene.analysis.core.TestFactories.test(TestFactories.java:58)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=Lucene60, sim=RandomSimilarity(queryNorm=true,coord=no): {}, locale=sv, timezone=Asia/Choibalsan
   [junit4]   2> NOTE: Mac OS X 10.11.3 x86_64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=197816752,total=324534272
   [junit4]   2> NOTE: All tests run in this JVM: [TestPatternCaptureGroupTokenFilter, TestSnowballPorterFilterFactory, TestBulgarianStemFilterFactory, TestFactories]

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The analysis-common library cannot have any external dependencies.

So if this filter is going to depend on Guava, it needs to be its own analysis module, just like all the other analysis modules that have third-party dependencies.

I would try to remove the Guava dependency at all costs anyway. Depending on Guava causes hell for developers.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, these analyzers should not be tested with queries. Instead, just use the asserts in BaseTokenStreamTestCase to check that the correct tokens are output.

I see some Solr code was copied so that things can be tested with queries. I know that Solr likes to test this way, but nobody wants to debug a test failure that exercises an analyzer this way.

Just keep in mind we have a ton of analyzers, and when things go wrong (and they do), people have to debug them or even do refactorings across hundreds of them. That is why we have a consistent test class (BaseTokenStreamTestCase).

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

Thanks Robert, I agree with all of your suggestions. The error reported by TestFactories confirms what you say, and I can work on those more appropriate tests; however, I would also like to keep the "query" ones so as to keep an eye on whether one of the use cases for this filter is supported as expected (a sort of integration test). Removing Guava might of course be possible (and I'd prefer to do that), although rewriting the hashing code is likely to be a bit annoying.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We don't need any "integration tests" or "end to end tests".

-1 to those. Just think about it. Tokenizers only output one thing, and that is tokens. And that is what we should test.

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

I agree a pure token stream test makes sense. The only concern I have is about testing token filters chained together. Chaining shingle generation with min hashing requires that the underlying token stream has its state reset correctly for reuse. As I missed this, I added a test to cover it. Is there somewhere else in the test framework that covers this case, e.g. some randomised chaining of filters? Perhaps chaining is more of a Solr thing.

I would prefer to stick with a 128/96-bit hash. Link [1] below suggests that 5-shingles become well distributed. Link [2] says up to 2/3 of all possible trigrams have been seen in 30 years of news articles. So it seems we can expect to see many of the possible 5-shingles. Some bioinformatics use cases may also require this.

[1] http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html
[2] http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

I was not that keen to add Guava! However, it was already there somewhere. I am happy if this moves off into a separate module. I will also look to see how this dependency could be removed.

Perhaps we should take some time to consider how to include the fingerprint length (the sum of the min set sizes over all hashes) to support an unbiased query. An unbiased query would be more difficult to build correctly. Some fingerprint/LSH query support and tests may make sense. Some other statistics may also be useful in generating faster queries that find similar documents given some threshold and a probability of meeting that threshold.

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

@yonik has murmurhash3_x64_128 here https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

After a bit more digging, the single-hash, keep-the-minimum-set approach can be improved.

See: [1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf [2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keeping one global minimum set, split the hash space up into 500 buckets (for a 500-hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right, with rotation.
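A minimal sketch of that idea (illustrative only, not the committed code), assuming one 64-bit hash per shingle, the hash space split into equal unsigned ranges, and empty buckets simply borrowing the value of the next non-empty bucket to the right:

```java
public class BucketedMinHashSketch {

  // Keep one minimum hash per bucket of the unsigned 64-bit hash space, then
  // densify: an empty bucket borrows the minimum of the next non-empty bucket
  // to the right, wrapping around ("rotation").
  static long[] fingerprint(long[] shingleHashes, int numBuckets) {
    long bucketWidth = Long.divideUnsigned(-1L, numBuckets) + 1; // size of each hash range
    long[] mins = new long[numBuckets];
    boolean[] seen = new boolean[numBuckets];

    for (long hash : shingleHashes) {
      int bucket = (int) Long.divideUnsigned(hash, bucketWidth);
      if (!seen[bucket] || Long.compareUnsigned(hash, mins[bucket]) < 0) {
        mins[bucket] = hash;
        seen[bucket] = true;
      }
    }

    for (int i = 0; i < numBuckets; i++) {
      if (seen[i]) {
        continue;
      }
      for (int j = 1; j < numBuckets; j++) { // rotate right to the next non-empty bucket
        int k = (i + j) % numBuckets;
        if (seen[k]) {
          mins[i] = mins[k];
          break;
        }
      }
    }
    return mins;
  }
}
```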

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

I have attached an updated patch.

This addresses the following issues:

  1. Support for a single hash, split into ranges with a minimum kept for each range
  2. Removed end-to-end tests and beefed up the unit tests
  3. Removed Guava in favour of Yonik's MurmurHash implementation (some duplication here with Solr)
  4. Fixed the alignment and "evil" test case issues
  5. TestFactories passes > 200 times (some Japanese number tokenisation failures remain)
  6. Fixed formatting

There were issues applying patch 4 on its own or on top of the previous patch. I believe I have included everything other than the IDE-related bits.

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

Thanks a lot Andy, it generally looks much better. I will have a look at the failures from point 5 (and eventually add the IDE fixes back).

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

Minor fixes to the last patch; apart from that I cannot see any error, even with the Japanese locale explicitly set:

ant clean test -Dtests.multiplier=1000000 -Dtestcase=MinHashFilterTest -Dtests.locale=ja

Andy Hind, do you have the seed of a failed execution?

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter that was failing intermittently, I think on one of the evil test cases.

LongPair should implement equals (and probably hashCode, if it will be reused) as it goes into a TreeSet. An oversight on my part.

FWIW, as far as I can tell, the change in patch 6 was included in 5.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch refs/heads/master from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244

LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

I've committed this, thanks to Andy Hind and @CaoManhDat for your patches!

asfimport commented 8 years ago

Andy Hind (migrated from JIRA)

Hi Tommaso - are you planning to merge this to 6.x?

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

Yes, I plan to merge it to 6.x. I wanted to have a few more runs on Jenkins before merging it back, to make sure there are no token-filtering-level issues.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch refs/heads/apiv2 from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244

LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 96a5595fffd4a4a1c553a987382697cb8a92b354 in lucene-solr's branch refs/heads/branch_6x from @tteofili https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=96a5595

LUCENE-6968 - LSH Filter backported to branch 6.x

asfimport commented 8 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

I've backported this to branch 6.x for inclusion in next 6.2 release.

asfimport commented 8 years ago

Varun Thacker (@vthacker) (migrated from JIRA)

Hi Tommaso,

I think we need to fix the CHANGES entry and move it to the 6.2 section; it's under the 7.0 section currently.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Bulk close resolved issues after 6.2.0 release.

asfimport commented 7 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Is there a JIRA issue yet to expose (and test) this and MinHash in Solr?

asfimport commented 7 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

@yonik the MinHash filter can already be used in Solr just like any other TokenFilter, via its MinHashFilterFactory. Regarding the test: there used to be one testing its end-to-end usage in one of the first patches, but it was decided to keep only the BaseTokenStreamTestCase ones, to check that the filtering works correctly. Perhaps we can put the old usage test in the context of Solr and add it within the Solr tests.

asfimport commented 7 years ago

Andy Hind (migrated from JIRA)

There are probably some more bits required to integrate MinHash with Solr, but it depends on the use case. A Solr-specific ticket would make sense.

Finding likely duplicates/clustering over the whole corpus, LSH style, I have not thought about.

"Near duplicates of a document" would make sense and overlaps a bit with MLT. First this requires storing the fingerprint, recovering it in a distributed way and building a query. It probably makes sense to support a big OR constant score query, a big OR constant score query with a minimum match, and a banded query that reflects the usual LSH stuff. An example was in the unit tests that was removed (it needed a bit of refinement to avoid a small bucket). Banding groups random fingerprints into AND queries and ORs these together.

I can provide the query logic.
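To give a feel for the shape of it, here is a rough sketch (illustrative only; the field name, the string form of the hashes and the band size are assumptions, not the committed API). Each band ANDs a few fingerprint hashes together with a constant score and the bands are ORed, so a document matches if it agrees with the source on at least one full band; the simpler big-OR variant is included as well:

```java
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MinHashQuerySketch {

  // Banded LSH query over a MinHash token field: hashes are grouped into bands
  // of rowsPerBand, each band is an AND, and the bands are ORed together.
  static Query bandedQuery(String field, List<String> fingerprintHashes, int rowsPerBand) {
    BooleanQuery.Builder bands = new BooleanQuery.Builder();
    for (int start = 0; start < fingerprintHashes.size(); start += rowsPerBand) {
      int end = Math.min(start + rowsPerBand, fingerprintHashes.size());
      BooleanQuery.Builder band = new BooleanQuery.Builder();
      for (String hash : fingerprintHashes.subList(start, end)) {
        band.add(new TermQuery(new Term(field, hash)), BooleanClause.Occur.MUST);
      }
      // Constant score: we only care whether the whole band matches, not tf/idf.
      bands.add(new ConstantScoreQuery(band.build()), BooleanClause.Occur.SHOULD);
    }
    return bands.build();
  }

  // The big OR variant: one constant-score SHOULD clause per hash, optionally
  // with a minimum-should-match to approximate a similarity threshold.
  static Query orQuery(String field, List<String> fingerprintHashes, int minimumShouldMatch) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String hash : fingerprintHashes) {
      builder.add(new ConstantScoreQuery(new TermQuery(new Term(field, hash))),
          BooleanClause.Occur.SHOULD);
    }
    builder.setMinimumNumberShouldMatch(minimumShouldMatch);
    return builder.build();
  }
}
```

Note the sketch keeps the last band even when it is shorter, which also covers the uneven-banding case mentioned earlier.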

FYI, adding a 500-term fingerprint increased the index size by ~10%.

asfimport commented 5 years ago

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Andy Hind, hello! I have several questions about the implementation of the MinHashFilter and was wondering if you would be able to answer them. Thanks a lot in advance.

The implementation from the 1st original patch, where the minimum set is kept, is very clear to me and follows the classic idea of constructing a MinHash signature and doing LSH search after it. But I am having a hard time understanding the final implementation of MinHashFilter.

1) What constitutes the signature of a document? Is it all the values stored in the hash table? Doesn't that make the signature too large? Can you please point me to the paper that describes this way of constructing MinHash signatures?

2) What is the use of the withRotation parameter? What is the advantage of using withRotation=true? In the paper you cited, http://www.auai.org/uai2014/proceedings/individuals/225.pdf, they fill empty bins with the "value of the closest non-empty bin in the clockwise direction (circular right hand side) added with offset C". In the MinHashFilter implementation, values for empty buckets are just blindly copied from non-empty ones, so a lot of buckets will have the same value.

Hopefully the questions make sense. Thanks again in advance.

asfimport commented 5 years ago

Andy Hind (migrated from JIRA)

@mayya-sharipova Hi Mayya, there is a good review paper here: https://arxiv.org/pdf/1408.2927.pdf. See sections 3.5.1 and 3.5.2 and the related references. I have not found the specific comment about bias I was trying to locate.

The hand-waving view is that empty or missing hashes are biased for many-to-many comparisons. It is difficult to tune the hash parameters for a wide mix of doc sizes, and small documents in particular, as the number of hashes increases with doc size over some range. It is better to have some value rather than none. There is an argument about what value should be used, but that is less important. Repetition is one way of filling in the gaps and making the hash count consistent. For two small docs there is going to be a bit of asymmetry in the measure whatever you do. In some cases, like containment, the bias may be a good thing :)

Apologies for my slow response.

asfimport commented 5 years ago

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Andy Hind, thanks very much for your answer, it made things clearer. I still have a couple of additional questions, if you don't mind:

1) The default settings of the filter will produce, for each document, 512 tokens of 16 bytes each, which is approximately 8KB. Isn't 8KB too big a size for a document's signature?

2) What is the way to combine min_hash tokens into a query for similarity search? Do you have any examples? Is this work in progress?

Thanks again!

asfimport commented 5 years ago

Andy Hind (migrated from JIRA)

@mayya-sharipova, in answer to your questions:

1) It depends on your view; a lower default would probably make sense. The default of 512 is good for 3-10+ pages of A4 text, again with an "it depends" caveat.

2) Query time is covered here https://issues.apache.org/jira/browse/SOLR-12879

You can probably get away with the default query-time tokenisation in Solr, post BM25.