Open asfimport opened 9 years ago
Adrien Grand (@jpountz) (migrated from JIRA)
Hmm, actually it looks to me like a positive value is not necessary: the only thing we do with the result of the hash is AND it with the bloom size, which would work fine with a negative number too.
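A minimal sketch of why the AND works for negative inputs, assuming the bloom size is a non-negative mask of the form 2^n - 1 (that assumption, and the class and method names, are illustrative and not copied from FuzzySet):

```java
class MaskIndex {
    // Hypothetical helper: map any 32-bit hash to a bit position in [0, mask].
    // Because mask is non-negative, its sign bit is 0, so the result of the
    // AND can never be negative, whatever the sign of hash.
    static int bitIndex(int hash, int mask) {
        return hash & mask;
    }

    public static void main(String[] args) {
        int mask = (1 << 10) - 1; // 1023, an assumed bloom size
        System.out.println(bitIndex(Integer.MIN_VALUE, mask)); // 0
        System.out.println(bitIndex(-1, mask));                // 1023
        System.out.println(bitIndex(12345, mask));             // 57
    }
}
```

If the bloom size really is such a mask, the sign of the hash never reaches the bit index, so no separate "make it positive" step would be needed.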
Robert Tarrall (@tarrall) (migrated from JIRA)
After sleeping on it... if we need a positive value, "hash = hash & Integer.MAX_VALUE" would be the correct way to force it positive, rather than using a magic number.
That said, yeah, I'm not able to follow the logic well enough to know whether it needs to be positive, or even an integer. Overall it seems like using all 32 bits available from the hashing function would be a win.
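For reference, a small sketch of the masking approach. The class and method names are hypothetical; only the `hash & Integer.MAX_VALUE` expression comes from the comment above.

```java
class ForcePositive {
    // Clear the sign bit. The result is always in [0, Integer.MAX_VALUE],
    // including for Integer.MIN_VALUE, which maps to 0.
    static int forcePositive(int hash) {
        return hash & Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(forcePositive(Integer.MIN_VALUE)); // 0
        System.out.println(forcePositive(-1));                // 2147483647
        System.out.println(forcePositive(42));                // 42
    }
}
```

Unlike negation, this has no overflow case. It does discard one bit of the hash, which is the cost mentioned above of not using all 32 bits.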
Reindexing some data in the DataStax Enterprise Search product (which uses Solr) led to these stack traces:
ERROR [Lucene Merge Thread #13430] 2015-09-08 11:14:36,582 CassandraDaemon.java (line 258) Exception in thread Thread[Lucene Merge Thread #13430,6,main]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.AssertionError
    at org.apache.lucene.codecs.bloom.FuzzySet.mayContainValue(FuzzySet.java:216)
    at org.apache.lucene.codecs.bloom.FuzzySet.contains(FuzzySet.java:165)
    at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer$BloomFilteredTermsEnum.seekExact(BloomFilteringPostingsFormat.java:351)
    at org.apache.lucene.index.BufferedUpdatesStream.applyTermDeletes(BufferedUpdatesStream.java:414)
    at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:283)
    at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3838)
    at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3799)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3651)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
In tracking down the cause of the stack trace, I noticed this: https://github.com/apache/lucene-solr/blob/trunk/lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java#L164
It is possible for the Murmur2 hash to return Integer.MIN_VALUE (e.g. when hashing "WeH44wlbCK"). Multiplying Integer.MIN_VALUE by -1 returns Integer.MIN_VALUE again, so the "positiveHash >= 0" assertion at line 217 fails.
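The overflow can be reproduced without the hash function at all. The sketch below assumes a "negate if negative" step like the one described; the class and method names are hypothetical and this is not the FuzzySet code itself.

```java
class MinValueNegation {
    // Mirrors the "* -1" pattern: flip the sign of negative hashes.
    // In two's complement, -Integer.MIN_VALUE does not fit in an int,
    // so the multiplication wraps around to Integer.MIN_VALUE again.
    static int negateIfNegative(int hash) {
        return hash < 0 ? hash * -1 : hash;
    }

    public static void main(String[] args) {
        int result = negateIfNegative(Integer.MIN_VALUE);
        System.out.println(result == Integer.MIN_VALUE); // true
        System.out.println(result >= 0);                 // false
        // Math.abs has the same corner case.
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0); // true
    }
}
```

Every other negative int negates cleanly, which is probably why this only shows up for the rare input that hashes to exactly Integer.MIN_VALUE.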
We could special-case Integer.MIN_VALUE, map it to 42 or some other magic number... since the same "* -1" logic appears on line 236 perhaps it should be part of the hash function?
Migrated from LUCENE-6788 by Robert Tarrall (@tarrall), updated May 09 2016