apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Segment-level Bloom filters [LUCENE-4069] #5141

Closed asfimport closed 12 years ago

asfimport commented 12 years ago

An addition to each segment which stores a Bloom filter for selected fields, in order to fast-fail term searches and avoid wasted disk access.

Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.
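
As a rough sketch of the fast-fail idea (hypothetical names - this is not the patch's actual API): each segment keeps an in-memory bitset of hashed terms, and a lookup only seeks the on-disk terms dictionary when the term's bit is set:

    // If the term's hash bit is unset, the term is definitely absent from
    // this segment and the disk-based terms dictionary is never consulted.
    boolean mightContain(FixedBitSet bloomBits, BytesRef term) {
      int hash = MurmurHash.hash(term.bytes, term.offset, term.length) & Integer.MAX_VALUE;
      return bloomBits.get(hash % bloomBits.length());
    }

A false answer is definitive (the segment can be skipped entirely); a true answer may be a false positive, so the normal terms-dictionary lookup still runs.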

Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on the 3.6 codebase is attached. There are no 3.6 API changes currently - to play, just add a field whose name ends in "_blm" to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to the APIs to configure the service properly.

Also attached: a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.


Migrated from LUCENE-4069 by Mark Harwood (@markharwood), 4 votes, resolved Aug 02 2012 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java (versions: 4), PrimaryKeyPerfTest40.java Linked issues:

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Initial patch

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I've ported this Bloom Filtering code to work as a 4.0 Codec now. I see a 35% improvement over standard Codecs on random lookups on a warmed index.

I also notice that the PulsingCodec is no longer faster than the standard Codec - is this news to people? I thought it was supposed to be the way forward.

My test rig (adapted from Mike's original primary key test rig here http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html) is attached as a zip. The new BloomFilteringCodec is also attached here as a patch.

Searches against plain text fields also look to be faster (using AOL500k queries searching Wikipedia English) but obviously that particular test rig is harder to include as an attachment here.

I can open a separate JIRA issue for this 4.0 version of the code if that makes more sense.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I see a 35% improvement over standard Codecs on random lookups on a warmed index.

Impressive! This is for primary key lookups?

It looks like the primary keys are GUID-like, right? (i.e. randomly generated). I wonder how the results would look if they had some structure instead (e.g. '%09d' % (id++))...

I also notice that the PulsingCodec is no longer faster than the standard Codec - is this news to people? I thought it was supposed to be the way forward.

That's baffling to me: it should only save seeks vs Lucene40 codec, so on a cold index you should see substantial gains, and on a warm index I'd still expect some gains. Not sure what's up...

I can open a separate JIRA issue for this 4.0 version of the code if that makes more sense.

I think it's fine to do it here? Really 3.6.x is only for bug fixes now ... so I think we should commit this to trunk.

I wonder if you can wrap any other PostingsFormat (i.e. instead of hard-coding Lucene40PostingsFormat)? This way users can wrap any PF they have with the bloom filter...

Can you use FixedBitSet instead of OpenBitSet? Or is there a reason to use OpenBitSet here...?

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

My current focus is speeding up primary-key lookups, but this may have applications outside of that (Zipf tells us there is a lot of low-frequency stuff in free text).

Following the principle that the best IO is no IO, the Bloom filter helps us quickly understand which segments are even worth looking in. That has to be a win overall.

I started trying to write this Codec as a wrapper for any other Codec (it simply listens to a stream of terms and stores a bitset of recorded hashes in a ".blm" file). However, that was trickier than I expected - I'd need to encode a special entry in my .blm files just to know the name of the delegated codec to instantiate at read-time, because Lucene's normal Codec-instantiation logic would be looking for "BloomCodec" and I'd have to discover the delegate that was used to write all of the non-.blm files.

Not looked at FixedBitSet but I imagine that could be used instead.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Update - I've discovered this Bloom Filter Codec currently has a bug where it doesn't handle indexes with >1 field. It's probably all tangled up in the "PerField..." codec logic, so I need to do some more digging.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Fixed the issue with >1 field in an index. Tests on random lookups on Wikipedia titles (unique keys) now show a 3x speedup for a Bloom-filtered index over the standard 4.0 Codec on fully warmed indexes.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

hey mark, this looks very interesting. I wonder if it also helps indexing in terms of applying deletes - did you test that at all, or is that something you are looking into as well? For deletes this could be very handy, since bloom filters should help a lot when a given key is likely to be in only one segment.

Regarding the code, I think BloomFilter should be like Pulsing and only be a PostingsFormat. You should make the BloomFilterPostingsFormat abstract and pass in the "wrapped" postings format to delegate to, independent of the field. PostingsFormat is per-field and is resolved by the codec. An actual implementation should subclass the abstract BloomFilterPostingsFormat; that way you have a fixed PostingsFormat per "delegate" and don't need to deal with "which field got which delegate" bla bla.

We should not have a BloomFilter codec but rather swap in the BloomFilter postings format in tests via RandomCodec. I also wonder if we can extract a "BloomFilter" class into utils that builds a Bloom filter - this could be useful elsewhere.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I wonder if it also helps indexing in terms of applying deletes - did you test that

I've not looked into that specifically, but I imagine it may be relevant.

Thanks for the tips for making the patch more generic. I'll get on it tomorrow and change to FixedBitSet while I'm at it.

I also wonder if we can extract a "BloomFilter" class into utils

There's some reusable stuff in this patch for downsizing the bitset (according to desired saturation levels) after accumulating a stream of values.
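
(Roughly, and only as a sketch of the idea with hypothetical names: assuming power-of-two sizes, a sparse bitset can be folded in half repeatedly until it approaches the desired saturation.)

    static FixedBitSet downsize(FixedBitSet bits, float maxSaturation) {
      // Folding halves the size; the folded set has at most the same number of
      // set bits, and hash % newLength still addresses valid positions.
      while (bits.length() > 64
          && (float) bits.cardinality() / (bits.length() / 2) <= maxSaturation) {
        FixedBitSet smaller = new FixedBitSet(bits.length() / 2);
        for (int i = 0; i < bits.length(); i++) {
          if (bits.get(i)) {
            smaller.set(i % smaller.length());
          }
        }
        bits = smaller;
      }
      return bits;
    }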

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Thanks for the tips for making the patch more generic. I'll get on it tomorrow and change to FixedBitSet while I'm at it.

YW! A couple of things regarding your patch. I think you should write a header for the bloom filter file using CodecUtil.writeHeader() to be consistent. Once you are ready you should also look at how the other *PostingsFormat impls document their file formats. We try to have versioned file formats instead of the monolithic one on the website. I saw some minor things like

finally {
  if (output != null)  {
    output.close();
  }
}

you can use IOUtils.close*() for that, which handles some exception cases etc.
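
For reference, the usual idiom looks roughly like this ('output' being the IndexOutput from the snippet above):

    boolean success = false;
    try {
      // ... write the bloom filter data to output ...
      success = true;
    } finally {
      if (success) {
        IOUtils.close(output); // propagate any exception thrown by close()
      } else {
        IOUtils.closeWhileHandlingException(output); // don't mask the original error
      }
    }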

I briefly looked at your reuse code for TermsEnum and it seems it has the same issue as Pulsing had (since solved). When you get a TermsEnum that is in fact a BloomFilter wrapper but wraps another codec underneath, you keep switching back and forth, creating new delegated TermsEnum instances. You can look at PulsingPostingsReader, and in particular at PulsingEnumAttributeImpl, to see how we solved that in Pulsing.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It looks like MurmurHash.java is missing from the patch?

Also, with #5127 landing it looks like the patch no longer compiles (sorry!)...

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Updated to work with trunk.

TODOs

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

hey mark,

I think you should really not provide any field handling at all. This should be done in user code, i.e. a general codec/postings format per field. What would be very useful here is integration into RandomCodec so we can randomly swap it in. The same is true for the hash algos & memory settings: I think they should be part of the abstract class ctor, and if a user wants to use their own algo they should write their own codec, i.e. subclass. This stuff is very expert and apps like Solr can handle it on top.

I am still worried about the TermsEnum reuse code; are you planning to look into this?

you should also add license headers to the files you are adding; note that we don't do javadoc license headers anymore, so they should just be block comments.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I think you should really not provide any field handling at all.

By that I think you mean keep the abstract BloomFilteringPostingsFormatBase and dispense with the BloomFilteringLucene40Codec (and BloomFilteredLucene40PostingsFormat?). I was trying to limit the extensions apps would have to write to use this service (1 custom postings format subclass, 1 custom Codec subclass and 1 custom SPI config file) but I can see that equally we shouldn't offer implementations for all the many different service permutations.

I'll look at adding something to RandomCodec for Bloom-plus-random delegate PostingsFormat.

I am still worried about the TermsEnum reuse code; are you planning to look into this?

you keep switching back and forth, creating new delegated TermsEnum instances

I'm not sure what you mean about creating new delegated TermsEnum instances? In my impl of "iterator(TermsEnum reuse)" I take care to unwrap my TermsEnum wrapper to find the original delegate's TermsEnum, and then call the delegate Terms' iterator method with that object as the reuse parameter. At that stage shouldn't the delegate Terms just recycle that unwrapped TermsEnum, as per the normal reuse contract when no wrapping has been done?
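
(In sketch form, that unwrapping looks something like this - the wrapper class names are hypothetical:)

    @Override
    public TermsEnum iterator(TermsEnum reuse) throws IOException {
      if (reuse instanceof BloomFilteredTermsEnum) {
        // unwrap so the delegate sees its own previous enum and can recycle it
        reuse = ((BloomFilteredTermsEnum) reuse).getDelegate();
      }
      return new BloomFilteredTermsEnum(delegateTerms.iterator(reuse));
    }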

you should also add license headers to the files you are adding

Will do.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I was trying to limit the extensions apps would have to write to use this service (1 custom postings format subclass, 1 custom Codec subclass and 1 custom SPI config file)

As far as Codec goes, we don't need to provide one for specialized per-field PostingsFormats.

Users can just use an anonymous Lucene40Codec, which already has a hook for doing stuff like this (a protected method you can override to change the postings format per field).

example that sets SimpleText on just the "id" field:

    indexWriterConfig.setCodec(new Lucene40Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("id".equals(field)) {
          return new SimpleTextPostingsFormat();
        } else {
          return super.getPostingsFormatForField(field);
        }
      }
    });

this only needs to be set on the IndexWriterConfig for writing: PerFieldPostingsFormat records for that id field that it was encoded with "SimpleText"

when the segment is opened, assuming "SimpleText" is in the classpath and registered, it will just work.

from the Solr side, they don't need to set any codec either, as its schema already uses this hook, so there it's just:

<fieldType name="string_simpletext" class="solr.StrField" postingsFormat="SimpleText"/>

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Thanks for the comment, Rob. While the choice of Codec can be an anonymous inner class, resolving the choice of PostingsFormat is trickier. BloomFilterPostingsFormat is now intended to wrap any other choice of PostingsFormat, and Simon has suggested leaving the Bloom support purely abstract. However, as an end user, if I want to use Bloom support on the standard Lucene codec I would then have to write one of these:

public class MyBloomFilteredLucene40Postings extends BloomFilteringPostingsFormatBase {

  public MyBloomFilteredLucene40Postings() {
    //declare my choice of PostingsFormat to be wrapped and provide a unique name for this combo of Bloom-plus-delegate
    super("myBL40", new Lucene40PostingsFormat());
  }
}

The resulting index files are then named [segname]_myBL40.[filetypesuffix]. At read-time the "myBL40" part of the filename is used to look up the decoding class via Service Provider registrations, so "com.xx.MyBloomFilteredLucene40Postings" would need adding to an o.a.l.codecs.PostingsFormat file for the registration to work.
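
(That registration is the standard Java SPI mechanism, i.e. a resource file along these lines:)

    # META-INF/services/org.apache.lucene.codecs.PostingsFormat
    com.xx.MyBloomFilteredLucene40Postings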

I imagine Bloom-plus-Lucene40Postings would be a common combo and if both are in core it would be annoying to have to code support for this in each app or for things like Luke to have to have classpaths redefined to access the app-specific class that was created purely to bind this combo of core components.

I think a better option might be to change the Bloom filtering base class to record the choice of delegate PostingsFormat in its own ".blm" file at write-time, and instantiate the appropriate delegate instance at read-time using the recorded name. The BloomFiltering base class would need changing to a final class rather than an abstract one, so that core Lucene could load it as the handler for [segname]_BloomPosting.xxx files; it would then examine the [segname].blm file to discover and instantiate the choice of delegate PostingsFormat using the standard service registration mechanism. At write-time, clients would instantiate the BloomFilterPostingsFormat, passing a choice of PostingsFormat delegate to the constructor. At read-time, Lucene core would invoke a zero-arg constructor. I'll look into this as an approach.
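
(In sketch form, under those assumptions - the variable names are hypothetical:)

    // Write-time: record the delegate's SPI name inside the .blm file
    blmOutput.writeString(delegate.getName());

    // Read-time: the zero-arg-constructed instance resolves the delegate by name
    PostingsFormat delegate = PostingsFormat.forName(blmInput.readString());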

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Aaaargh. Unless I've missed something, I have concerns about the fundamental design of the current Codec loading mechanism.

It seems too tied to the ServiceProvider class-loading mechanism, forcing users to write new SPI-registered classes simply to declare what amount to index schema configuration choices.

Example: if I take Rob's sample Codec above and choose to use a subtly different configuration of the same PostingsFormat class for different fields, it breaks:

      Codec fooCodec = new Lucene40Codec() {
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
          if ("text".equals(field)) {
            return new FooPostingsFormat(1);
          }
          if ("title".equals(field)) {
            // same impl as "text" field, different constructor settings
            return new FooPostingsFormat(2);
          }
          return super.getPostingsFormatForField(field);
        }
      };

This causes a file overwrite error, because PerFieldPostingsFormat uses the same name from FooPostingsFormat(1) and FooPostingsFormat(2) to create files. In order to safely make use of differently configured instances of the same PostingsFormat, we are forced to declare a brand new subclass with a unique new service name and an entry in the service provider registration. This is essentially where I have got to in trying to integrate this Bloom filtering logic.

This dependency on writing custom classes seems to make everything a bit fragile, no? What hope has Luke got of opening the average index without careful assembly of classpaths etc.? If I contrast this with the world of database schemas, it seems absurd to rely on writing behaviour-free custom classes simply to preserve an application's schema settings. Even an IoC container with XML declarations would offer a more agile means of assembling pre-configured beans than a Service Provider mechanism that is only serving as a registry of classes.

Anyone else see this as a major pain?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure this is true: e.g. if your postings format requires parameters to decode the segment, then this design enforces that it records said parameters - e.g. Pulsing records these parameters.

Codec parameters are an index-time thing; at read-time it's your responsibility to be able to decode the segment solely from the index (this enforces that there doesn't need to be a crazy matching of user configuration at write and read time).

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

This fails if you add docs with title and text fields:

      Codec fooCodec = new Lucene40Codec() {
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
          if ("text".equals(field)) {
            return new MemoryPostingsFormat(false);
          }
          if ("title".equals(field)) {
            return new MemoryPostingsFormat(true);
          }
          return super.getPostingsFormatForField(field);
        }
      };

Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_2_Memory.ram

This also fails:

       Codec fooCodec = new Lucene40Codec() {
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
          if ("text".equals(field)) {
            return new SimpleTextPostingsFormat();
          }
          if ("title".equals(field)) {
            return new SimpleTextPostingsFormat();
          }
          return super.getPostingsFormatForField(field);
        }
      };

with Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_1_SimpleText.pst

Whereas sharing the same instance of a PostingsFormat class across fields works:

      Codec fooCodec = new Lucene40Codec() {
        SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
          if ("text".equals(field) || "title".equals(field)) {
            return theSimple;
          }
          return super.getPostingsFormatForField(field);
        }
      };

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

As far as your problem with parameters goes, this is actually just a limitation I created (by accident) in PerFieldPostingsFormat on #5127. Because we use the postings format name as the segment suffix, it's not possible to have e.g. an id field with Pulsing(1) and a date field with Pulsing(2).

I'll open an issue and fix this; again, it's just an issue with PerFieldPostingsFormat. You should be able to use the same name here with different configs, as long as your PostingsFormat writes whatever it needs to read its files. Thanks for bringing this up.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

When I run all tests with the bloom 4.0 postings format (ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat), I see a lot of failures, e.g. lots of CheckIndex failures like:

   [junit4]   1> java.lang.RuntimeException: vector term=[e4 b8 80 74 77 6f] field=textField2Utf8 does not exist in postings; doc=0
   [junit4]   1>    at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1440)
   [junit4]   1>    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:575)
   [junit4]   1>    at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:186)
   [junit4]   1>    at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:587)
   [junit4]   1>    at org.apache.lucene.index.TestFieldsReader.afterClass(TestFieldsReader.java:72)
   [junit4]   1>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]   1>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

But also other spooky failures. Are these known issues?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark: I opened #5162.. I broke it, so I'll fix it :)

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

it's just an issue with PerFieldPostingsFormat

OK, thanks. My guess is you'll effectively have to supplement postingsformat.getName() with an object-instance ID in file names.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Right, it's not bad. Separately, I'm not happy that it's a little trappy in situations like your ThisToo.java:

it's certainly easy and safe to continue to use an IdentityHashMap, but I think we will need to require equals/hashcode on postings formats.

Otherwise it's going to stay trappy in the future: e.g. we will want to add mechanisms to Solr so that you can specify these parameters in the schema.xml (today you cannot), and that would create massive amounts of files if we did so (add the fact that it disables compound files by default, and it's a disaster...)

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

When I run all tests with the bloom 4.0 postings format (ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat),

Thanks for the pointer on targeted codec testing. I have another patch to upload with various tweaks, e.g. a configurable choice of hash functions and RandomCodec additions, so I will concentrate testing on that before uploading.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

This is looking more promising.

Running "ant test-core -Dtests.postingsformat=TestBloomFilteredLucene40Postings" now passes all tests but causes OOM exception on 3 tests:

Any pointers on how to annotate or otherwise avoid the BloomFilter class for "many-field" tests would be welcome. These are not realistic tests for this class (we don't expect indexes with 100s of primary-key like fields).

In this patch I've

To use: BloomFilteringPostingsFormat now takes a delegate PostingsFormat and a set of field names that are to have Bloom filters created. Fields not listed in the filter set are indexed as normal, and doing so is beneficial because it allows filtered and non-filtered field data to co-exist in the same physical files created by the delegate PostingsFormat.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Added missing class

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think the abstract class should be registered in the SPI.

Instead I think the concrete Bloom+Lucene40 that you have in tests should be moved into src/java and registered there; just call it Bloom40 or something. The abstract API is still available for someone who wants to do something more specialized.

This is just like how pulsing (another wrapper) is implemented.

As far as disabling this for certain tests, import o.a.l.util.LuceneTestCase.SuppressCodecs and put something like this at class level:

@SuppressCodecs("Bloom40")
public class TestFoo...

@SuppressCodecs({"Bloom40", "Memory"})
public class TestBar...

The strings in here can be codecs or postings formats.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Seeing the tests in question though, I don't think you want to disable this for those entire test classes.

We don't have a way to disable this on a per-method basis, and I think it's generally not possible because many classes create indexes in @BeforeClass etc.

An alternative would be to just pick this less often in RandomCodec: see the SimpleText hack :)

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Instead I think the concrete Bloom+Lucene40 that you have in tests should be moved into src/java and registered there

What problem would that be trying to solve? Registration (or creation) of BloomFilteringPostingsFormat subclasses is not necessary to decode index contents. Offering a "Bloom40" would only buy users a pairing of Lucene40Postings and Bloom filtering, and they would still have to declare which fields they want Bloom filtering on at write time. This isn't too hard using the code in the existing patch:

        final Set<String> bloomFilteredFields = new HashSet<String>();
        bloomFilteredFields.add(PRIMARY_KEY_FIELD_NAME);

        iwc.setCodec(new Lucene40Codec() {
          BloomFilteringPostingsFormat postingOptions = new BloomFilteringPostingsFormat(new Lucene40PostingsFormat(), bloomFilteredFields);
          @Override
          public PostingsFormat getPostingsFormatForField(String field) {
            return postingOptions;
          }
        });

No extra subclasses/registration required here to read the index built with the above setup.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't understand why this handles fields. Someone should just pick that with PerFieldPostingsFormat.

So you have the abstract wrapper (takes the wrapped postings format and a String name), not registered. And you have a concrete impl registered that is just abstractWrapper(lucene40, "Bloom40"): done.
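
(As a sketch of that shape - the base-class name follows the patch, the rest is hypothetical:)

    // Concrete, SPI-registered format binding the abstract wrapper to Lucene40
    public final class Bloom40PostingsFormat extends BloomFilteringPostingsFormatBase {
      public Bloom40PostingsFormat() {
        super("Bloom40", new Lucene40PostingsFormat());
      }
    }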

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

An alternative would be to just pick this less often in RandomCodec: see the SimpleText hack

Another option might be to make TestBloomFilteredLucene40Postings pick a ludicrously small bitset sizing option for each field, so that we can accommodate tests that create silly numbers of fields. The bitsets, being so small, will quickly reach saturation and simply force all reads through to the underlying FieldsProducer.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I don't understand why this handles fields. Someone should just pick that with PerFieldPostingsFormat.

That would be inefficient: your PFPF will see BloomFilteringPostingsFormat(field1 + Lucene40) and BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different PostingsFormat instances and consequently create multiple differently named files, because it assumes these instances may use radically different file structures. In reality, the choice of BloomFilter on field 1, BloomFilter on field 2, or indeed no BloomFilter does not fundamentally alter the underlying delegate PostingsFormat's file format - it only adds a supplementary ".blm" file on the side with the field summaries. For this reason it is a mistake to configure separate BloomFilterPostingsFormat instances on a per-field basis when they could share a common delegate.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That would be inefficient: your PFPF will see BloomFilteringPostingsFormat(field1 + Lucene40) and BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different PostingsFormat instances and consequently create multiple differently named files, because it assumes these instances may use radically different file structures.

But adding per-field handling here is not the way to solve this: it's messy.

Per-field handling should all be done at the level above, in PerFieldPostingsFormat.

To solve what you speak of we just need to resolve #5165. Then multiple postings format instances that are 'the same' will be deduplicated correctly.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

To solve what you speak of we just need to resolve #5165.

Presumably the main objective here is that, to cut down on the number of files we store, content consumers of various types should aim to consolidate multiple fields' contents into a single file (if they share common config choices).

Then multiple postings format instances that are 'the same' will be deduplicated correctly.

The complication in this case is that we essentially have two consumers (Bloom and Lucene40), one wrapped in the other, with different but overlapping sets of fields: e.g. we want a single Lucene40 to process all fields but we want Bloom to handle only a subset of them. This will be a tough one for PFPF to untangle while we are stuck with a delegating model for composing consumers.

This may be made easier if, instead of delegating a single stream, we have a stream-splitting capability via multicast subscription: e.g. the Bloom filtering consumer registers interest in content streams for fields A and B, while Lucene40 consolidates content from fields A, B, C and D. A broadcast mechanism feeds each consumer a copy of the relevant stream, and each consumer is responsible for inventing its own file-naming convention that avoids muddling files.

While that may help for writing streams, it doesn't solve the re-assembly of "producer" streams at read-time, where BloomFilter absolutely has to position itself in front of the standard Lucene40 producer in order to offer fast-fail lookups.

In the absence of a fancy optimised routing mechanism (this all may be overkill), my current solution was to put BloomFilter in the delegate chain, armed with a subset of field names to observe as a larger set of fields flows past to a common delegate. I added some Javadocs to describe the need to do it this way for an efficient configuration. You are right that this is messy (i.e. open to bad configuration), but operating this deep down in Lucene that's always a possibility regardless of what we put in place.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

but overlapping sets of fields: e.g. we want a single Lucene40 to process all fields but we want Bloom to handle only a subset of them.

That's not true: I disagree. It's an implementation detail that Bloom as a postings format wraps another one (that's just the abstract implementation), and the file formats should not expose this in general, for any format.

This is true for a number of reasons: e.g. in the Pulsing case the wrapped writer only gets a subset of the postings, therefore the wrapped writer's files are "incomplete" and an implementation detail.

It's enough here that if you have 5 fields, 2 bloomed and 3 not, we detect that there are only two postings formats in use, regardless of whether you have 2 or 5 actual object instances.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

And separately, you can always contain the number of files even today by using only unique instances yourself when writing.

The purpose of #5165 is just to make this simpler, but I opened it as a separate issue because it's really solely an optimization, and only for a pretty rare case where people are customizing the index format for different fields.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

It's true to say that Bloom is a different case to Pulsing - Bloom does not interfere in any way with the normal recording of content in the wrapped delegate, whereas Pulsing does. It may prove useful for us to mark a formal distinction between these mutating/non-mutating types so we can treat them differently and provide optimisations?

And separately, you can always contain the number of files even today by using only unique instances yourself when writing

Contained but not optimal - roughly double the number of required files if I want the common case of a primary key indexed with Bloom. I can't see a way of indexing with Bloom-plus-Lucene40 on field "A" and with just Lucene40 on fields B, C and D while winding up with only one set of Lucene40 files with a common segment suffix. The way I did find of achieving this was to add a "bloomFilteredFields" set to my single Bloom+Lucene40 instance used for all fields. Is there any other option here currently?

Looking to the future, 4093 may offer more optimisation opportunities if it understands the distinction between mutating and non-mutating wrappers and how they are composed?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Contained but not optimal - roughly double the number of required files if I want the common case of a primary key indexed with Bloom.

Then use CFS - it's always optimal (1 file).

I really don't think we should make this complex to save 2 or 3 files total (even in a complex config with many fields). It's not worth the complexity.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

It's not worth the complexity

There's no real added complexity in BloomFilterPostingsFormat - it has to be capable of storing blooms for >1 field anyway, and using the fieldname set is roughly 2 extra lines of code to see whether a TermsConsumer needs wrapping or not.

From the client side you don't have to use this feature - the fieldname set can be null, in which case it will wrap all fields sent its way. If you do choose to supply a set, the wrapped PostingsFormat gains the advantage of being shared for bloomed and non-bloomed fields. We could add a constructor that removes the set and mark the others "expert". Both configurations are sketched below.
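
(A sketch of the two configurations, using the constructor from the patch; the "id" field name is just an example, and java.util.Collections is assumed:)

    // Wrap every field: no field subset supplied
    PostingsFormat bloomAll =
        new BloomFilteringPostingsFormat(new Lucene40PostingsFormat(), null);

    // Expert: bloom-filter only the primary key, sharing one Lucene40 delegate
    PostingsFormat bloomPkOnly = new BloomFilteringPostingsFormat(
        new Lucene40PostingsFormat(), Collections.singleton("id"));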

For me this falls into one of the many faster-if-you-know-about-it optimisations, like FieldSelectors or recycling certain objects: basically a useful hint to Lucene to save some extra effort, but one which you don't need to use.

Lucene-4093 may in future resolve the multi-file issue but I'm not sure it will do so without significant complication.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I really don't think we should make this complex to save 2 or 3 files total (even in a complex config with many fields). It's not worth the complexity.

I agree. I think these postings formats should only deal with encoding, not with handling certain fields differently. A user/app should handle this in the codec. Ideally you don't have any conditions in the relevant methods like termsConsumer etc.

For me this falls into one of the many faster-if-you-know-about-it optimisations, like FieldSelectors or recycling certain objects: basically a useful hint to Lucene to save some extra effort, but one which you don't need to use.

Why is this a speed improvement? Reading from one file vs. multiple is not really faster.

Anyway, I think we should make this patch as simple as possible and not handle fields in the PF. We can still open another issue, or wait until #5165 is in, to discuss this?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

For me this falls into one of the many faster-if-you-know-about-it optimisations, like FieldSelectors or recycling certain objects: basically a useful hint to Lucene to save some extra effort, but one which you don't need to use.

I agree with Simon: it's not going to be faster.

Worse, it creates a situation, from the per-field perspective, where multiple postings formats share the same files for a segment.

This would make it harder to do things like refactoring the codec APIs in the future.

So where is the benefit?

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Why is this a speed improvement?

Sorry - misleading. Replace the word "faster" in my comment with "better" and it makes more sense - I mean better in terms of resource usage and reduced open file handles. This seemed relevant given the earlier comments about Solr's use of non-compound files:

[Solr] would create massive amounts of files if we did so (add the fact that it disables compound files by default, and it's a disaster...)

I can see there is a useful simplification being sought here: PerFieldPF can consider each unique top-level PF presented to it as looking after an exclusive set of files. As the centralised allocator of file names, it can then simply call each unique PF with a choice of segment suffix to name its various files without conflicting with other PFs. Lucene-4093 is all about better determining which PFs are unique, using .equals(). Unfortunately I don't think this approach is sufficient. In order to avoid allocating all the unnecessary file names, PerFieldPF would have to further understand the nuances of which PFs were being wrapped by other PFs, and which wrapped PFs would be reusable outside of their wrapper (as is the case with BloomPF's wrapped PF). That seems a more complex task than implementing equals().

So it seems we have 3 options:

1) Ignore the problems of creating too many files in the case of BloomPF and any other examples of "wrapping" PFs
2) Create a PerFieldPF implementation that reuses wrapped PFs using some generic means of discovering recyclable wrapped PFs (i.e. go further than what 4093 currently proposes in adding .equals support)
3) Retain my BloomPF-specific solution to the problem for those prepared to use lower-level APIs

Am I missing any other options and which one do you want to go for?

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

This seemed relevant given the earlier comments about Solr's use of non-compound files:

We can't make "wrong" decisions just because higher level apps make "wrong" decisions. The dependency goes Solr -> Lucene not the other way around. We provide fine grained control when to use CFS ie for smallish segments etc. If you have hundreds of fields all using different PF etc. you have to deal with tons of files but that is to be honest not very likely to be the common case.

Create a PerFieldPF implementation that reuses wrapped PFs using some generic means of discovering recyclable wrapped PFs (i.e. go further than what 4093 currently proposes in adding .equals support)

I think we should investigate this further. Let's keep this issue simple, remove the field handling, and fix this at a higher level!

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I've thought some more about option 2 (PerFieldPF reusing wrapped PFs) and it gets very ugly very quickly. There's only so much PerFieldPF can do to rationalize a random jumble of PF instances presented to it by clients. I think the right place to draw the line is Lucene-4093, i.e. a simple .equals() comparison on top-level PFs to eliminate any duplicates. Any approach that also tries to de-dup nested PFs looks to add a lot of complexity, especially when you consider what it does to the model of read-time object instantiation. That would be significant added complexity to solve a problem you have already suggested is insignificant (i.e. too many files doesn't really matter when using CFS).

I can remove the per-field stuff from BloomPF if you want, but I imagine I will routinely subclass it to add this optimisation back into my apps.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Updated as follows:

Will follow up with benchmarks when I have more time to run and document them. Initial results from my large-scale tests show a nice flat line for lookup times as the index grows, whereas a non-Bloomed index saw-tooths upwards as segments grow and merge.

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Benchmark tool adapted from Mike's original Pulsing codec benchmark. Now includes Bloom postings example.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I ran a benchmark on a 10M-document Wikipedia index; for the factory I used createSetBasedOnMemory and passed it 100 MB; I think that's enough to ensure we get the 10% saturation on save...:

                Task    QPS base StdDev base  QPS bloom StdDev bloom     Pct diff
              Fuzzy1      102.47        3.67       41.95        0.78  -61% -  -56%
              Fuzzy2       38.36        1.76       18.68        0.37  -54% -  -47%
             Respell       89.89        4.38       44.09        0.52  -53% -  -47%
            Wildcard       40.48        2.82       36.20        0.64  -17% -   -2%
        SloppyPhrase        7.96        0.28        8.07        0.07   -3% -    5%
             Prefix3       61.94        5.34       63.35        0.37   -6% -   12%
        TermBGroup1M       71.37        6.79       73.73        1.55   -7% -   16%
          AndHighMed       64.09        5.51       66.73        1.75   -6% -   16%
      TermBGroup1M1P       49.55        3.78       51.75        2.67   -7% -   18%
         AndHighHigh       16.05        1.12       16.77        0.53   -5% -   15%
         TermGroup1M       35.87        3.07       37.56        0.74   -5% -   16%
          OrHighHigh        9.60        1.38       10.15        0.65  -13% -   31%
           OrHighMed       11.93        1.91       12.63        0.93  -15% -   35%
              IntNRQ        9.12        1.25        9.68        0.11   -7% -   24%
                Term      154.55       19.60      165.32        0.97   -5% -   23%
              Phrase       11.40        0.33       12.21        0.18    2% -   11%
            SpanNear        4.31        0.07        4.73        0.03    7% -   12%
            PKLookup      122.78        1.42      145.95        5.22   13% -   24%

Baseline is Lucene40 PostingsFormat even for the id field ... so PKLookup gets a good improvement. This is on an index w/ 5 segments at each level.

Other queries seem to speed up as well (eg Term, Or*).

The queries that rely on Terms.intersect got much worse: maybe BloomFilteredFieldsProducer should just pass intersect through to the delegate?

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Interesting results, Mike - thanks for taking the time to run them.

maybe BloomFilteredFieldsProducer should just pass intersect through to the delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may call TermsEnum.seekExact(..) on a filtered field, I need to keep a wrapper object in place that can intercept that call. In all other method invocations I just end up delegating calls, so I wonder if all these extra method calls are the cause of the slowdown you see, e.g. when Fuzzy is enumerating over many terms. The only other alternatives to endlessly wrapping in this way are: a) an API change, e.g. allowing TermsEnum.seekExact to have a pluggable call-out for just this one method, or b) messing around with byte-code manipulation techniques to weave in Bloom filtering (the sort of thing I recall Hibernate resorts to).

Neither of these seems a particularly appealing option, so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy.

For completeness' sake - I don't have access to your benchmarking code, but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment, as that's where the Bloom filters get loaded from disk, so there's inherent cost there too. I can't imagine this is the case.

BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I see a 3.5x speedup in Lucene 4, and Lucene 3.6 gets nearly a 9x speedup over the comparatively slower 3.6 codebase.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Interesting results, Mike - thanks for taking the time to run them.

You're welcome!

maybe BloomFilteredFieldsProducer should just pass intersect through to the delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may call TermsEnum.seekExact(..) on a filtered field, I need to keep a wrapper object in place that can intercept that call. In all other method invocations I just end up delegating calls, so I wonder if all these extra method calls are the cause of the slowdown you see, e.g. when Fuzzy is enumerating over many terms. The only other alternatives to endlessly wrapping in this way are: a) an API change, e.g. allowing TermsEnum.seekExact to have a pluggable call-out for just this one method, or b) messing around with byte-code manipulation techniques to weave in Bloom filtering (the sort of thing I recall Hibernate resorts to).

Neither of these seems a particularly appealing option, so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy.

I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms. I think you should override it and immediately delegate, and then FuzzyN/Respell performance should be just as good as the Lucene40 codec.
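
(In sketch form, that override is just a straight delegation:)

    @Override
    public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throws IOException {
      // automaton-driven enumerations (FuzzyQuery, respell) gain nothing from
      // the bloom filter, so hand them straight to the wrapped Terms
      return delegateTerms.intersect(compiled, startTerm);
    }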

For completeness' sake - I don't have access to your benchmarking code

All the benchmarking code is here: http://code.google.com/a/apache-extras.org/p/luceneutil/

I run it nightly (trunk) and publish the results here: http://people.apache.org/~mikemccand/lucenebench/

but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment, as that's where the Bloom filters get loaded from disk, so there's inherent cost there too. I can't imagine this is the case.

It's only called once on init'ing the SegmentReader (or at least it better be!).

BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I see a 3.5x speedup in Lucene 4, and Lucene 3.6 gets nearly a 9x speedup over the comparatively slower 3.6 codebase.

Nice results! It looks like bloom(3.6) is faster than bloom(4.0)? Why is that...

Also I wonder why you see such sizable (3.5x) gains on PK lookup, but in my benchmark I see only ~13%-24%. My index has 5 segments per level...

asfimport commented 12 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms

Good catch - a quick test indeed shows a speedup on fuzzy queries. I'll prepare a new patch.

I'm not sure why 3.6+Bloom is faster than 4.0+Bloom in my tests. I'll take a closer look at your benchmark.