apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

FieldCache should include a BitSet for matching docs [LUCENE-2649] #3723

Closed asfimport closed 13 years ago

asfimport commented 13 years ago

The FieldCache returns an array representing the values for each doc. However there is no way to know if the doc actually has a value.

This should be changed to return an object representing the values and a BitSet for all valid docs.
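To make the ambiguity concrete, here is a minimal sketch in plain Java. It is only an illustration: `IntValuesSketch` and `fromDocs` are made-up names, and `java.util.BitSet` stands in for whatever bit set implementation a patch would actually use.

```java
import java.util.BitSet;

/** Hypothetical holder: cached values plus one bit per document that has a value. */
final class IntValuesSketch {
    final int[] values;  // values[doc] stays 0 for documents without a value
    final BitSet valid;  // valid.get(doc) is true only if the document really has a value

    IntValuesSketch(int[] values, BitSet valid) {
        this.values = values;
        this.valid = valid;
    }

    /** Builds the holder from per-document values, where null means "no value". */
    static IntValuesSketch fromDocs(Integer[] docs) {
        int[] values = new int[docs.length];
        BitSet valid = new BitSet(docs.length);
        for (int doc = 0; doc < docs.length; doc++) {
            if (docs[doc] != null) {
                values[doc] = docs[doc];
                valid.set(doc);
            }
        }
        return new IntValuesSketch(values, valid);
    }

    boolean hasValue(int doc) {
        return valid.get(doc);
    }
}
```

For input `{5, null, 0}`, doc 1 (no value) and doc 2 (a real 0) both read 0 from `values`; only `valid` tells them apart.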


Migrated from LUCENE-2649 by Ryan McKinley (@ryantxu), resolved Mar 25 2011. Attachments: LUCENE-2649-FieldCacheWithBitSet.patch (versions: 6)

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

See some discussion here: http://search.lucidimagination.com/search/document/b6a531f7b73621f1/trie_fields_and_sortmissinglast

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

This patch replaces the cached primitive[] with a CachedObject. The object hierarchy looks like this:

  public abstract static class CachedObject {
  }

  public abstract static class CachedArray extends CachedObject {
    public final Bits valid;
    public CachedArray( Bits valid ) {
      this.valid = valid;
    }
  };

  public static final class ByteValues extends CachedArray {
    public final byte[] values;
    public ByteValues( byte[] values, Bits valid ) {
      super( valid );
      this.values = values;
    }
  };
  ...

Then this deprecates the getBytes() methods and replaces them with getByteValues():

  public ByteValues getByteValues(IndexReader reader, String field)
  throws IOException;

  public ByteValues getByteValues(IndexReader reader, String field, ByteParser parser)
  throws IOException;

then repeat for all the other types!

All tests pass with this patch, but I have not added any tests for the BitSet (yet)

If people like the general look of this approach, I will clean it up and add some tests, javadoc cleanup etc

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

A slightly simplified version

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

That looks exactly like I proposed it!

The only thing: For DocTerms the approach is not needed? You can check for null, so the Bits interface is not needed. As the OpenBitSets are created with the exact size and don't need to grow, you can use fastSet to speed up creation by doing no bounds checks.
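As an illustration of the fastSet point, here is a minimal sketch of a fixed-size bit set with a checked `set` and an unchecked `fastSet`. This is not Lucene's OpenBitSet; `FixedBitsSketch` is a made-up class, and the only assumption carried over is that the size is known up front, so the backing array never has to grow.

```java
/** Hypothetical fixed-size bit set; not Lucene's OpenBitSet. */
final class FixedBitsSketch {
    private final long[] words;
    private final int numBits;

    FixedBitsSketch(int numBits) {
        this.numBits = numBits;
        this.words = new long[(numBits + 63) >>> 6]; // one long per 64 bits, rounded up
    }

    /** Checked set: validates the index before touching the array. */
    void set(int index) {
        if (index < 0 || index >= numBits) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        fastSet(index);
    }

    /** Unchecked set: the caller guarantees 0 <= index < numBits. */
    void fastSet(int index) {
        words[index >>> 6] |= 1L << (index & 63);
    }

    boolean get(int index) {
        return (words[index >>> 6] & (1L << (index & 63))) != 0;
    }
}
```

When the set is allocated with `maxDoc` bits and only doc IDs below `maxDoc` are ever set, the unchecked path is safe and skips one comparison per document.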

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

When this is committed, we can also improve some Lucene parts: FieldCacheRangeFilter does not need to do extra deletion checks and can instead use the Bits interface to find missing/non-valued documents. Lucene's sorting Collectors can be improved to have a consistent behaviour for missing values (like Solr's sortMissingFirst/Last).

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Looks great!

Should we make it optional, whether the valid bitset should be computed? Many apps wouldn't need it, so it just ties up (admittedly smallish amounts of) RAM unnecessarily?

Lucene's sorting Collectors can be improved to have a consistent behaviour for missing values (like Solr's sortMissingFirst/Last).

+1

Shouldn't we pull Solr's sortMissingFirst/Last down into Lucene?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Should we make it optional, whether the valid bitset should be computed? Many apps wouldn't need it, so it just ties up (admittedly smallish amounts of) RAM unnecessarily?

+1 we can save that overhead and high level apps can enable it by default if needed.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Should we make it optional, whether the valid bitset should be computed?

The trick is how to implement that (unless you mean just set it to true/false for all fields at once). Putting a flag on the FieldCache.getXXX methods is insufficient. Only the application knows if some of its future uses of that field will require the bitset for matching docs, but it's Lucene that's often making the calls to the field cache.

Perhaps FieldCache.Parser was originally just too narrow in scope - it should have been a factory method for handling all decisions about creating and populating a field cache entry?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Perhaps FieldCache.Parser was originally just too narrow in scope - it should have been a factory method for handling all decisions about creating and populating a field cache entry?

I guess we need to be able to manually configure FieldCache with some kind of FieldType. There have been several issues mentioning this and it keeps coming up again and again. I think it is just time to rethink Fieldable / Field and move towards some kind of flexible type definition for Fields in Lucene. A FieldType could then have a FieldCache Attribute which contains all necessary info including the parser and flags like the one we are talking about. Yet, before I get too excited about FieldType, yeah something with a wider scope than FieldCache.Parser would work in this case. I don't know how far the FieldType is away but it can eventually replace whatever is going to be implemented here in regards to that flag.

I think by default we should not enable the Bits feature; it must be explicitly set via whatever mechanism we end up using.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I guess we need to be able to manually configure FieldCache with some kind of FieldType.

I don't know how well that would work. For one, there's only one FieldCache, so configuring it with anything seems problematic. Also, if I have to list out all the fields I'm going to use, that's another big step backwards.

A factory would be a pretty straightforward way to increase the power, by allowing users to populate the entry through any mechanism, and optionally do extra calculations when the entry is populated (think statistics, sum-of-squares, etc).

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Also, if I have to list out all the fields I'm going to use, that's another big step backwards.

I don't think that this is needed at all, nor would it be a step backwards - not even close. But since we aren't on an issue about FieldType let's just drop it...

A factory would be a pretty straightforward way to increase the power, by allowing users to populate the entry through any mechanism, and optionally do extra calculations when the entry is populated (think statistics, sum-of-squares, etc).

Whatever you call it (using Factory is fine) but isn't that what you mentioned to be insufficient? I mean this is something you would pass to a FieldCache.getXXX, right?

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

One thing I've wanted to do for a long time, but didn't get to doing it, is open up FieldCache to allow the application to populate the entries from other sources - specifically payloads. I wrote a sorting solution which relies solely on payloads, and wanted to contribute it to Lucene, but due to the lack of FieldCache hook points, I didn't find the time to do the necessary refactoring.

Sorting based on payloads-data has several advantages:

  1. It's much faster to read than iterating on the lexicon and parsing the term values into sortable values.
  2. If your application needs to cater for sorting over 10s of millions of documents, or if it needs to keep its RAM usage low, you can do the sort while reading the payload data as the search happens. It's faster than if it was in RAM, but the current FieldCache does not allow you to sort w/o RAM consumption.
  3. You don't inflate your lexicon w/ sort values, affecting other searches. In some situations, you can add a unique term per document for the sort values (such as when sorting by date and require up to a millisecond precision).

I'm bringing it up so that if you consider any refactoring to FieldCache, I'd appreciate if you can keep that in mind. If the right hooks will open up, I'll make time to contribute my sort-by-payload package. If you don't, then it'll need to wait until I can find the time to do the refactoring.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Whatever you call it (using Factory is fine) but isn't that what you mentioned to be insufficient? I mean this is something you would pass to a FieldCache.getXXX, right?

I was suggesting handling it the same way as FieldCache.Parser - it's set on the SortField. But instead of just being able to control parsing of a term (which is too limited), it needs to be able to control everything. (This would solve Shai's needs too)

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

open up FieldCache to allow the application to populate the entries from other sources

+1

specifically payloads

If CSF did not exist, I'd be totally on board with this... but it looks to be right around the corner now. Are there any advantages to using payloads over CSF for fieldcache population?

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

This is a band-aid, but we could consider adding something like:

  public void setCacheValidBitsForFields( Set<String> names );

on FieldCache, then checking if the field is in that set before making the BitSet

When solr reads the schema, it could look for any fields that have sortMissingLast and then call:

  FieldCache.DEFAULT.setCacheValidBitsForFields()

The factory idea also sounds good, but I don't see how it would work without big big changes

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Are there any advantages to using payloads over CSF for fieldcache population?

Well .. payloads already exist (in my code :)), while CSF is "just around the corner" for a long time. While the two ultimately achieve the same goal, CSF is more generic than just payloads, and if we'd want to take advantage of it w/ FieldCache, I assume we'll need to make more changes to FieldCache, because w/ CSF, people can store arbitrary byte[] and request to cache them. So sorting data is a subset of CSF indeed, but I think the road to CSF + CSF-FieldCache integration is long. But perhaps I'm not up-to-date and there is progress / someone actually working on CSF?

Anyway, opening up FC to read from payloads seems to me a much easier solution, because besides reading the stuff from the payload, the rest of the classes continue to work the same (TopFieldCollector, Comparators etc.).

Maybe a slight change to SortField will be required as well though, not sure yet.

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Uwe: "For DocTerms the approach is not needed..."

Ya I realized this after looking at the patch I first submitted. In the first patch, the cache holds a CachedObject rather than just an Object. In the second, I changed back to just an Object so it does not need to wrap the DocTerms or DocTermsIndex

For the RangeFilter, with optional Bits calculation, that code would look something like:

        LongValues cached = FieldCache.DEFAULT.getLongValues(reader, field, (FieldCache.LongParser) parser);
        final long[] values = cached.values;
        if( cached.valid == null ) {
          // ignore deleted docs if range doesn't contain 0
          return new FieldCacheDocIdSet(reader, !(inclusiveLowerPoint <= 0L && inclusiveUpperPoint >= 0L)) {
            @Override
            boolean matchDoc(int doc) {
              return values[doc] >= inclusiveLowerPoint && values[doc] <= inclusiveUpperPoint;
            }
          };
        }
        else {
          final Bits valid = cached.valid;
          return new FieldCacheDocIdSet(reader, true) {
            @Override
            boolean matchDoc(int doc) {
              return valid.get(doc) && values[doc] >= inclusiveLowerPoint && values[doc] <= inclusiveUpperPoint;
            }
          };
        }

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

public void setCacheValidBitsForFields( Set<String> names );

Solr doesn't even know all of the fields at the time it reads its schema. And even if it did... this would seem to break multi-core or anything trying to have more than one index where the fields are different. Seems like this needs to be passed down via SortField, just like FieldCache.Parser. A factory makes this a more generic method than adding additional params to SortField every time we think of something like this... then we can add stuff like getFieldCacheParser() and other stuff to the factory.

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

oh right – that's true. Is a global flag sufficient?

In lucene it could default to false and in solr default to true.

I know we don't want to just keep adding more things to memory, but I'm not sure there is a huge win by selectively enabling and disabling some fields.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

oh right - that's true. Is a global flag sufficient?

Yeah, solr could just always default it to on. We don't know what kind of ad-hoc queries people will throw at solr and the 3% size increase (general case 1/32) seems completely worth it.
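The 1/32 figure is one bit per document compared with a 32-bit value per document. A rough sketch of the arithmetic, ignoring object headers and assuming the bit set is backed by 64-bit words (the class and method names are made up for illustration):

```java
final class ValidBitsOverhead {
    /** Extra memory as a fraction of the values array: one bit per bitsPerValue bits. */
    static double overheadFraction(int bitsPerValue) {
        return 1.0 / bitsPerValue;
    }

    /** Bytes used by one bit per document, rounded up to whole 64-bit words. */
    static long validBitsBytes(int maxDoc) {
        return ((maxDoc + 63L) / 64L) * 8L;
    }

    /** Bytes used by the values array itself. */
    static long valuesBytes(int maxDoc, int bitsPerValue) {
        return (long) maxDoc * (bitsPerValue / 8);
    }
}
```

For an int field, `overheadFraction(Integer.SIZE)` is 0.03125 (about 3%); for a byte field, `overheadFraction(Byte.SIZE)` is 0.125. With 10,000,000 documents that is 1,250,000 bytes of bits next to a 40,000,000-byte int array.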

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

I added a static flag to CachedArray:

  public abstract static class CachedArray {
    public static boolean CACHE_VALID_ARRAY_BITS = false;

    public final Bits valid;
    public CachedArray( Bits valid ) {
      this.valid = valid;
    }
  };

and then set it to true in the SolrCore static initializer.

If folks are ok with this approach, I'll clean up the javadocs etc

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

FYI, I like the idea of revisiting the FieldCache, but I don't see a straightforward path.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am against the configuration option to enable the additional BitSet. The problem is that you cannot control it for each usage of the FieldCache, as it is a static flag. We agreed in the past that we will remove all static defaults from Lucene (e.g. BQ.maxClauseCount) together with system properties. This flag can cause strange problems with 3rd party code (like when you lower the BQ maxClauseCount and suddenly your queries fail).

The overhead of the OpenBitSet is very marginal (for integers only 1/32, as Yonik said). If you have memory problems with the FieldCache, this 1/32 would not hurt you, as you should think about your whole configuration then (like moving from ints to shorts or something like that).

So: Please don't add any static defaults or sysprops! Please, please, please!

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I was suggesting handling it the same way as FieldCache.Parser - it's set on the SortField. But instead of just being able to control parsing of a term (which is too limited), it needs to be able to control everything. (This would solve Shai's needs too)

We started down this path with #1906 - you could pass some *UnInverter on the sort field if I remember right, so that pretty much everything could be overridden. It has come up a lot - we really need this level of customizability eventually.

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

I'm all for dropping the static flag and always calculating the valid bits – it makes things accurate with minimal cost.

I am sympathetic to folks who don't want this, and I'm not sure the cleanest way to support both options, or even if it is actually worthwhile.

Do people see this 'option' as a showstopper? If so, is there an easy way to configure it? Without statics, the flag would need to be fetched from each parser, and the parser does not know what FieldCache it is used from (using FieldCache.DEFAULT is just as bad as the static flag IIUC)

asfimport commented 13 years ago

Marvin Humphrey (migrated from JIRA)

> So: Please don't add any static defaults or sysprops! Please, please, please!

+1

No global variables which control behavior, please.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I know it's only 3% (for ints... 12.5% for bytes), but, 3% here, 3% there and suddenly we're talking real money.

Lucene can only stay lean and mean if we don't allow these little 3% losses here and there!!

Let's try to find some baby-step (even if not clean – we know FieldCache, somehow, needs to be fixed more generally) for today?

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Let's try to find some baby-step (even if not clean - we know FieldCache, somehow, needs to be fixed more generally) for today?

The cheapest option might be:

  public interface Parser extends Serializable {
    public boolean recordMissing();
  }

A better option is to replace FieldCache.Parser in SortField to be FieldCache.EntryCreator.

Oh, and if we're recording all the set bits, it would be really nice to record both the number of values set and the number of unique values encountered.

Both should be zero or non-measurable cost (a counter++ that does not produce a data dependency can be executed in parallel on a free int execution unit)

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Are people generally ok with the idea of global on/off? I think that is a reasonable approach... I agree that we should avoid static fields to control behavior. But do we avoid it at the cost of not allowing the option, or waiting till we rework FieldCache?

If the consensus is that FieldCache needs to be reworked before something like this could be added, that's fine... I'll move on to other things. Any relatively easy suggestions for how to enable the option without a global static? (Note that FieldCache is already a global static – at least FieldCache.DEFAULT is referenced a lot)

Perhaps this could/should live in /trunk until a cleaner solution is viable?

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am against that option! No static defaults! (and if it must be there - default it to true on Lucene, too).

the number of values set

This is OpenBitSet.cardinality() ? I don't think we should add this extra cost during creation, as it can be retrieved quite easily if really needed.

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

I like the idea of optionally caching the numdocs and unique values – that would make sorting by this field faster – the ArrayValues class could be easily augmented with this.

The problem with augmenting the Parser class as you suggest is that we would have to rejigger everything that touches the parser. We would need different default classes for things that want or don't want the missing records. How do we handle this bit:

      if (parser == null) {
        try {
          return wrapper.getIntValues(reader, field, DEFAULT_INT_PARSER);
        } catch (NumberFormatException ne) {
          return wrapper.getIntValues(reader, field, NUMERIC_UTILS_INT_PARSER);
        }
      }

yuck

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

If ya care - don't pass a null parser! Otherwise you get the default.

This is OpenBitSet.cardinality()

Which isn't free... and calculating it over and over again is silly if you care about those numbers.

I don't think we should add this extra cost during creation,

I don't think it will add extra cost. I could be wrong, but I don't think it will be measurable.
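A sketch of what counting during the fill might look like, assuming a single-valued field so each document is visited at most once. This is plain Java for illustration only: `FillCountSketch` is hypothetical and `java.util.BitSet` stands in for OpenBitSet.

```java
import java.util.BitSet;

/** Hypothetical result of a fill: the valid bits plus a count taken while filling. */
final class FillCountSketch {
    final BitSet valid;
    final int numDocsWithValue;

    private FillCountSketch(BitSet valid, int numDocsWithValue) {
        this.valid = valid;
        this.numDocsWithValue = numDocsWithValue;
    }

    /**
     * Marks every document in docsWithValue as valid and counts them in the same loop.
     * Assumes each doc ID appears at most once (single-valued field).
     */
    static FillCountSketch fill(int maxDoc, int[] docsWithValue) {
        BitSet valid = new BitSet(maxDoc);
        int count = 0;
        for (int doc : docsWithValue) {
            valid.set(doc);
            count++; // independent of the bit set update, so no second pass is needed
        }
        return new FillCountSketch(valid, count);
    }
}
```

Under that assumption the stored count equals `valid.cardinality()`, but a caller that needs the number repeatedly reads a field instead of rescanning the bit set each time.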

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

If ya care - don't pass a null parser! Otherwise you get the default.

What if I care, but something else (that does not care) asks for the value first? Seems odd to have so much depend on who asks for the value first

A better option is to replace FieldCache.Parser in SortField to be FieldCache.EntryCreator.

How would that work? What if a filter creates the cache before the SortField?

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

What if I care, but something else (that does not care) asks for the value first? Seems odd to have so much depend on who asks for the value first

As long as it can be passed everywhere that matters, then it's up to the application - which knows if it ever needs the missing values or not for that field. For solr, we could make it configurable per-field... but I'd prob default it to ON to avoid unpredictable weirdness.

What if a filter creates the cache before the SortField?

If we have a filter that uses the field cache, then it should also be able to specify the same stuff that SortField can.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I agree that we should avoid static fields to control behavior. But do we avoid it at the cost of not allowing the option, or waiting till we rework FieldCache?

I agree with this sentiment - progress, not perfection. Being able to turn it on or off for everything in the process is better than nothing at all.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

How would that work?

We could start off simple - add only recordMissing functionality and punt on the rest, while still leaving a place to add it.

class FieldCache {

  public abstract static class EntryCreator {
    // off by default; override to have the valid bits recorded
    public boolean recordMissing() {
      return false;
    }

    public abstract Parser getParser();
  }
}

Not even sure if a whole hierarchy is needed at this point... in the future, we'd prob want

  public static class EntryCreatorInt extends EntryCreator {
    public IntValues getIntValues(IndexReader reader, String field) {... code currently in FieldCacheImpl that fills the fieldCache...}
    ...
  }

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Maybe, but I'm still not sure this cleans things up enough to be worth the trouble – ideally the API should make it easy to get consistent results. I don't like that it would be too easy to mess things up if the application does not use the same parser from various components (that may be in different libraries etc). Conceptually it makes sense to have settings about what is or is not cached attached to the FieldCache itself, not to the things that ask the FieldCache for its values – and letting whoever asks first set the behavior for the next guy who asks (regardless of what they ask for!).

If we are going to make it essentially required to always pass in the right Parser/EntryCreator, we should at least remove all the ways of not passing one in – since that call is saying "use whatever is there, and the next guy who asks should be ok with it too"

Does something like the EntryCreator idea fix – or at least begin to fix – the other FieldCache issues? If not, is it really worth introducing just to avoid a static variable?

I think the best near term option is to live with the static initializer, and fix it when we rework the FieldCache to fix a host of other issues. For solr the default will be set to always calculate, for lucene... we will let Mike and Uwe duke it out :)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Supporting different parsers is not an issue at all. You can call getBytes() with different parsers, you simply create two entries in the cache, as each parser produces a different cache instance. And getBytes() without parser is also fine, as then you get the default parser from the cache (which would not create a third instance!). - [Parser is part of the cache key]
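A minimal sketch of the "parser is part of the cache key" behaviour described above. Everything here is hypothetical (`ParserKeyedCacheSketch`, its `ByteParser`, the `termPerDoc` input); it is not FieldCacheImpl. As a simplification, a null parser is mapped to the default parser before the lookup, so both share one entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class ParserKeyedCacheSketch {
    /** Hypothetical stand-in for a term parser. */
    interface ByteParser {
        byte parseByte(String term);
    }

    static final ByteParser DEFAULT_PARSER = Byte::parseByte;

    /** Cache key: the field name plus the parser instance (compared by identity). */
    private static final class Key {
        final String field;
        final ByteParser parser;

        Key(String field, ByteParser parser) {
            this.field = field;
            this.parser = parser;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return field.equals(other.field) && parser == other.parser;
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, System.identityHashCode(parser));
        }
    }

    private final Map<Key, byte[]> cache = new HashMap<>();
    private int loads = 0;

    /** termPerDoc[doc] is the indexed term for that doc, or null if the doc has none. */
    byte[] getBytes(String field, ByteParser parser, String[] termPerDoc) {
        ByteParser effective = (parser == null) ? DEFAULT_PARSER : parser;
        Key key = new Key(field, effective);
        byte[] cached = cache.get(key);
        if (cached == null) {
            cached = new byte[termPerDoc.length];
            for (int doc = 0; doc < termPerDoc.length; doc++) {
                if (termPerDoc[doc] != null) {
                    cached[doc] = effective.parseByte(termPerDoc[doc]);
                }
            }
            cache.put(key, cached);
            loads++; // one load per distinct (field, parser) pair
        }
        return cached;
    }

    int loads() {
        return loads;
    }
}
```

Two different parsers on the same field produce two arrays, while a null parser and the default parser share one. The flip side is that each extra entry holds a full copy of the values.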

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

I thought of an optimization that could reduce memory usage...

If all non-deleted documents have a value, we don't need a real BitSet – just a Bits implementation that always returns true.

That should save 3% (or 12.5%) here and there.
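Roughly, the check could look like the sketch below. It uses a tiny stand-in `Bits` interface and `java.util.BitSet` rather than Lucene's classes, and all names (`ValidBitsChooser`, `choose`, `BitSetBits`) are made up for illustration. A null `deleted` argument means "no deletions".

```java
import java.util.BitSet;

final class ValidBitsChooser {
    /** Minimal stand-in for a read-only bit view. */
    interface Bits {
        boolean get(int index);
        int length();
    }

    /** Constant-size view that reports every document as valid. */
    static final class MatchAllBits implements Bits {
        private final int len;
        MatchAllBits(int len) { this.len = len; }
        public boolean get(int index) { return true; }
        public int length() { return len; }
    }

    /** View backed by a real bit set. */
    static final class BitSetBits implements Bits {
        private final BitSet bits;
        private final int len;
        BitSetBits(BitSet bits, int len) { this.bits = bits; this.len = len; }
        public boolean get(int index) { return bits.get(index); }
        public int length() { return len; }
    }

    /** Keeps the real bit set only if some live (non-deleted) document lacks a value. */
    static Bits choose(BitSet validBits, BitSet deleted, int maxDoc) {
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean live = deleted == null || !deleted.get(doc);
            if (live && !validBits.get(doc)) {
                return new BitSetBits(validBits, maxDoc);
            }
        }
        return new MatchAllBits(maxDoc);
    }
}
```

Note that the match-all view can report a deleted document as valid, so callers that care still need their own deletion check.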


One other thing to consider... do we want to remove the getXXXX functions that do not take a Parser? Passing in null is equivalent?

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Here is the code for ByteValues that:

  1. optionally stores the BitSet via static config
  2. does not cache a real BitSet unless only some docs match
  3. calculates numDocs/numTerms

    @Override
    protected ByteValues createValue(IndexReader reader, Entry entryKey) throws IOException {
      Entry entry = entryKey;
      String field = entry.field;
      ByteParser parser = (ByteParser) entry.custom;
      if (parser == null) {
        return wrapper.getByteValues(reader, field, FieldCache.DEFAULT_BYTE_PARSER);
      }
      int numDocs = 0;
      int numTerms = 0;
      int maxDoc = reader.maxDoc();
      final byte[] retArray = new byte[maxDoc];
      Bits valid = null;
      Terms terms = MultiFields.getTerms(reader, field);
      if (terms != null) {
        final TermsEnum termsEnum = terms.iterator();
        final Bits delDocs = MultiFields.getDeletedDocs(reader);
        final OpenBitSet validBits = new OpenBitSet( maxDoc );
        DocsEnum docs = null;
        try {
          while(true) {
            final BytesRef term = termsEnum.next();
            if (term == null) {
              break;
            }
            final byte termval = parser.parseByte(term);
            docs = termsEnum.docs(delDocs, docs);
            while (true) {
              final int docID = docs.nextDoc();
              if (docID == DocsEnum.NO_MORE_DOCS) {
                break;
              }
              retArray[docID] = termval;
              validBits.set( docID );
              numDocs++;
            }
            numTerms++;
          }
        } catch (StopFillCacheException stop) {}

        // If all non-deleted docs are valid we don't need the bitset in memory
        if( numDocs > 0 && CachedArray.CACHE_VALID_ARRAY_BITS ) {
          boolean matchesAllDocs = true;
          for( int i=0; i<maxDoc; i++ ) {
            // delDocs may be null when the reader has no deletions
            if( (delDocs == null || !delDocs.get(i)) && !validBits.get(i) ) {
              matchesAllDocs = false;
              break;
            }
          }
          if( matchesAllDocs ) {
            valid = new Bits.MatchAllBits( maxDoc );
          }
          else {
            valid = validBits;
          }
        }
      }
      if( numDocs < 1 ) {
        valid = new Bits.MatchNoBits( maxDoc );
      }
      return new ByteValues( retArray, valid, numDocs, numTerms );
    }

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Any thoughts on this?

I think the best move forward is to:

  a. optimize as much as possible
  b. drop the no-parser function option
  c. optionally store the bitset via static config (ugly, but lesser of many ugly options)
  d. set lucene default=false (actually I don't care)
  e. set solr default=true

Unless there are objections, I will clean up the patch, fix javadoc, tests, etc

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Also set the Lucene default to true, as I want to improve sorting and FCRF.

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Here is a (hopefully) final patch that adds a bunch of tests to exercise the 'valid' bits (and check that MatchAll is used when appropriate)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Ryan,

few comments:

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also set the Lucene default to true

Please don't!

as I want to improve sorting and FCRF.

But: sorting, FCRF must continue to work if the app chooses not to load valid bits, right?

Other feedback on current patch:

It looks like the valid bits will not reflect deletions (by design), right? Ie caller cannot rely on valid always incorporating deleted docs. (Eg the MatchAll opto disregards deletions, and, a reopened segment can have new deletions yet shares the FC entry).

The static config still also bothers me... and, going that route means we must agree on a default (which is looking hard!).

What if we:

This way if an app "messes up", they do not end up double-storing the actual values, ie the worst that happens is they have to re-invert just to generate the valid bits. Even that should be fairly rare, ie, if they use MissingStringLastComparator it'll init both values & valid bits entries in the cache on the first go.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also set the Lucene default to true, as I want to improve sorting and FCRF.

I know it's only 3% (for ints... 12.5% for bytes), but, 3% here, 3% there and suddenly we're talking real money.

I'm having trouble understanding the use case for this bitset.

The jira issue says to add a bitset, but doesn't explain why.

The linked thread talks about this being useful for sorting missing values last, but I don't think this justifies increasing the size of fieldcache by default.

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Here is a new patch that removes the static config. Rather than put a property on the Parser class, I added a class:

  public abstract static class CacheConfig {
    public abstract boolean cacheValidBits();
  }

and this gets passed to the getXXXValues function:

ByteValues getByteValues(IndexReader reader, String field, ByteParser parser, CacheConfig config)

I think this is a better option than adding a parameter to Parser since we can have an easy upgrade path. Parser is an interface, so we can not just add to it without breaking compatibility. To change things in 4.x, 3.x should have an upgrade path.

I took Mike's suggestion and included the CacheConfig hashcode in the Cache key – however, I don't cache the Bits separately since this is an edge case that should be avoided, but at least does not fail if you are not consistent.

This does cache a MatchAllBits even when 'cacheValidBits' is false, since that is small (a small class with one int)


*  We don't have to `@Deprecate` for 4.0 - just remove it, and note this in MIGRATE.txt. (Though for 3.x we need the deprecation, so maybe do 3.x patch first, then remove deprecations for 4.0?).

My plan was to apply with deprecations to 4.x, then merge with 3.x. Then replace the calls in 4.x, then remove the old functions. Does this sound reasonable?

I would like this to get in 3.x since we could then remove many solr types in 4.x and have a 3.x migration path.

  • FieldCache.EntryCreator looks orphan'd?

dooh, thanks

It looks like the valid bits will not reflect deletions (by design), right? Ie caller cannot rely on valid always incorporating deleted docs. (Eg the MatchAll opto disregards deletions, and, a reopened segment can have new deletions yet shares the FC entry).

Right, the ValidBits are only checked for docs that exist (and the FC values are only set for docs that exist – this has not changed), and may contain false positives for deleted docs. I think this is OK since most use cases (I can think of) deal with deletions anyway. Any ideas how/if we should change this? (I did not realize that the FC is reused after deletions – so clever)


I'm having trouble understanding the use case for this bitset.

My motivation is for supporting the sortMissingLast feature in solr sorting (that could now be pushed to lucene). For example if I have a bunch of documents and only some have the field "bytes" – sorting 'bytes desc' works great, but sorting 'bytes asc' puts all the documents that do not have the field 'bytes' first since the FieldCache thinks they are all zero.

If we get this working in solr, we can deprecate and delete all the "sortable" number fields and have that same functionality on Trie* fields.
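To make the sorting problem concrete, here is a minimal sketch of "sort ascending, missing last" built on a values array plus a valid bitset. It uses plain `java.util.BitSet` rather than Lucene's `Bits`, and the class and method names (`MissingLastSketch`, `sortAscMissingLast`) are hypothetical, not part of the patch.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

public class MissingLastSketch {

    /** Returns doc ids ordered by ascending value; docs without a value go last. */
    static Integer[] sortAscMissingLast(byte[] values, BitSet valid) {
        Integer[] docs = new Integer[values.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        Arrays.sort(docs, Comparator
                // false (has a value) sorts before true (missing)
                .comparing((Integer d) -> !valid.get(d))
                // among docs in the same group, order by the cached value
                .thenComparingInt(d -> values[d]));
        return docs;
    }

    public static void main(String[] args) {
        // doc 1 holds a real 0; doc 3 has no value, but the array still reads 0
        byte[] values = {5, 0, 3, 0};
        BitSet valid = new BitSet(values.length);
        valid.set(0);
        valid.set(1);
        valid.set(2);
        System.out.println(Arrays.toString(sortAscMissingLast(values, valid)));
        // prints [1, 2, 0, 3]
    }
}
```

Without the bitset, docs 1 and 3 are indistinguishable and both would sort first.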

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

If folks think that being able to tell a real "0" from a missing value is not useful for Lucene, we could extend Ryan's CacheConfig to include a factory method that creates / populates ByteValues, IntValues, etc. Then all the bitset stuff could be kept in Solr only. I'm sensitive about pushing stuff into Lucene that is only useful for Solr.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this is a better option than adding a parameter to Parser since we can have an easy upgrade path. Parser is an interface, so we cannot just add to it without breaking compatibility. To change things in 4.x, 3.x should have an upgrade path.

Hmm... I'd rather make an exception to 3.x, ie, allow the addition of this method to the interface, than confuse the 4.x API, going forward, with 2 classes?

Creating a custom FieldCache parser is an extremely advanced use case... very few users do this, and those that do will grok this method?

However, I don't cache the Bits separately since this is an edge case that should be avoided, but at least does not fail if you are not consistent.

This makes me nervous since it can now lead to further cases of field cache insanity, ie, you loaded it once w/o the valid bits, and again w/ the valid bits, and now your values array is taking up 2X the RAM.

It's already bad enough that FC allows one kind of insanity :)
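A rough sketch of how this duplication could arise, assuming the cache key folds in the config and the values array is stored per key. The names here (`InsanitySketch`, `getValues`, the string key format) are hypothetical and only stand in for the real cache structure.

```java
import java.util.HashMap;
import java.util.Map;

public class InsanitySketch {

    // Hypothetical cache: one values array per (field, config) key.
    static final Map<String, byte[]> CACHE = new HashMap<>();

    static byte[] getValues(String field, boolean cacheValidBits, int maxDoc) {
        // The config is part of the key, so the same field can be loaded twice.
        String key = field + "|validBits=" + cacheValidBits;
        return CACHE.computeIfAbsent(key, k -> new byte[maxDoc]);
    }

    public static void main(String[] args) {
        byte[] without = getValues("bytes", false, 1000);
        byte[] with = getValues("bytes", true, 1000);
        // Two distinct arrays now hold the same field's values.
        System.out.println(without != with);  // prints true
        System.out.println(CACHE.size());     // prints 2
    }
}
```

Caching the bits under their own key, separate from the values, would avoid the second array.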

This does cache a MatchAllBits even when 'cacheValidBits' is false, since that is small (a small class with one int)

Hmm... but if I pass false here, it shouldn't spend any time allocating the bit set, building it, checking the bit set for "all bits set", etc.?
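Something along these lines is what I have in mind: a sketch where no bitset work happens at all when the flag is false. It uses `java.util.BitSet` and a simplified `ByteValues`; the `load` signature and its inputs (`docsWithValue`, `fieldValues`) are hypothetical stand-ins for the real term enumeration.

```java
import java.util.BitSet;

public class OptionalValidBitsSketch {

    static final class ByteValues {
        final byte[] values;
        final BitSet valid; // null when the caller did not ask for valid bits

        ByteValues(byte[] values, BitSet valid) {
            this.values = values;
            this.valid = valid;
        }
    }

    /** Fills the values array; allocates and sets valid bits only on request. */
    static ByteValues load(int maxDoc, int[] docsWithValue, byte[] fieldValues,
                           boolean cacheValidBits) {
        byte[] values = new byte[maxDoc];
        BitSet valid = cacheValidBits ? new BitSet(maxDoc) : null;
        for (int i = 0; i < docsWithValue.length; i++) {
            int doc = docsWithValue[i];
            values[doc] = fieldValues[i];
            if (valid != null) {
                valid.set(doc);
            }
        }
        return new ByteValues(values, valid);
    }
}
```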

*  We don't have to `@Deprecate` for 4.0 - just remove it, and note this in MIGRATE.txt. (Though for 3.x we need the deprecation, so maybe do 3.x patch first, then remove deprecations for 4.0?).

My plan was to apply with deprecations to 4.x, then merge with 3.x. Then replace the calls in 4.x, then remove the old functions. Does this sound reasonable?

OK that sounds like a good plan!

Right, the ValidBits are only checked for docs that exist (and the FC values are only set for docs that exist -- this has not changed), and may contain false positives for deleted docs. I think this is OK since most use cases (I can think of) deal with deletions anyway. Any ideas how/if we should change this?

I think this is the right approach – expecting FC's valid bits to take deletions into account is too much. We have IR.getDeletedDocs for this.

But, eg this means classes like FCRF will still have to consult deleted docs.

Really, "in general" we need a better way for the query execution path to enforce deleted docs. Eg if the FCRF will be AND'd w/ a query that's already excluding del docs then it need not be careful about deletions...
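As a sketch of what a caller like FCRF would have to do, assuming both sets are exposed as `java.util.BitSet` (the real API uses `Bits`) and that a null deleted set means the segment has no deletions. The class and method names are hypothetical.

```java
import java.util.BitSet;

public class ValidAndLiveSketch {

    /**
     * True only if the doc has a cached value and is not deleted.
     * The valid bits alone may still be set for a deleted doc.
     */
    static boolean hasLiveValue(int doc, BitSet valid, BitSet deleted) {
        return valid.get(doc) && (deleted == null || !deleted.get(doc));
    }

    public static void main(String[] args) {
        BitSet valid = new BitSet();
        valid.set(0);
        valid.set(1);
        BitSet deleted = new BitSet();
        deleted.set(1); // deleted after the FC entry was built

        System.out.println(hasLiveValue(0, valid, deleted)); // prints true
        System.out.println(hasLiveValue(1, valid, deleted)); // prints false
        System.out.println(hasLiveValue(1, valid, null));    // prints true
    }
}
```

If the surrounding query already excludes deleted docs, the second check is redundant and could be skipped.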

(I did not realize that the FC is reused after deletions -- so clever)

Ha! There was a time when it didn't ;)

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

My motivation is for supporting the sortMissingLast feature in solr sorting (that could now be pushed to lucene).

If folks think that being able to tell a real "0" from a missing value is not useful for Lucene, we could extend Ryan's CacheConfig to include a factory method that creates / populates ByteValues, IntValues, etc. Then all the bitset stuff could be kept in Solr only. I'm sensitive about pushing stuff into Lucene that is only useful for Solr.

I'm very much +1 for making this (exposing the valid bitset) possible in Lucene.

Users have asked over time how they can tell if a given doc has a field value.

And being able to distinguish missing values, eg to sort them last, or to do something else, is useful. Once we do this we should also [eventually] move "sort missing last" capability into Lucene's comparators.