
FieldCache should include a BitSet for matching docs [LUCENE-2649] #3723

Closed: asfimport closed this issue 13 years ago

asfimport commented 14 years ago

The FieldCache returns an array representing the values for each doc. However there is no way to know if the doc actually has a value.

This should be changed to return an object representing the values and a BitSet for all valid docs.
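
For illustration only, a minimal sketch (class and field names hypothetical, using the org.apache.lucene.util.Bits interface discussed in the comments below) of what such a combined result could look like:

    // Hypothetical sketch, not the actual API proposed in the patch:
    // pair the per-document values with a Bits telling which docs really had a value.
    public class IntValuesAndValidBits {
      public final int[] values;   // value per docID (0 when the doc had no value)
      public final Bits valid;     // valid.get(docID) == true iff the doc had a value

      public IntValuesAndValidBits(int[] values, Bits valid) {
        this.values = values;
        this.valid = valid;
      }
    }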


Migrated from LUCENE-2649 by Ryan McKinley (@ryantxu), resolved Mar 25 2011.
Attachments: LUCENE-2649-FieldCacheWithBitSet.patch (versions: 6)

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am also strongly +1 for the additional Bits interface (as Ryan did it, it does not always need to be a real OpenBitSet: when there are no deletions and all docs have a value, we can use a dummy one). I have often had use cases where I needed to know whether a document really has a value set or not, and I don't use Solr so much.

And being able to distinguish missing values, eg to sort them last, or to do something else, is useful. Once we do this we should also [eventually] move "sort missing last" capability into Lucene's comparators.

+1

I think this is the right approach - expecting FC's valid bits to take deletions into account is too much. We have IR.getDeletedDocs for this.

We don't need to AND them together; maybe simply wrap the OpenBitSet in a custom Bits impl that ANDs in the getter? But as deletions are kept separate in IndexReader, and the cache can be reused even when new deletions are added, I think keeping it separate is fine.

About the whole bit set: do we really need to couple the Bits interface to the type? If you exchange the parser/native type (e.g. parse ints as byte), the valid docs are still the same; only the native type representation is different. So how about we add a getBits(field) method to FieldCache that returns the valid docs? If the field was not yet retrieved as a native type it could throw IllegalStateException, else it would return the Bits instance (global per field, but not per parser/datatype) created during the last FC population run. We would then also have the possibility to disable the default generation of Bits and do it lazily (which should run faster, as it does not need to parse the values, only enumerate terms and TermDocs).
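
For concreteness, a hedged sketch of what that addition could look like (the exact signature is hypothetical; only the getBits idea and the IllegalStateException/lazy options come from the comment above):

    // Hypothetical addition to FieldCache -- not committed API.
    // Returns the valid-docs Bits for a field, shared across all native-type
    // views of that field; could throw IllegalStateException if the field has
    // not been populated yet, or build the Bits lazily instead.
    public Bits getBits(IndexReader reader, String field) throws IOException;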

Really, "in general" we need a better way for the query execution path to enforce deleted docs. Eg if the FCRF will be AND'd w/ a query that's already excluding del docs then it need not be careful about deletions...

That's another thing, but maybe we should remove deleted docs completely from query processing and simply apply them like a filter before the collector. Not sure about the implications and performance.

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Hmm... I'd rather make an exception to 3.x, ie, allow the addition of this method to the interface, than confuse the 4.x API, going forward, with 2 classes?

That is OK with me. Would be cleaner and simpler. (though semantically it does not make sense to me – why ask the parser what to cache?)

>> This does cache a MatchAllBits even when 'cacheValidBits' is false, since that is small (a small class with one int)

bq. Hmm... but if I pass false here, it shouldn't spend any time allocating the bit set, building it, checking the bit set for "all bits set", etc.?

Well, it does not try hard: it only checks numDocs==maxDocs and otherwise does not look at anything. If the cost of caching new MatchAllBits( maxDocs ) isn't worth the occasional win of knowing all the values are valid, then I will remove it.
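
For reference, a sketch of such a dummy implementation (assuming the trunk Bits interface exposes get(int) and length(); the actual class in the patch may differ):

    // The "small class with one int" mentioned above: reports every doc as valid.
    public final class MatchAllBits implements Bits {
      private final int maxDoc;

      public MatchAllBits(int maxDoc) {
        this.maxDoc = maxDoc;
      }

      public boolean get(int index) {
        return true;
      }

      public int length() {
        return maxDoc;
      }
    }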

So how about we add a getBits(field)

Interesting... I'll mess with it for a bit and let you know what I think :)

Rather than throwing an exception, that might be a flag, since I could imagine many things would use the Bits if they exist and do something else if they don't.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hmm... I'd rather make an exception to 3.x, ie, allow the addition of this method to the interface, than confuse the 4.x API, going forward, with 2 classes?

Same here, we already defined the FieldCache "interface" as subject to change. Maybe we should simply remove it in trunk and only have a class? This interface was never of any use, because you were not able to supply any other field cache implementation (the DEFAULT field is final because all fields in interfaces are defined as final by the Java Language Spec).

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Regardless of if there is a separate getBits(field), I think we should add/use ByteValues, IntValues, etc. It's just so much more extensible going forward.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Regardless of if there is a separate getBits(field), I think we should add/use ByteValues, IntValues, etc. It's just so much more extensible going forward.

+1

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

So how about we add a getBits(field) method to FieldCache that returns the valid docs.

This would be great!

Assuming we separately cache the valid docs, we could then allow caching only valid docs, for apps that want to know if a doc has a value but do not need the full array of values RAM resident.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

That is OK with me. Would be cleaner and simpler. (though semantically it does not make sense to me - why ask the parser what to cache?)

Yeah that is weird. Maybe we rename Parser -> EntryCreator?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We should also allow the parser to stop iterating terms without the strange StopFillCacheException (needed for NumericFields).

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Regardless of if there is a separate getBits(field), I think we should add/use ByteValues, IntValues, etc. It's just so much more extensible going forward.

If we have a separate getBits( field ) call, should the Bits be added to the XxxValues class? I suspect not.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If we have a separate getBits( field ) call, should the Bits be added to the XxxValues class? I suspect not.

Yeah I think it need not be.

But EntryCreator should still be able to state that it'd like the bits computed & cached as a side effect? And, if the bits weren't already computed, then they'd be computed on demand? (This enables caching only the valid bits and not the values array.)

If we do this then we can leave cutover to XXXValues (still a good idea) as a separate issue?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think it should be a separate entry in the cache, only regenerated if it does not already exist for the field/IR combination. So there are these combinations:

Alternatively:

With a config option, one could switch automatic population of Bits off, so in the first combination only the last call to getBits() would populate the bit set.

In all cases this is easily possible by having a separate cache with a separate population method for the bits. If some method like getBytes() also populates the Bits, it should simply add the created Bits manually to the cache.

Instead of a global config option, we could simply add a boolean to the getXxx methods, to tell the cache whether it should also populate Bits (if not already done). The default would maybe be false for Lucene, but Solr would always pass true.
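
A hedged sketch of what that extra boolean could look like on one of the getters (the parameter name is hypothetical):

    // Hypothetical variant of an existing getter: when populateBits is true the
    // cache also records, per doc, whether the field had a value.
    public int[] getInts(IndexReader reader, String field, IntParser parser,
                         boolean populateBits) throws IOException;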

Does this sound like a plan?

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

If we do this then we can leave cutover to XXXValues (still a good idea) as a separate issue?

I'm already in deep... I'd like to keep the XxxValues in this patch, and use a different one to bring everything up-to-date with the new API in 3.x/4.x

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

If we have a separate getBits( field ) call, should the Bits be added to the XxxValues class? I suspect not.

Seems a lot easier just to put all the optional stuff on the *Values class - more performant too (and avoids more synchronized lookups)

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

In all cases this is easily possible by having a separate cache with a separate population method for the bits.

This seems more complex, and less extensible. What's the issue with just putting the bits reference down on CachedArray?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

It's not more complicated, it's easier. The Bits are a real separate thing; they are just a cache of the information whether values are assigned or not. They do not depend on the data type like byte, int, ... It's just separate. And as said before, if somebody requests the same field in different types, it would only have one Bits. Also one could request the Bits alone, or could first request a field without bits (using a boolean) and later again with bits, in which case they are lazily loaded.

The implementation would be simple, similar to the way the caches are filled for NumericFields (it adds the values two times to the cache, with the null parser and the actually used parser). In this case you would simply request the bits cache on e.g. the int[] creation; if not available, also build the bits in parallel and add them to the bits cache.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

What's the issue with just putting the bits reference down on CachedArray?

One risk was added insanity cases, ie if you looked up once w/ the bits and later w/o, it double-stores the values array.

Another gain of separating the bits retrieval is it becomes possible to get only the valid bits (ie, w/o a value array), for apps that just want to know if a given doc had a field.

But we could probably still achieve these two benefits while using a single class for looking up everything "cached" about a field? Ie, the CachedArray could return non-null bits but null values?

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

But we could probably still achieve these two benefits while using a single class for looking up everything "cached" about a field? Ie, the CachedArray could return non-null bits but null values?

Exactly. And with NRT and increasing number of segments, the number of synchronized lookups per segment could really start to matter.

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

more performant too (and avoids more synchronized lookups)

This was my motivation for putting it all together and using the options as part of the key. But yes, inconsistent usage will eat up RAM. That is better than my original fear that inconsistent usage would give you unpredictable results!

Also, with the current Cache implementation, we would need to somehow be able to add two cache entries from within Cache.createEntry() – either rewrite all that or hack in some way to pass the FieldCache to the createEntry method.

Keeping the values and bits in different cache entries is pretty ugly (especially for the normal case where we want them both all the time)

    // two separate lookups against the same cache for the same field:
    return new ByteValues(
      (byte[]) caches.get(Byte.TYPE).get(reader, new Entry(field, parser /* ?? */)),
      (Bits)   caches.get(Byte.TYPE).get(reader, new Entry(field, parser /* ?? */))
    );

But we could probably still achieve these two benefits while using a single class for looking up everything "cached" about a field? Ie, the CachedArray could return non-null bits but null values?

Brainstorming here... if Parser -> EntryCreator and 'EntryCreator.hashCode()' is used as the map key (as it is now):

  public abstract class EntryCreator implements Serializable {
    /** Should the valid-docs Bits be computed and cached? */
    public boolean cacheValidBits() {
      return false;
    }
    /** Should the values array be computed and cached? */
    public boolean cacheValues() {
      return true;
    }
  }

The trick would be to use the same key regardless of what we ask for (values but no bits - bits but no values - bits and values, etc) and then fill up whatever is missing if it is not in the existing cache.

That might suggest that the 'class' could be the key, but setting cacheValidBits/cacheValues would need to be implemented by inheritance, so that is out.

Other ideas? Directions I am not thinking about?

asfimport commented 14 years ago

J.J. Larrea (migrated from JIRA)

I only just waded through this thread, so apologies in advance if this is redundant or off-topic...

It seems to me that there could and should be a standalone enhancement to FieldCache/FCImpl to support Boolean-valued fields.

Since there is no native array-of-bits in Java, it could have the signature:

BitSet getBits(IndexReader reader, String field, BooleanParser parser)  [implementation returning an OpenBitSet for efficiency]

A pre-supplied BooleanParser implementation StringMatchBooleanParser could map any one of a set of case-insensitive strings to true, and a default subclass eg. DefaultStringMatchBooleanParser could supply { "T", "TRUE", "1", "Y", "YES" } as the set of strings. So the defaulted and typical case getBits( ir, "field" ) would do what one typically expects of boolean-valued fields.
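
A sketch of what those parser classes could look like (everything here, including the BooleanParser interface itself, is hypothetical and only follows the naming in the comment above):

    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;
    import org.apache.lucene.util.BytesRef;

    // Hypothetical -- not an existing Lucene interface.
    interface BooleanParser {
      boolean parseBool(BytesRef term);
    }

    class StringMatchBooleanParser implements BooleanParser {
      private final Set<String> trueValues = new HashSet<String>();

      StringMatchBooleanParser(String... values) {
        for (String v : values) {
          trueValues.add(v.toUpperCase(Locale.ROOT));
        }
      }

      public boolean parseBool(BytesRef term) {
        // uncased match against the configured set of "true" strings
        return trueValues.contains(term.utf8ToString().toUpperCase(Locale.ROOT));
      }
    }

    class DefaultStringMatchBooleanParser extends StringMatchBooleanParser {
      DefaultStringMatchBooleanParser() {
        super("T", "TRUE", "1", "Y", "YES");
      }
    }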

With that in place, then couldn't one simply define a parser that indicates value present for a docID regardless of what the term value is:

public static BooleanParser AlwaysReturnTrueBooleanParser = new BooleanParser() {
  public boolean parseBool(BytesRef term) { return true; }
};

BitSet getValueExists(IndexReader reader, String field) {
   return getBits( reader, field, AlwaysReturnTrueBooleanParser );
}

Then a client (e.g. FieldComparator implementation) interested in ValueExists values could ask for them, and they would be independently cached from whatever other field type cache(s) were requested on that field by the same or different clients. The only cost would be iterating the Term/docID iterators a second time (as for additional cache variants on the same field) - minor.

Does this make sense?

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

The only cost would be iterating the Term/docID iterators a second time

Actually, that's major - otherwise I would have already just done that and stored it in a separate cache for Solr's needs.

asfimport commented 14 years ago

J.J. Larrea (migrated from JIRA)

Actually, that's major - otherwise I would have already just done that and stored it in a separate cache for Solr's needs.

Is the one-time-per-IndexReader-lifecycle cost of multiplying the cache load time by some factor < 2.0 (since the term values don't need to be decoded), really so terrible that one has to contemplate global state variables, or a constant increase in cache memory, or significant API changes, or the potential for double-allocation (with then an additional 1x cache load time), or increased code complexity, ...? Even with all the lovely Solr support for parallel pre-warming?

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Is the one-time-per-IndexReader-lifecycle cost of multiplying the cache load time by some factor < 2.0 ... really so terrible

It can be... on a big index just iterating all the terms/docs takes a long time. Try the LukeRequestHandler on an index with a million+ docs!


Here is a different variation; it changes a lot, but if we are talking about changing Parser from an interface to a class, then I guess the cat is already out of the bag.

What about something like:

  ...

  public class EntryConfig implements Serializable 
  {
    public Parser getParser() {
      return null;
    }
    public boolean cacheValidBits() {
      return false;
    }
    public boolean cacheValues() {
      return true;
    }

    /**
     * The HashCode is used as part of the Cache Key (along with the field name).  
     * To allow multiple calls with different parameters, make sure the hashCode 
     * does not include the specific instance and parameters.
     */
    public int hashCode()
    {
      return EntryConfig.class.hashCode();
    }
  }

  public abstract class CachePopulator 
  {
    public abstract void fillValidBits(  CachedArray vals, IndexReader reader, String field, EntryConfig creator ) throws IOException;
    public abstract void fillByteValues( CachedArray vals, IndexReader reader, String field, EntryConfig creator ) throws IOException;
    ...
  }

  public abstract CachePopulator getCachePopulator();

...

  public ByteValues getByteValues(IndexReader reader, String field, EntryConfig creator )

...

The field cache implementation would make sure what you asked for is filled up before passing it back (though I think this has some concurrency issues):

  public ByteValues getByteValues(IndexReader reader, String field, EntryConfig config) throws IOException
  {
    ByteValues vals = (ByteValues) caches.get(Byte.TYPE).get(reader, new Entry(field, config));
    if( vals.values == null && config.cacheValues() ) {
      populator.fillByteValues(vals, reader, field, config);
    }
    if( vals.valid == null && config.cacheValidBits() ) {
      populator.fillValidBits(vals, reader, field, config);
    }
    return vals;
  }

The Cache would then delegate the creation to the populator:

    @Override
    protected final ByteValues createValue(IndexReader reader, Entry entry, CachePopulator populator) throws IOException {
      String field = entry.field;
      EntryConfig config = (EntryConfig)entry.custom;
      if (config == null) {
        return wrapper.getByteValues(reader, field, new EntryConfig() );
      }
      ByteValues vals = new ByteValues();
      if( config.cacheValues() ) {
        populator.fillByteValues(vals, reader, field, config);
      }
      else if( config.cacheValidBits() ) {
        populator.fillValidBits(vals, reader, field, config);
      }
      else {
        throw new RuntimeException( "the config must cache values and/or bits" );
      }
      return vals;
    }

The fillByteValues would be the same code as always, but I think the CachedArray should make sure the same parser is used every time:

    @Override
    public void fillByteValues( CachedArray vals, IndexReader reader, String field, EntryConfig config ) throws IOException
    {
      ByteParser parser = (ByteParser) config.getParser();
      if( parser == null ) {
        parser = FieldCache.DEFAULT_BYTE_PARSER;
      }
      // Make sure it is the same parser
      int parserHashCode = parser.hashCode();
      if( vals.parserHashCode != null && vals.parserHashCode != parserHashCode ) {
        throw new RuntimeException( "Subsequent calls with different parser!" );
      }
      vals.parserHashCode = parserHashCode;
     ...

This is different from the current code, where asking for the cached values with two different parsers (that return different hashcodes) will make two entries in the cache.

This approach would let us:

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Now we're talking!

Q: why aren't the CachePopulator methods just directly on EntryConfig - was it easier to share implementations that way or something?

Also:

Anyway, all that's off the top of my head - I'm sure you've thought about it more at this point.

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Q: why aren't the CachePopulator methods just directly on EntryConfig - was it easier to share implementations that way or something?

Two reasons (but I can be talked out of it)

  1. this approach separates what you are asking for (bits/values/etc) from how they are actually generated (the "populator"). Something makes me uncomfortable about the caller asking for Values also needing to know how they are generated. Seems easy to mess up. With this approach the 'populator' is attached to the field cache and defines how stuff is read, vs the 'EntryConfig' that defines what the user is asking for (particularly since they may change what they are asking for in subsequent calls)

  2. The 'populator' is attached to the FieldCache so it has consistent behavior across subsequent calls to getXxxxValues(). Note that with this approach, if you ask the field cache for just the 'values' and then later want the 'bits', it uses the same populator and adds the results to the existing CachedArray value.

It doesn't seem like we need two methods, fillValidBits and fillByteValues

The 'fillValidBits' just fills up the valid bits w/o actually parsing (or caching) the values. This is useful when:

  1. you only want the ValidBits, but not the values (Mike seems to want this)
  2. you first ask for just values, then later want the bits.

Thinking some more, I think the populator should look like this:

public abstract class CachePopulator 
  {
    public abstract ByteValues   createByteValues(   IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract ShortValues  createShortValues(  IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract IntValues    createIntValues(    IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract FloatValues  createFloatValues(  IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract DoubleValues createDoubleValues( IndexReader reader, String field, EntryConfig config ) throws IOException;

    public abstract void fillByteValues(   ByteValues   vals, IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract void fillShortValues(  ShortValues  vals, IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract void fillIntValues(    IntValues    vals, IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract void fillFloatValues(  FloatValues  vals, IndexReader reader, String field, EntryConfig config ) throws IOException;
    public abstract void fillDoubleValues( DoubleValues vals, IndexReader reader, String field, EntryConfig config ) throws IOException;

    // This will only fill in the ValidBits w/o parsing any actual values
    public abstract void fillValidBits( CachedArray  vals, IndexReader reader, String field, EntryConfig config ) throws IOException;
  }

The default 'create' implementation could look something like this:

    @Override
    public ShortValues createShortValues( IndexReader reader, String field, EntryConfig config ) throws IOException 
    {
      if( config == null ) {
        config = new SimpleEntryConfig();
      }
      ShortValues vals = new ShortValues();
      if( config.cacheValues() ) {
        this.fillShortValues(vals, reader, field, config);
      }
      else if( config.cacheValidBits() ) {
        this.fillValidBits(vals, reader, field, config);
      }
      else {
        throw new RuntimeException( "the config must cache values and/or bits" );
      }
      return vals;
    }

And the Cache 'createValue' would look something like this:

  static final class ByteCache extends Cache {
    ByteCache(FieldCache wrapper) {
      super(wrapper);
    }

    @Override
    protected final ByteValues createValue(IndexReader reader, Entry entry, CachePopulator populator) throws IOException {
      String field = entry.field;
      EntryConfig config = (EntryConfig)entry.custom;
      if (config == null) {
        return wrapper.getByteValues(reader, field, new SimpleEntryConfig() );
      }
      return populator.createByteValues(reader, field, config);
    }
  }

Thoughts? This would open up lots more of the field cache... so if we go this route, let's make sure it addresses the other issues people have with FieldCache. IIUC, the other big request is to load the values from an external source – that should be possible with this approach.

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Oh... I mis-matched the parens when I was looking at your proposal (hence the confusion).

I think getCachePopulator() should be under EntryConfig - that way people can provide their own (and extend ByteValues to include more info). Otherwise, we'll forever be locked into a lowest common denominator of only adding info that everyone can agree on.

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

I think getCachePopulator() should be under EntryConfig - that way people can provide their own (and extend ByteValues to include more info)

So you think it is better for each call to define how the cache works, rather than having that as an attribute of the FieldCache (that could be extended). The one thing that concerns me is that it forces all users of the FieldCache to be in sync.

In this proposal, you could set the CachePopulator on the FieldCache.

Otherwise, we'll forever be locked into a lowest common denominator of only adding info that everyone can agree on.

This is why I just added the 'createXxxxValues' functions on CachePopulator – a subclass could add other values.


It looks like the basic difference between what we are thinking is that the Populator is attached to the FieldCache rather than to each call to the FieldCache. From my point of view, this would make it easier for a system with a schema (like Solr) to have consistent results across all calls, rather than making each request to the FieldCache need to know about the schema -> parsers -> populator.

but I can always be convinced ;)

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

In this proposal, you could set the CachePopulator on the FieldCache.

Hmmm, OK, as long as it's possible.

From my point of view, this would make it easier for a system with a schema (like Solr) to have consistent results across all calls, rather than making each request to the FieldCache need to know about the schema -> parsers -> populator.

I think this may make it a lot harder from Solr's point of view.

Are EntryConfig objects stored as keys anywhere? We need to be very careful about memory leaks.

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It's all doable though I guess - even if EntryConfig objects are used as cache keys, we could store a weak reference to the solr core. So I say, proceed with what you think will make it easy for Lucene users - and don't focus on what will be easy for Solr.

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Preface: I don't really know how FieldCache is used, so my assumptions could be way off...

In Solr, is there one FieldCache for all cores, or does each core get its own FieldCache?

I figured each core would create a single CachePopulator (with a reference to the schema) and attach it to the FieldCache. If that is not possible, then ya, it will be better to put that in the request.

Are EntryConfig objects stored as keys anywhere? We need to be very careful about memory leaks.

Yes, the EntryConfig is part of the 'Entry' and gets stored as a key.

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

In Solr, is there one FieldCache for all cores, or does each core get its own FieldCache?

There is a single FieldCache for all cores (same as in Lucene).

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Passing it in would also allow a way to get rid of the StopFillCacheException hack for NumericField in the future.

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Ok, as I look more, I think it may be worth some even bigger changes!

Is there any advantage to having a different map for each Type? The double (and triple) cache can get a bit crazy and lead to so much duplication

What about moving to a FieldCache that is centered around the very basic API:

public <T> T get(IndexReader reader, String field, EntryCreator<T> creator)

The EntryCreator would be something like:

public abstract static class EntryCreator<T> implements Serializable 
  {
    public abstract T create( IndexReader reader, String field );
    public abstract void validate( T entry, IndexReader reader, String field );

    /**
     * NOTE: the hashCode is used as part of the cache Key, so make sure it 
     * only changes if you want different entries for the same field
     */
    @Override
    public int hashCode()
    {
      return EntryCreator.class.hashCode();
    }
  }

We could add all the utility functions that cast stuff to ByteValues etc. We would also make sure that the Map does not use the EntryCreator as a key, but uses it to generate a key.

A sample EntryCreator would look like this:

class BytesEntryCreator extends FieldCache.EntryCreator<ByteValues> {

  @Override
  public ByteValues create(IndexReader reader, String field) 
  {
    // all the normal walking stuff using whatever parameters we have specified
  }

  @Override
  public void validate(ByteValues entry, IndexReader reader, String field) 
  {
    // all the normal walking stuff using whatever parameters we have specified
  }  
}
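
A hypothetical usage of that generic get(...) together with the creator above (the field name and the values/valid fields on ByteValues are placeholders, not committed API):

    // Hypothetical usage of the proposed generic API.
    ByteValues bytes = FieldCache.DEFAULT.get(reader, "myField", new BytesEntryCreator());
    if (bytes.valid == null || bytes.valid.get(docID)) {
      byte b = bytes.values[docID];   // only meaningful when the doc actually had a value
    }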

Thoughts on this approach?

Crazy how a seemingly simple issue just explodes :(

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Hmmm, that would also seem to transform the FieldCache into a more generic index reader cache - not a bad idea!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

+1

Maybe we should also look at CSF again and what Simon did (#3262). In my opinion, the field cache's public API should not look different from CSF, so one can simply also sort against a CSF.

I know, some people here will hurt me if I suggest to remove the native arrays and instead provide getter methods like in the ValueSource approach. The native arrays are inflexible, and Java 6 will hopefully optimize away the additional method call (at least I have seen no speed penalty when trying with CSF's getter API). Cool things could be done, like materializing the FieldCache to disk using mmap via e.g. FileChannel.map(...).order(ByteOrder.BIG_ENDIAN).asFloatBuffer(), which is then accessible using get(docId). I tested this and it works very well for sorting in Lucene! Java internally uses source code specialization to return different classes depending on native byte order that access the underlying ByteBuffer directly (not manually combining 4 bytes into a float). So the get(docId) call is only bounds checks and one mmapped memory access.
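
A rough sketch of that mmap idea (the file name, layout, and variables are hypothetical; it only illustrates the FloatBuffer access pattern described above):

    // Hypothetical: materialize one float per docID to a file, then read it back
    // via mmap; get(docID) is then just a bounds check plus one memory access.
    RandomAccessFile raf = new RandomAccessFile(new File(indexDir, "field.f32"), "r");
    FloatBuffer perDocValue = raf.getChannel()
        .map(FileChannel.MapMode.READ_ONLY, 0, raf.length())
        .order(ByteOrder.BIG_ENDIAN)
        .asFloatBuffer();

    float v = perDocValue.get(docID);   // value for this document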

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Hmmm, that would also seem to transform the FieldCache into a more generic index reader cache - not a bad idea!

But is it a good one?! This would let the FieldCache just focus on the synchronized cache mechanism, and each EntryCreator would need to do its own parsing, etc.

Anyone know what the deal is with IndexReader:

  /** Expert */
  public Object getCoreCacheKey() {
    return this;
  }

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Anyone know what the deal is with IndexReader:

It is to share the cache for clones of IndexReaders, or when SegmentReaders are reopened with different deleted docs. In this case the underlying reader is the same, so it should use its cache (e.g. when deleted docs are added, you don't need to invalidate the cache).
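
In other words, per-segment caches key on the core reader rather than on the reader instance; roughly (a sketch, assuming a simple WeakHashMap-backed cache and a hypothetical buildEntry step):

    // Sketch: entries keyed on getCoreCacheKey() survive clones/reopens that
    // only change deletions, because those share the same core key.
    Map<Object, Object> byCore = new WeakHashMap<Object, Object>();

    Object key = reader.getCoreCacheKey();
    Object entry = byCore.get(key);
    if (entry == null) {
      entry = buildEntry(reader);   // hypothetical expensive uninversion step
      byCore.put(key, entry);
    }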

For more info, ask Mike McCandless!

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Uwe, I think we need to keep the native arrays.

Java 6 will hopefully optimize away the additional method call

It only does so if you have one implementation at the point where it is used. We just got done specializing the ord sorting code with native arrays because of this - the speed hit was really non-trivial, and it happened with the latest versions of all of the JVMs I tested (Oracle 1.6, Oracle 1.7, IBM 1.6). I see no relief for this issue on the horizon.

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

remove the native arrays and instead provide getter methods like in the ValueSource approach

What would this look like? Are you suggesting that rather than having:

class ByteValues {
  public byte[] values;
}

we have:

abstract class ByteValues {
  public abstract byte getValue( int doc );
}

or are you suggesting something like DocValues, which has intVal, longVal, floatVal, even though only one really makes sense?

Is this something that would need the proposed FieldCache API to change? Or could it be implemented via EntryCreator?

If we like the more general cache, that probably needs its own issue (wow scope creep!)

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Yonik: I was expecting this answer...

The reason is that my current contact (it's also yours) has exactly that problem with norms (but also FC): they want to lazily load values for sorting/norms (see the very old issue LUCENE-505). At least we should have a TopFieldDocCollector that, as an alternative to native arrays, can also use a ValueSource-like approach with getter methods - so you could sort against a CSF. Even if it is 20% slower, in some cases that's the only way to get a suitable search experience. Speed is not always the most important thing; sometimes space requirements or warmup times matter too. I would have no problem with providing both and choosing the implementation that is most speed-effective. So if no native arrays are provided by the FieldCache, use getter methods.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I see no relief for this issue on the horizon.

We need to specialize the code... either manually or automatically...

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

If we like the more general cache, that probably needs its own issue

Correct - with a more general fieldCache, one could implement alternatives that MMAP files, etc. But those alternative implementations certainly should not be included in this issue.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I just wanted to mention that, so the design of the new FC is more flexible in that case. I am just pissed off because of these arrays and no flexibility :-(

The FC impl should be in line with the CSF approach from #3262.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I like where this all is going! We can finally fix FC!

In my opinion, the field cache's public API should not look different from CSF, so one can simply also sort against a CSF.

+1

From the consumption standpoint, they (FC and CSF) really ought to be one and the same.

What's "unique" about FieldCache is that it derives its values via uninversion... this is nice because there's no index change, but it's slow at reader open time. It's also error-proned (you may hit multiple values per doc, these values may have gone through analysis, etc.)

CSF, instead, actually computes the values during indexing and stores the raw, pre-computed array, in the index.

They are both just different sources for the same thing.

Also, an app should be free to plugin its own external source, and it should present this same "values source" API.

Uwe, I think we need to keep the native arrays.

I think the API should allow for optional retrieval of the backing array (and we should [manually, for today] specialize the sort comparators), but primary access should be a method call eg ByteValues.getValue(int docID).

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Uwe, I think we need to keep the native arrays.

I think the API should allow for optional retrieval of the backing array (and we should [manually, for today] specialize the sort comparators), but primary access should be a method call eg ByteValues.getValue(int docID).

Exactly. Maybe do it like the NIO buffers: they have methods hasArray(), array() and arrayOffset(), and the last two throw UnsupportedOperationException if the first returns false. We already have quite a lot of TopFieldDocCollector impls as inner classes; a few more chosen by hasArray()... haha :-)
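
Along those lines, a sketch of what such a values class could look like (names hypothetical, just mirroring the NIO-buffer pattern described above):

    // Hypothetical: primary access is a getter; the backing array is optional.
    public abstract class ByteValues {
      /** Primary access path: value for one document. */
      public abstract byte getValue(int docID);

      /** True if a backing native array is available. */
      public boolean hasArray() {
        return false;
      }

      /** Backing array, for specialized sort comparators. */
      public byte[] array() {
        throw new UnsupportedOperationException();
      }
    }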

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It feels like incremental improvement keeps being held hostage... Can't we first just allow the retrieval of ByteValues, etc, that also have Bits on it? Changing everything to go through getValue() should be a separate issue (and CSF isn't even finalized/committed yet).

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Can't we first just allow the retrieval of ByteValues, etc, that also have Bits on it?

I sure hope so... otherwise, I doubt anything will move forward.

My hope with #3739 is to fix the basic problem – in a way that does not close the door to the improvements we want to make in the future. I think/hope we are close.

If I'm missing something let me know, so I can stop wasting my time.

(I don't even know what CSF is, and do not want to learn right now)

asfimport commented 14 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Here is a new patch that incorporates most of the ideas we have discussed. I tried addressing the larger issue of the FieldCache mess in #3739, but that is too big to tackle in one go. After getting something to almost work for #3739, I then just took the EntryCreator stuff and am adding that to 'LUCENE-2649'.

As such, some of the choices about how EntryCreator works are based on future plans, and may feel awkward today. Specifically:

  1. In the future, a Cache on the IndexReader should not necessarily be tied to a field name. To do this, the field name parameter should be part of the EntryCreator. In this first pass, we will need to pass the field name twice:
IntValues  vals = cache.getInts(reader, "fieldName", new IntValuesCreator( "fieldName", parser, flags ) )

I think the tradeoff is OK, and makes fewer changes in the future.

  2. In the future, the EntryCreator.getCacheKey() should be the only key stored. To fall within the existing structure, the entire EntryCreator is stored in the 'custom' field on the internal cache, but the equals and hashCode values are bubbled up. This makes more sense for #3739. For now we need to be careful that the EntryCreator classes are reasonable things to store as keys (each includes the Parser, etc).

I added a bunch of tests to exercise how sequential calls with different options behave.

I think this patch is ready to commit to /trunk – when it is in, I'll make a patch for #3739.

Since /trunk is now a bit more experimental, and I feel pretty good about the feedback this has had, I will probably jump the gun and commit soon.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Apologies for 'CTR' rather than 'RTC' -- we can always revert if I jumped the gun!

Better to ask forgiveness than permission :)

In fact I'm +1 on switching Lucene's trunk to CTR model instead, now that we have 3.x as the stable branch. We have enough "policemen" around here that I think this'd work well.

The changes look great Ryan – nice work!

Some smallish feedback:

I think we should hold off on backport to 3.x until we stabilize #3739?

It looks like you've also fixed #3601 with this? Ie the fasterButMoreRAM=true|false now cache to the same key? It's just that perhaps we should "upgrade" the entry, if it was first created w/ false and then the current call passes true?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we have a synchronization issue on the call to validate? Ie, more than 1 thread can enter validate, and eg compute the valid bits (if they weren't computed on the first create())?

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It's just that perhaps we should "upgrade" the entry, if it was first created w/ false and then the current call passes true?

We just need to watch the thread safety of stuff like this. For Bits, it should be trivial... if you didn't ask for bits, you shouldn't be looking at it. fasterButMoreRAM is different... upgrading an existing entry could be tricky.
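
For the Bits case, the lazy fill during validate could be guarded along these lines (a sketch only, reusing the hypothetical names from the earlier snippets; the actual patch may handle this differently):

    // Sketch: make the lazy "fill what's missing" step idempotent under a lock,
    // so two threads entering validate() don't both (or half-) populate it.
    synchronized (vals) {
      if (vals.valid == null && config.cacheValidBits()) {
        populator.fillValidBits(vals, reader, field, config);
      }
      if (vals.values == null && config.cacheValues()) {
        populator.fillByteValues(vals, reader, field, config);
      }
    }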