apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

[PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField [LUCENE-2133] #3209

Open asfimport opened 14 years ago

asfimport commented 14 years ago

Hi all,

up to the current version Lucene contains a conceptual flaw, that is the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open

The FieldCache is flawed because it is incorrect to assume that:

  1. one IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data.
  2. the cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information.
  3. all IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl.

Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner-classes that could be moved to package level.

There have been a few attempts to improve FieldCache, namely #1906, #2653 and #2823, but the overall situation remains the same: There is a central registry for assigning Caches to IndexReader instances.

I now propose the following:

  1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances ("IndexCache"). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache)
  2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations.
  3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary.
  4. Obsolete FieldCacheSanityChecker because no more "insanities" are expected (or at least, they do not impact the overall performance)
  5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField)

I have provided an patch which takes care of all these issues. It passes all JUnit tests.

The patch is quite large, admittedly, but the change required several modifications and some more to preserve backwards-compatibility. Backwards-compatibility is preserved by moving some of the updated functionality in the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping old code in org.apache.lucene.search.

In detail and besides the above mentioned improvements, the following is provided:

  1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache.
  2. A housekeeping improvement to CloseableThreadLocal. Now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances.
  3. Analyzer.close now may throw an IOException (this already is covered by java.io.Closeable).
  4. A change to Collector: allow IndexCache instead of IndexReader being passed to setNextReader()
  5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the "switch" statements and the possibility to throw IllegalArgumentExceptions because of unsupported type values.

The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields:

Final notes:

Cheers, Christian


Migrated from LUCENE-2133 by Christian Kohlschütter, updated Sep 23 2010 Attachments: LUCENE-2133.patch (versions: 3), LUCENE-2133-complete.patch Linked issues:

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

The main patch (for src/java)

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Patch for JUnit tests (src/test) to enable them to use the new API (may be applied after approval of main patch)

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Patch for demo code (src/demo), may be applied after approval of main patch

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Patch for contrib code (src/contrib), may be applied after approval of main patch

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks for the patch and the explanation!

Some notes, without a deep insight:

There may be more problems, but I will review the patch more detailed!

Uwe

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Hi Uwe,

thanks for reviewing. Just a quick response to your comment:

The problem with your current patch is the same as with the other propsals: backwards compatibility is not yet adressed, but is the most hairy problem with all other approaches.

As I wrote, the patch does provide backwards compatibility. Please correct me if I am wrong :) I think actually over-stressed the backwards-compatibility issue because many classes were marked as "Expert; can be changed in incompatible ways in the next release" (e.g. SortField/FieldComparator, Collector)

The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on.

You can access the IndexReader from IndexCache and vice versa. The patch actually contains a few changes for contrib where this is actually utilized.

Passing IndexCache instead of IndexReader to setNextReader gives a slight gain in performance for most cases (i.e. whenever only the IndexCache is accessed), since there is no need to traverse the IndexReader-to-IndexCache method chain (IndexReader#getIndexCache() and the synchronized getOrCreateIndexCache()) for each call to setNextReader.

In all the other cases, newCache.getIndexReader() gives full access to the IndexReader. There even is a default method which calls the (now-to-be-deprecated) newIndexReader(IndexReader,int) method, so there are actually zero changes necessary for existing code.

Also the patch file changes files, that are not related or removes import statements, that are clearly marked as "// javadocs only".

The corresponding references in the javadocs comments have been removed (e.g. FieldCache has been replaced to IndexFieldCache), so it made sense to remove the imports as well.

For version 3.1: The "affects version", setting should be 2.9/3.0, as 3.1 is not yet released. An fix version cannot be yet given, thats correct.

Indeed.

Best regards, Christian

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The change of setNextReader() to use only the cache may not be the correct approach, as the collection of results is IndexReader-specific and not cache-specific (the Collectors in core for sorting only need that, but other collectors outside Lucene, e.g. in Solr need the IndexReader). It is e.g. needed for building filters and so on.

This is one of the backwards breaks. As noted, the Collector abstract methods cannot be changed and should not, as the IndexReader is the important part used for collecting the results. The cache is only used by sorting but not in general. Use of IndexCache here would be a design flaw, because it combines unrelated things.

And setNextReader is only called seldom (only once for each segment). It would have an impact if you would need to call synchronized methods inside collect().

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Hi Uwe,

your arguments seem reasonable to me. I have added a new patch (src/java only), leaving all test and contrib classes intact (but still passes all tests). The new patch now keeps setNextReader as it is.

Cheers, Christian

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Improved patch (src/java only), now without modification to newIndexReader.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Added code to automatically evict entries from IndexFieldCache upon close of IndexCache/IndexReader (see also LUCENE-2135)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Christian, could you roll up all patches into a single attachment? (Actually it looks like the separate demo, contrib, test, core patches were removed).

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Michael, LUCENE-2133.patch now contains all the patches for src/java. I have removed the test, contrib and demo files because they actually do not belong to the patch (since LUCENE-2133.patch really provides complete backwards compatibility now). It would not make much sense to claim backwards compatibility and at the same time modify a bunch of test classes, would it? :)

Nevertheless, I am now preparing an updated LUCENE-2133-complete.patch with all these additional patches included. In the meantime you may simply apply LUCENE-2133.patch to your local trunk copy and see if it works for you.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I have removed the test, contrib and demo files because they actually do not belong to the patch (since LUCENE-2133.patch really provides complete backwards compatibility now).

Ahh OK

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

The same patch as LUCENE-2133.patch plus the changes in src/test and contrib to fully utilize the new IndexFieldCache and SortField code.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

There are a bunch or unrelated changes (imports/names/exception thrown) that should be pulled from this patch.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hmm ... nevermind. The exception is related and most of the imports are correct - brain spin.

Didn't see that

import org.apache.lucene.search.SortField; // for javadocs

wasn't being used anymore anyway.

import org.apache.lucene.search.fields.IndexFieldCache in NumericQuery should get a //javadoc so someone doesn't accidently remove it.

And I guess the t to threadLocal change doesn't hurt with the amount your changing that anyway. Its a better name.

This looks pretty nice overall.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Mark, could you please name the changes you would like to be excluded? I thought I had only included necessary changes.

Some dependent changes are not so obvious, but necessary.

For example, Analyzer.close now needs to throw an IOException because of CloseableThreadLocal now closing an IOException because of doClose() throwing an IOException because it may close references from an IndexCache that might throw an IOException. Luckily this is covered by java.io.Closeable (which declared #close() to throw an IOException), and Analyzer implements that interface already.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Hmm ... nevermind. The exception is related and most of the imports are correct - brain spin.

OK, no problem.

This looks pretty nice overall.

Thanks :-)

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

A couple more quick notes:

I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, its not likely to affect many and it was experimental - so its a tough call. Its a lot of classes in there ;)

I'm also not sure if fields is the right package name? And do the Filters belong in that package?

Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though :)

Rather then changing the tests to the new classes, we should prob copy them and make new ones - then remove them when the deprecations are removed.

Also, you should pull the author tag(s) - all credit is through JIRA and Changes. (I only see it like once, so I bet thats eclipse?)

I havn't done a thorough review it all, but this is pretty great stuff to appear so complete and out of nowhere :)

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter).

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I dont quite understand why we would have no more insanity? What happens when you get the cache from a multireader and then from each segment reader within it? You double up right? Insanity?

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, its not likely to affect many and it was experimental - so its a tough call. Its a lot of classes in there

I think it makes sense to move the stuff into another package because o.a.l.search is already filled with a lot of classes. Since there are no package-level-protection dependencies to o.a.l.search, I think it does not hurt either.

I'm also not sure if fields is the right package name? And do the Filters belong in that package?

I couldn't really come up with a better name. It's all about the fields somehow (caches, comparators and value parsers).

Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though

Yes. I thought it's better to make it as clear as possibe.

Also, almost a non issue, but extending a deprecated class is going to be an ultra minor back compat break when its removed. Not likely a problem though. But we might put a note to that affect to be clear. It is almost self documenting anyway though

Rather then changing the tests to the new classes, we should prob copy them and make new ones - then remove them when the deprecations are removed.

Yes, good idea. This way we can test the old and the new behaviour at once. Maybe it is enough to add new test methods to the same class?

Also, you should pull the author tag(s) - all credit is through JIRA and Changes. (I only see it like once, so I bet thats eclipse?)

Sorry, Eclipse defaults. I thought I had removed all of them.

I haven't done a thorough review it all, but this is pretty great stuff to appear so complete and out of nowhere

Yes, it also just came over me the last weekend ;-)

I will make a new patch tomorrow (CET time) with the new test cases incorporated.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think it does not hurt either.

I didn't notice that you actually just deprecated the originals - I guess thats not a complete rug pull ...

By the way, I don't think you need to deprecate something in a new class ( IndexFieldCacheImpl):

  /**
   * `@deprecated` Use {`@link` #clear()} instead.
   */
  public void purgeAllCaches() {
    init();
  }
asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter).

Good catch. Since o.a.l.search.field.StringIndex extends o.a.l.search.StringIndex it's just a matter of declaration. The imports can indeed be removed (they were only required for javadocs, which should now also refer to the new classes instead). I have made the changes locally and will put them into the next update for this patch.

I dont quite understand why we would have no more insanity? What happens when you get the cache from a multireader and then from each segment reader within it? You double up right? Insanity?

The MultiReader has a different cache than the internal SegmentReaders, there is no interconnection between the two. While change the testcases, my patch initially even triggered a JUnit failure because of this – o.a.l.util.TestFieldCacheSanityChecker expected a cache error which now will not happen anymore.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they really intend to.

edit

This type of change actually even exaggerates that problem (though if we want to improve things here, its something we will have to deal with).

Now you might have a mixture of old api/new api caches as well if you don't properly upgrade everything at once.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

I didn't notice that you actually just deprecated the originals - I guess thats not a complete rug pull ...

Writing the new code was one day, then making it backwards compatible with all these deprecations another one :-b I actually wouldn't care so much about backwards compatibility in the most cases, but I think it is better now to allow a smooth transition to the new code.

By the way, I don't think you need to deprecate something in a new class ( IndexFieldCacheImpl):

Indeed. This was a leftover from the early changes to src/test code. It's removed now (locally).

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

After removing the very problematic change to the Collector class (which is very central in Lucene and should not be changed again after 2.9), thatI told you to do in the morning, the other changes are no longer too intrusive. I am quite happy with your new patch and I now also like it very much. It is as a good candiate for replacing the very, very ugly FieldCache impl we currently have.

I am not really sure, if the package name is good and I would like to also add Earwins changes and not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component (like all the other flex parts). For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching, but not realy basic functionality (lots of users have other sorting algos). Also not sure if all classes in search that contain "FieldCache" should be renamed. The FieldCacheTermsFilter and RangeFilter only use the cache internally, they should simply be changed to use the new API and maybe only get additional ctors for the other parser instance classes. So some stuff still needs to be changed.

If it fits good together with the new flexible indexing branch, I see no problems with appling it soon. So its all good work. It is a pity, tht we heard not much from you in the past, the patch suddenly appeared in JIRA and almost nobody know you. I only found one introduction from you long time ago on java-dev. Are you still working at L3S? If yes, send nice greetings also to Jan Brase! :-)

This patch and the flex branch together and the many deprecations would make a 3.9 soon and 4.0 without all ugly stuff soon would be nice. I again would do the cleaning heavy commiting police man!

So, good work. Thanks!

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they really intend to.

The new IndexFieldCacheSanityChecker can be removed out-of-the-box (as documented in the class itself). It is just kind of a "showcase" class for this issue. The section I commented out is non-functional with the new change since there is no more ReaderKey in IndexFieldCache.CacheEntry.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

But we still want its functionality - we still want to check for "doubling" up insanity.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component

At a minimum, you should be able to set the cache for the reader.

For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching

The way he has it, this is not just for the fieldache, but also the fieldsreader and vectorreader - if we go down that road, we should consider norms as well.

I see no problems with appling it soon

I still think it might be a little early. This has a lot of consequences.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Uwe, thanks a lot for your assessment!

I will definitely look at the flex branch and see what needs to be changed.

Yes, I still work at L3S, working hard for my PhD (planned to finish mid 2010). This is probably the main reason for not being so active in the Lucene community in the past. However, as I use Lucene for research and commercial systems over the last years, I always try to contribute back patches whenever possible.

Jan Brase doesn't work at L3S anymore, but I will happily forward the greetings.

Cheers, Christian

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I still think it might be a little early. This has a lot of consequences.

I mean we should not wait for years like with #1906 :-) No heavy committing please g.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

New changes from the discussions incorporated into the src/java patch. The complete patch (including src/test and contrib) follows tomorrow.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

But we still want its functionality - we still want to check for "doubling" up insanity.

Mark, the latest patch re-inserts the commented fragment.

The "readerkey" in the original FieldCacheSanityChecker refers to the FieldCacheKey in IndexReader, which is now obsoleted by IndexReader#getIndexCache.getIndexReader(). I now use this as the reader key, even though I am still not quite sure if it is really necessary. Checking for insanity sounds so paranoid to me. If we know that there is a conceptional bug in the code, we should fix it, not circumvent and hotfix it using an insanity checker. But this is another story.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This patch is a good step forward – it associates the cache directly with IndexReader, where it belongs; it cleanly decouples cache from reader (vs the hack we have today with IndexReader.getFieldCacheKey), so that eg cloned readers can share the same cache; it also preserves back compat, which is quite a stunning accomplishment :)

But... there are many more things I don't like about FieldCache, that I'm not sure ? the patch addresses:

Some other questions about the patch:

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

My impression is that this is not much different than #1906, and we are stuck on the same issues (831 started down the path of working with these goals, but its severely out of date now).

It's a little strange that the term vectors & fields reader also got pulled into the cache?

Couldn't figure this out either ... if anything, you'd think norms might goto the cache, but the vector and fields reader ... ?

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

This patch is a good step forward - it associates the cache directly

with IndexReader, where it belongs; it cleanly decouples cache from reader (vs the hack we have today with IndexReader.getFieldCacheKey), so that eg cloned readers can share the same cache; it also preserves back compat, which is quite a stunning accomplishment

Yes, backwards compatibility was actually the driving factor for this patch. I actually have not addressed major changes in functionality. This would definitely require rework that breaks backwards compatibility. I just wanted to get rid of the static FieldCache, which caused me a lot of OOME headaches.

As I wrote above, I see this patch as a good starting point for further development. Imagine I had submitted a real, complete rework of the FieldCache like in #1906 – it would be stuck as an open issue forever just like its predecessor.

There is nothing wrong with the current patch (that's why I suggest comitting it to trunk soon-ish) – it does not break anything (it could even go into 3.1 I guess). Starting from a common codebase we can then later on extend it and address all the points you mentioned in Michael's most recent post:

But... there are many more things I don't like about FieldCache, that I'm not sure the patch addresses:

bq. (snip)

None of these (very important) issues have been addressed by the patch. Intentionally (as described).

Some other questions about the patch:

bq.

  • Consumers of the cache API (the sort comparator,

bq. FieldCacheTerms/RangeFilter, and any other future users of "the

 field cache") shouldn't have to move down into fields sub-package?

As you wish. I don't have problems with keep it in o.a.l.search. I was just a little scared about the plethora of classes in that package. Since we do not need to utilize package-level protection, it was actually possible to "encapsulate" that field-related functionality in another namespace. I just personally prefer to use many nested packages, I do not have problems of moving classes back to its parent package.

  • It's a little strange that the term vectors & fields reader also

bq. got pulled into the cache?

They are not in the FieldCache. Caching fields is only one functionality provided by IndexCache. I took the opportunity to demonstrate caching something other than fields. You now save a few bytes per SegmentReader instance because of that change. Maybe it would make sense to move SegmentReader.CoreReaders to SegmentReaderIndexCache completely. However, if I had included that as well into this issue, it would again be too large to be discussed in time for the next release.

If you have strong objections against moving these SegmentReader-specific things to the SegmentReaderIndexCache now, I can remove that part from the patch. However, I should probably then file another issue. Unfortunately, this will then depend on the outcome of this LUCENE-2133.

To summarize, I suggest:

  1. Complete the additions to src/test (i.e. do not remove/modify the existing test cases but rather add new ones, as discussed previously)
  2. Commit the patch to svn, target release for Lucene 3.1.
  3. File another issue and discuss functional improvements for the IndexFieldCache part. Use Michael's wishlist as a starting point. Agree on breaking backwards compatibility, target for Lucene 4.0, 3.5 or whatever fancy version number. Incorporate things from #1906, #2308 and also #3211.
  4. File a new issue for the improvements in SegmentReader (move things that are shared between the original reader and its clones to a common SegmentReaderIndexCache, such as CoreReader and ThreadLocals), keep backwards compatibility and target for Lucene 3.1.

What do you think?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I don't know that back compat is really a concern if we are just leaving the old API intact as part of that, with its own caching mechanism?

Just deprecate the old API, and make a new one. This is a big pain, because you have to be sure you don't straddle the two apis on upgrading, but thats the boat we will be in anyway.

Which means a new impl should provide enough benefits to make that large pain worth enduring. 831 was not committed for the same reason - it didn't bring enough to table to be worth it after we got to a per segment cache in another way. Since I don't see that this provides anything over 831, I don't see how its not in the same boat.

I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. Its anybodies guess - we should prob just do what makes sense and commit it when its ready.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

I don't know that back compat is really a concern if we are just leaving the old API intact as part of that, with its own caching mechanism?

I don't think that this is an option. FieldCache is fundamentally flawed and already creates a lot of trouble that has only been somehow fixed recently (FieldCacheKey, Insanity checks).

FieldCache, the static "registry" of caches sort of violates fundamental OOP concepts – I mean, almost all methods require IndexReader as the first parameter. Since we allow IndexReader composition and decoration this apparently was not the right way to go because it exactly caused the problems that have later been addressed by the aformentioned FieldCacheKey and insanity checks.

Just deprecate the old API, and make a new one. This is a big pain, because you have to be sure you don't straddle the two apis on upgrading, but thats the boat we will be in anyway.

That is always an option, but as you see Uwe's initial reaction I think it is not so easy to convince everybody. Which is fine, because we now have a solution that is somehow intermediate between the two extremes (keeping as it is vs. complete rework) and still is completely bw-compatible. With a "real" complete overhaul of FieldCache, I imagine a lot of additional work to be then done for other projects such as SOLR, Nutch etc., which I would rather see not to happen abruptly.

The IndexCache abstraction allows us to separate the two issues 1: "make caches non-static" and 2: "make the fieldcache snappier" very easily. We start with issue 1 here in LUCENE-2133, and then develop a new FancyFieldCache in LUCENE-xyz (which can be used in parallel to the then-old IndexFieldCache).

In parallel we can integrate Michael's, Earwin's and Yonik's suggestions from #3211 without starting the discussion from the beginning.

I am not suggesting to take the current solution as a "temporary workaround", but as a base for future work. Anything else would make no sense indeed.

Since I don't see that this provides anything over 831, I don't see how its not in the same boat.

1906 still requires a static FieldCache, the root of all evil :) It addresses orthogonal problems (ValueSource, Tries etc.).

Finally, to cite yourself from #1906: "It probably makes sense to start from one of Hoss's original patches or even from scratch."

I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. Its anybodies guess - we should prob just do what makes sense and commit it when its ready.

The more complex the patches are, the longer it will take to integrate them into a new version. The more such patches you have, the longer it will take to get to a new release.

Let's make it simple, submit what we have and build upon that.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

1906 still requires a static FieldCache, the root of all evil :)

It doesn't require one though? It supports a cache per segment reader just like this. Except its called a ValueSource.

The CacheByReaderValueSource is just there to handle a back compat issue - its something that we would want to get around and use the reader valuesource for instead - but that patch still had a long way to go.

Overall, from what I can see, the approach was about the same.

It probably makes sense to start from one of Hoss's original patches or even from scratch

That was said before a lot more work was done. The API was actually starting to shape up nicely.

The more complex the patches are, the longer it will take to integrate them into a new version.

Of course - and this is a complex issue with a lot of upgrade pain. Like with 831, it not really worth the pain to users without more benefits.

The more such patches you have, the longer it will take to get to a new release.

Thats not really true. 3.1 does't need this patch - there would be no reason to hold it for it. Patches go in when they are ready.

Let's make it simple, submit what we have and build upon that.

I dont think thats simple :) The patch can be iterated on outside of trunk as easy as in.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

bq. #1906 still requires a static FieldCache, the root of all evil

bq. It doesn't require one though? It supports a cache per segment reader just like this. Except its called a ValueSource.

With "requires" I mean it's still there, not marked as deprecated and still used. Plus a lot of "ifs" like

{{{ if(parserUninverter != null) { currentReaderValues = uninversionValueSource.getLongs(reader, field, parserUninverter); } else if(valueSource != null) { currentReaderValues = valueSource.getLongs(reader, field); } else { currentReaderValues = reader.getValueSource().getLongs(reader, field); } }}}

That is, it adds a lot of duplicated code / different possible implementations for the same thing.

I am not saying #1906 was a bad idea. And apparently, apart from the different wording, we see a few similarities with LUCENE-2133. We just need to move on at some point.

What is still different in my proposal is the additional abstraction layer of "IndexCache". Was this already somehow planned with "ValueSourceFactory"? That class exists in #1906 but was never used.

As we see from #3211 Index-specific caches are much more than FieldCache/ValueSource implementations. They should store arbitrary data, allow cache inspection, eviction of entries and so on.

bq. Let's make it simple, submit what we have and build upon that.

I dont think thats simple The patch can be iterated on outside of trunk as easy as in.

It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code it into SVN, not necessarily trunk, admittedly, a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

That is, it adds a lot of duplicated code / different possible implementations for the same thing.

Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in experimentation state - issues that need more though had hacked in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.

They should store arbitrary data, allow cache inspection, eviction of entries and so on.

Thats extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.

It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code it into SVN, not necessarily trunk, admittedly, a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option.

A branch is certainly a possibility, but with only one person working on it, I think its overkill. With some additional interest, a branch can make sense - otherwise its not worth the merging headaches. You also have to have a committer(s) thats willing to take on the merging.

At one point, 831 was much more like this patch. Discussion along what Mike brought up above started transforming it to something else. We essentially decided that unless that much was brought to the table, the disrupting change just wasn't worth it for a different cache API.

I'm def a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.

asfimport commented 14 years ago

Christian Kohlschütter (migrated from JIRA)

Mark,

that's great to hear overall. What I want to avoid are different competing implementations. Since there are indeed several people working on things that are closely related, esp. LUCENE-2135) a branch would make sense (if all agree, of course).

Can we summarize all the open points for a "worthy" cache overhaul, close #1906 in favor of LUCENE-2133 and continue working here? I would like to hear the opionions of the other people involved: Earwin, Hoss, Mike, Uwe, Yonik to name a few.

Best, Christian

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I was just a little scared about the plethora of classes in that package.

Let's separately worry about that? I think consumers of the cache API shouldn't have to move.

Caching fields is only one functionality provided by IndexCache. I took the opportunity to demonstrate caching something other than fields. You now save a few bytes per SegmentReader instance because of that change.

OK but let's also do this separately (Earwin I think is working on a patch for componentizing SR, which'd be great). FieldCache, "just" one of IndexReaders components, is hard enough to fix; let's just focus on it :)

Also, as hackish as it is, the getFieldCacheKey() works well. The need for insanity checking is annoying, so let's just disallow getting cached @ Dir/MultiReader level (and offer sugar APIs if really needed).

And as annoying as the static FieldCache is, #3211 plugs yet another hold in the dike, ie, now entries are purged on close, and you can also explicitly purge a given IndexReader.

I don't think we should wholesale move the cache APIs just for the sake of moving them. That's alot of API noise; we need some real improvements to justify making users switch.

Since there are indeed several people working on things that are closely related, esp. LUCENE-2135) a branch would make sense

Can we summarize all the open points for a "worthy" cache overhaul, close #1906 in favor of LUCENE-2133 and continue working here?

I don't think we need a branch just yet. (LUCENE-2135 is a very standalone change from this issue). And, before closing other issues, committing this one as a start, etc, we really first need to reach some consensus on what even to do, here.

I really want to see us first fix FieldCache's deep problems, then concern ourselves with where the resulting API will go / what it looks like. Ie, we're putting the horse before the cart, here, so far...

EG, what [minimal] changes could we make to allow someone to plugin their own values source? If we make this:

class FieldValues {
  public FIELD_TYPE getType();
  public int[] getInts() {};
 // and all the other types...
}

class FieldValuesSource implements Closeable {
  public FieldValues getValues(FieldInfo field);
  public void close();
}

We could then make an UninverterFieldValuesSource, subclassing PerFieldValueSource, that takes IndexReader to its ctor, lets your register parsers by field. And it'd allow customization beyond simply changing the parser, eg to control behavior of multi-valued fields, stopping early (NRQ does this), etc.

We'd have CachedFieldValueSource that'd wrap any FieldValuesSource and cache, allowing you to subclass it if you want to do eviction, etc. It'd by default implement the same simplistic policy we have today – retain everything, until close at which point you discard everything.

All consumers of field cache (sorting by field, FieldCacheRange/TermsFilter, function queries, etc.) should all allow me to pass in my own FieldValuesSource. IndexReader would let me customize, but would default to the cached uninversion source.

I could then in theory [external to Lucene] make a FieldValueSource that slurped stuff from disk, and I could create #2308 (= CSF) outside of Lucene. [And, when we build CSF inside of Lucene it'd also just be another source].

Something along these lines maybe....?

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

I would like to hear the opionions of the other people involved

bq. .... Earwin I think is working on a patch for componentizing SR ..... FieldCache, "just" one of IndexReaders components ......

Mike answered for me :) My wish is to keep IR's as basic as possible, while plugging all extras (that includes sorting/whatever caches) on as-needed basis. Right now I have a basic impl that works for me for half a year, but in practice it ended up a bit ugly and it doesn't handle IW-produced readers (never needed/liked this feature). So I hope to fix these two deficiencies on a weekend.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Something along these lines maybe....?

And we are back to 831 :)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

And we are back to 831

Yes :) But maybe the fresh perspective will get us through it!