apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Complete overhaul of FieldCache API/Implementation [LUCENE-831] #1906

Open asfimport opened 18 years ago

asfimport commented 18 years ago

Motivation:

1) Complete overhaul of the API/implementation of "FieldCache" type things...
  a) eliminate the global static map keyed on IndexReader (thus eliminating the synch block between completely independent IndexReaders)
  b) allow more customization of cache management (ie: use expiration/replacement strategies, disk-backed caches, etc)
  c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
  d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed.
  e) lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders).

2) Provide backwards compatibility to support the existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API.
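The per-reader ownership in item (a) might look something like this minimal sketch. FakeReader and the key's body are invented stand-ins so the sketch runs; only the shape - a cache owned by each reader, keyed by value-producing CacheKeys whose key set can be inspected for warming - comes from the proposal:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-reader cache this issue proposes: instead of a
// global static Map keyed on IndexReader, each reader owns its own cache keyed
// by a CacheKey that also knows how to compute its value (custom parsers,
// complex types, etc). Class names follow the issue's terminology; the bodies
// are illustrative only.
public class PerReaderCacheSketch {

    // A CacheKey identifies one piece of cached, reader-derived data and
    // carries the logic to (re)build it.
    interface CacheKey<T> {
        T create(FakeReader reader);
    }

    // Stand-in for an IndexReader segment; only here so the sketch runs.
    static class FakeReader {
        final Map<CacheKey<?>, Object> cache = new HashMap<>();

        @SuppressWarnings("unchecked")
        <T> T getCachedData(CacheKey<T> key) {
            // No global synchronization: independent readers never contend.
            return (T) cache.computeIfAbsent(key, k -> ((CacheKey<T>) k).create(this));
        }
    }

    // Example key: an int[] of per-doc values (contents invented).
    static final CacheKey<int[]> INT_FIELD = r -> new int[] {1, 2, 3};

    public static void main(String[] args) {
        FakeReader a = new FakeReader();
        FakeReader b = new FakeReader();
        int[] fromA = a.getCachedData(INT_FIELD);
        // Same key, same reader -> same cached instance; different reader -> its own copy.
        System.out.println(fromA == a.getCachedData(INT_FIELD)); // true
        System.out.println(fromA == b.getCachedData(INT_FIELD)); // false
        // The key set can be inspected to warm a new reader (item d).
        System.out.println(a.cache.keySet().size()); // 1
    }
}
```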


Migrated from LUCENE-831 by Chris M. Hostetter (@hossman), 5 votes, updated May 09 2016 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff (versions: 2), LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch (versions: 14), LUCENE-831-trieimpl.patch Linked issues:

asfimport commented 18 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

This is based on an idea i had a few months ago and was recently reminded of because of several mail threads about FieldCache .. so i started fleshing it out on the plane last week.

I'm not entirely happy with it in its current state, but I wanted to post it and see what people think of the overall approach.

if people like the direction this is going, I would definitely appreciate some help with API critique and good unit tests (so far i've been relying solely on the existing Unit Tests to validate that i'm not breaking anything – but that doesn't really prove that the new APIs work the way they are intended to)

TODO List

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

I haven't looked at the patch yet. However, I do know that a colleague of mine is about to start porting some FieldCache-based Filter stuff to Lucene's trunk. Because that work may conflict with Hoss' changes here, we should see if this gets applied first.

asfimport commented 18 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

i haven't had any time to do further work on this issue ... partly because i haven't had a lot of time, but mainly because i'm hoping to get some feedback on the overall approach before any more serious effort investment.

updated patch to work against the trunk (r544035)

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think this patch is great. Not only does it make all of the sort caching logic much easier to decipher, being able to throw in a sophisticated cache manager like ehcache (which can let caches overflow to disk) could end up being pretty powerful. I am also very interested in the possibility of pre-warming a Reader.

I will spend some time playing with what is here so far.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

thanks for the feedback mark ... i honestly haven't looked at this patch since the last time i updated the issue ... (i'm not sure if i've even thought about it once since then). it's the kind of thing that seemed really cool and important at the time, but then ... you know, other things come up.

by all means, feel free to update it.

as i recall, the biggest thing about this patch that was really just pie in the sky and may not make any sense is the whole concept of merging and letting subreaders of MultiReader do their own caching which could then percolate up. I did it on the assumption that it would come in handy when reopening an IndexReader that contains several segments – many of which may not have changed since the last time you opened the index. but i really didn't have any idea how the whole reopening thing would work. i see now there is some reopen code in #1818, but frankly i'm still not sure whether the API makes sense, or is total overkill.

it might be better to gut the merging logic from the patch and add it later if/when there becomes a more real use case for it (the existing mergeData and isMergable methods could always be re-added to the abstract base classes if it turns out they do make sense)

asfimport commented 17 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I did some extensive tests with Lucene 2.3 today and was surprised that IndexReader.reopen() in combination with FieldCache/sorting does not bring a performance increase.

Currently, even if you reopen an index using IndexReader.reopen(), you will get a new IndexReader instance which is not available in the default FieldCache. Because of that, all values from the cached fields must be reloaded. As reopen() only opens new/changed segments, a FieldCache implementation directly embedded into IndexReader as proposed by this issue would help here. Each segment would have its own FieldCache and reopening would be quite fast.

Now that reopen is possible with Lucene 2.3, what about this issue? Would this be possible for 2.4?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Now that reopen is possible with Lucene 2.3, what about this issue? Would this be possible for 2.4?

Yeah, this is a limitation currently of reopen(). I'm planning to work on it after the 2.3 release is out and #1662 is committed!

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I spent a little time getting this patch somewhat updated to the trunk and running various benchmarks with reopen. As expected, the sequence of searching a large index with a sort, adding a few docs, and then reopening a Reader to perform a sorted search can be tens of times faster.

I think the new API is great too - I really like being able to experiment with other caching strategies. I find the code easier to follow as well.

Did you have any ideas for merging a String index? That is something that still needs to be done...

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Patch roughly moves things forward.

I've added stuff for Long, Short, and Byte parsing, changed getAuto to a static CacheKey method, switched core Lucene from using the FieldCache API to the new API, added some javadoc, and roughly have things going with the reopen method.

In short, still a lot to do, but this advances things enough so that you can apply it to trunk and check out the goodness that this can bring to sorting and reopen.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Mark:

I haven't looked at this issue or any of the code in my patch since my last updated, nor have i had a chance to look at your updated patch, but a few comments...

1) you rock.

2) I have no idea what i had in mind for dealing with StringIndex when i said "a bit complicated, but should be possible/efficient". I do distinctly remember thinking that we should add support for just getting the string indexes (ie: the current int[numDocs]) for when you don't care what the strings are, just the sort order, or just getting a String[numDocs] when you aren't doing sorting and just want an "inverted inverted index" on the field ... but obviously we still need to support the current StringIndex (it's used in MultiSearcher, right?)

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Right, I think it's used in MultiSearcher and ParallelMultiSearcher, and I believe it's because you cannot compare simple ord arrays across Searchers, so you need the original sort term?

I haven't been able to come up with anything that would be very efficient for merging a StringIndex, but I have not thought too much about it yet. Anyone out there have any ideas? Fastest way to merge two String[], each with an int[] indexing into the String[]? Is there a really fast way?

I agree that it would be nice to skip the String[] if a MultiSearcher was not being used.

I'll keep playing with it
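For reference, the merge being asked about can be done in linear time by unioning the two sorted lookup arrays and remapping each segment's ords into the merged term space. A self-contained sketch (the StringIndex class here is a simplified stand-in for Lucene's, and the data is invented):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: each segment has a StringIndex (a sorted String[] "lookup" plus an
// int[] "order" mapping doc -> position in lookup). Merging = union the sorted
// lookups and remap each segment's ords. O(total terms + total docs).
public class StringIndexMergeSketch {

    static class StringIndex {
        final String[] lookup; // sorted, distinct values
        final int[] order;     // per-doc index into lookup
        StringIndex(String[] lookup, int[] order) { this.lookup = lookup; this.order = order; }
    }

    static StringIndex merge(StringIndex a, StringIndex b) {
        List<String> merged = new ArrayList<>();
        int[] remapA = new int[a.lookup.length];
        int[] remapB = new int[b.lookup.length];
        int i = 0, j = 0;
        while (i < a.lookup.length || j < b.lookup.length) {
            int cmp;
            if (i == a.lookup.length) cmp = 1;
            else if (j == b.lookup.length) cmp = -1;
            else cmp = a.lookup[i].compareTo(b.lookup[j]);
            if (cmp <= 0) remapA[i++] = merged.size();
            if (cmp >= 0) remapB[j++] = merged.size();
            merged.add(cmp <= 0 ? a.lookup[i - 1] : b.lookup[j - 1]);
        }
        // Concatenate the doc->ord arrays, remapping into the merged term space.
        int[] order = new int[a.order.length + b.order.length];
        for (int d = 0; d < a.order.length; d++) order[d] = remapA[a.order[d]];
        for (int d = 0; d < b.order.length; d++) order[a.order.length + d] = remapB[b.order[d]];
        return new StringIndex(merged.toArray(new String[0]), order);
    }

    public static void main(String[] args) {
        StringIndex a = new StringIndex(new String[]{"apple", "pear"}, new int[]{1, 0});
        StringIndex b = new StringIndex(new String[]{"apple", "zoo"}, new int[]{0, 1});
        StringIndex m = merge(a, b);
        System.out.println(Arrays.toString(m.lookup)); // [apple, pear, zoo]
        System.out.println(Arrays.toString(m.order));  // [1, 0, 0, 2]
    }
}
```

Linear in the data, but it still has to copy everything - which is Yonik's point below that the original String values must be available to do it at all.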

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> I agree that it would be nice to skip the String[] if a MultiSearcher was not being used.

If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) values.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One question here: should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc?

If we did that, then MultiSearchers would forward the request to the right IndexReader.

The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments. And this way the old reader & the new one would not double-allocate the RAM required to hold the common parts of the cache.

We could always still provide a "give me the full array" fallback if people really wanted that (and were willing to accept the cost).
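The "method call instead of a straight array" idea might look like this sketch (class names like IntData are assumptions, not the patch's actual classes): each segment keeps its own small array, and a multi-reader view dispatches get(doc) to the owning segment, so reopen never materializes one index-wide array.

```java
import java.util.Arrays;

// Sketch of per-segment storage behind a method-call API. MultiIntData finds
// the owning segment with a binary search over segment start docIDs.
public class IntDataSketch {

    abstract static class IntData {
        abstract int get(int docID);
    }

    static class SegmentIntData extends IntData {
        final int[] values; // bound to one "leaf" reader
        SegmentIntData(int[] values) { this.values = values; }
        int get(int docID) { return values[docID]; }
    }

    static class MultiIntData extends IntData {
        final IntData[] subs;
        final int[] starts; // starts[i] = first global docID of sub i
        MultiIntData(IntData[] subs, int[] starts) { this.subs = subs; this.starts = starts; }
        int get(int docID) {
            int idx = Arrays.binarySearch(starts, docID);
            if (idx < 0) idx = -idx - 2; // insertion point -> preceding segment
            return subs[idx].get(docID - starts[idx]);
        }
    }

    public static void main(String[] args) {
        IntData multi = new MultiIntData(
            new IntData[] { new SegmentIntData(new int[]{10, 11}),
                            new SegmentIntData(new int[]{20, 21, 22}) },
            new int[]{0, 2});
        System.out.println(multi.get(1)); // 11 (segment 0, local doc 1)
        System.out.println(multi.get(4)); // 22 (segment 1, local doc 2)
    }
}
```

On reopen, only the SegmentIntData instances for new/changed segments would need rebuilding; unchanged ones carry over, which is exactly the cost profile Mike describes.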

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments.

I agree, Mike. Currently during reopen() the MultiSegmentReader allocates a new norms array with size maxDoc(), which is, as you said, inefficient if only some (maybe even small) segments changed.

The method call might be a little slower than the array lookup, but I doubt that this would be very significant. We can make this change for the norms and run performance tests to measure the slowdown.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) values.

That is a great point.

>should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc?

Sounds like a great idea to me. Solves the StringIndex merge and eliminates all merge costs at the price of a method call per access.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hmm... how do we avoid having to pull the cached field values through a sync on every call? The field data has to be cached... and the method that returns the single cached field value has to be thread-safe...

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think if we can finally move to having read-only IndexReaders then they would not sync on this method?

Also, we should still provide the "give me the full array as of right now" fallback which in a read/write usage would allow you to spend lots of RAM in order to not synchronize. Of course you'd also have to update your array (or, periodically ask for a new one) if you are altering fields.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here is a quick proof-of-concept type patch for using a method call rather than arrays. The speedups pertain to reopen.

In my quick test of 'open 500000 tiny docs index, repeat(3): add couple docs/sort search' the total time taken was:

Orig FieldCache impl: 27 seconds
New impl with arrays: 12 seconds
New impl with method call: 3 seconds

It's kind of a worst-case scenario, but much faster is much faster <g>. The bench does not push through the point where method 3 would have to reload all of the segments, so that would affect it some... but method one is reloading all of the segments every single time...

This approach keeps the original approach for those that want to use the arrays. In that case everything still merges except for the StringIndex, so String sorting is slow. Lucene core is rigged to use the new method call though, so String sort is as sped up as the other field types when not using the arrays.

Not sure everything is completely on the level yet, but all core tests pass (core sort tests can miss a lot).

I lied about changing all core to use the new api... I haven't changed the function package yet.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Another push forward:

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

- Further refinements and code changes.
- Fixed ord comparisons across IndexReaders, as Yonik pointed out. Standard sort tests didn't catch it and I missed it even with the reminder.
- Added a bunch of CacheKey unit tests.

asfimport commented 16 years ago

Michael Busch (migrated from JIRA)

I'm not sure how soon I'll have time to work on this; I don't want to block progress here, so I'm unassigning it for now in case someone else has time.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Brings this patch back up to trunk level.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Deprecating the function package is problematic. Doing it by method leaves the classes open to issues maintaining two possible states that could be mixed - the FieldCache and parsers are used as method params - you can't really replace the old internal representation with the new one. You really have to support the old method and new method and tell people not to use both or something. Doing it by class means duplicating half a dozen classes with new names. Neither is very satisfying. This stuff is all marked experimental and subject to change though - could it just be done clean sweep style? A little abrupt but...experimental is experimental <g>

asfimport commented 16 years ago

Fuad Efendi (migrated from JIRA)

Would be nice to have TermVectorCache (if term vectors are stored in the index)

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

That patch may have a goof, I'll peel off another soon. Unfortunately, it looks like this is slower than the orig implementation. I have to run more benchmarks, but it might even be in the 10-20% range. My guess is that it's the method calls - you may gain much faster reopen, but you appear to lose quite a bit on standard sort speed...

Could give the choice of going with the arrays, if they prove as fast as orig, but then back to needing an efficient StringIndex merge...

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I've got the function package happily deprecated with replacement methods.

I am going to ballpark the sort speed with method calls to be about 10% slower, but there are a lot of variables and I am still benchmarking.

I've added a check for a system property so that by default, sorting will still use primitive arrays, but if you want to pay the small sort cost you can turn on the method calls and get faster reopen without cache merging.

So that wraps up everything I plan to do unless comments, criticisms bring up anything else.

The one piece that is missing and should be addressed is an efficient merge for the StringIndex.

Patch to follow.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here is the patch - plenty to look over.

The method access might not be as much slower as I thought - might be closer to a couple to 5% than the 10% I first guessed.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> change norm caching to use new caches (if not the same main cache, then at least the same classes in a private cache)

What benefit do you see to this? Does it offer anything over the simple map used now?

asfimport commented 16 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

What benefit do you see to this? Does it offer anything over the simple map used now?

I think what i meant there was that if the norm caching used the same functionality then it could simplify the code, and norms would be optimized in the reopen case as well ... plus custom Cache impls could report stats on norm usage (just like anything else) so people could see when norms were getting used for fields they didn't expect them to be.

But as i've said before ... most of my early comments were pie in the sky theoretical brainstorming comments – don't read too much into them if they don't really make sense.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Updated to trunk.

Took out an optimization in the last patch - it was using ordinals rather than strings to sort when the Reader was not multi - didn't like the isMulti on IndexReader though, so this patch and the last don't have it.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

This is missing a good way to manage the caching of the field values per CacheKey (allowing for a more fine-grained weak or disk strategy, for example). That type of stuff would have to be built into the CacheKey currently - so to add a new strategy, you'd have to implement a lot of new CacheKeys and multiply your options to manage... for every int/long/byte/etc array and object array you would have to implement a new CacheKey for a weakref solution. Is there an API that can overcome this?

The current cache that you can easily override, CacheKey to data, is pretty coarse... I am not sure how many helpful things you can do easily.

asfimport commented 16 years ago

Alex Vigdor (migrated from JIRA)

Another useful feature that seems like it would be pretty easy to implement on top of this patch is cache warming; that is, the ability to reopen an IndexReader and repopulate its caches before making it "live". The main thing missing is a Cache.getKeys() method which could be used to discover what caches are already in use, in order to repopulate them after reopening the IndexReader. This cache warming could be performed externally to the IndexReader (by calling getCacheData for each of the keys after reopening), or perhaps the reopen method could be overloaded with a boolean "reloadCaches" to perform this in the same method call.

The rationale for this I hope is clear; my application, like many I'm sure, keeps a single IndexReader for handling searches, and in a separate thread from search request handling commits writes and reopens the IndexReader before replacing the IndexReader reference for new searches (reference counting is then used to eventually close the old reader instances). It would be ideal for that separate thread to also bear the expense of cache warming; even with this patch against our 4 GB indices, along with -Duse.object.array.sort=true, a search request coming immediately after reopening the index will pause 20-40 seconds while the caches refill. Preferably that could be done in the background and request handling threads would never be slowed down.
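The warming flow being requested could be sketched like this. The reader and key types below are stand-ins so the sketch runs; Cache.getKeys()/getCacheData are the methods this comment asks for, not an API Lucene ships:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of background cache warming: discover which keys the live reader has
// populated, reopen, fill the same keys on the new reader off the search path,
// and only then swap the new reader in for searches.
public class WarmingSketch {

    static class Reader {
        final Map<String, int[]> cache = new HashMap<>();
        int[] getCacheData(String key) {
            // Stand-in for the expensive uninversion of a field.
            return cache.computeIfAbsent(key, k -> new int[]{42});
        }
        Set<String> getKeys() { return cache.keySet(); }
    }

    static volatile Reader live = new Reader();

    public static void main(String[] args) {
        live.getCacheData("sortField"); // searches have warmed one key
        Reader reopened = new Reader(); // stand-in for IndexReader.reopen()
        for (String key : live.getKeys()) {
            reopened.getCacheData(key);  // warming thread pays the fill cost
        }
        live = reopened;                 // searchers now see a warm reader
        System.out.println(live.cache.containsKey("sortField")); // true
    }
}
```

In a real application the warming loop would run in the same background thread that commits writes and reopens the reader, so search threads never see a cold cache.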

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Also - your reopen time will vary greatly depending on how many segments need to be reopened and what size they are...so I think a lower merge factor would help, while hurting overall search speed of course.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

You've tried the patch? Awesome!

How long did it take to fill the cache before the patch?

I agree that we should be as friendly to warming as we can be. Keep in mind that you can warm the reader yourself by issuing a sorted search before putting the reader into duty - of course you don't get to warm from RAM like with what you suggest.

Keep that feedback coming. I've been building momentum on coming back to this issue, unless a commiter beats me to it.

asfimport commented 16 years ago

Alex Vigdor (migrated from JIRA)

To be honest, the cache never successfully refilled before the patch - or at least I gave up after waiting 10 minutes. I was about to give up on sorting. It could have to do with the fact that we're running with a relatively modest amount of RAM (768M) given our index size. But with the patch at least sorting is a realistic option!

I will look at adding the warming to my own code as you suggest; it is another peculiarity of this project that I can't know in the code what fields will be used for sorting, but I'll just track the searches coming through and aggregate any sorts they perform into a warming query.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

i haven't had any time to do further work on this issue ... partly because i haven't had a lot of time, but mainly because i'm hoping to get some feedback on the overall approach before any more serious effort investment.

Where's that investment, Hoss? You've orphaned your baby. There is a fairly decent amount of feedback here.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think this would actually be better if all CacheKey types had to implement both ObjectArray access as well as primitive array access. Makes the code cleaner and cuts down on the CacheKey explosion. Should have done it this way to start, but couldn't see the forest for the trees back then i suppose.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Updated to trunk.

I've combined all of the dual (primitive array/ObjectArray) CacheKeys into one. Each cache key can support both modes or throw UnsupportedOperationException or something.

I've also tried something a bit experimental to allow users to eventually use custom or alternate CacheKeys (payload or sparse arrays or something) that work with internal sorting. A cache implementation can now supply a ComparatorFactory (the name will probably be tweaked) that handles creating comparators. You can subclass ComparatorFactory and add new or override currently supported CacheKeys.

CustomComparators still needs to be twiddled with some.

I've converted some of the sort tests to run with both primitive and object arrays as well.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Couple of needed tweaks and a test for a custom ComparatorFactory.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> change norm caching to use new caches (if not the same main cache, then at least the same classes in a private cache)

I think we could go even further, and [eventually] change norms to use an iterator API, which'd also have the same benefit of not requiring costly materialization of a full byte[] array for every doc in the index (ie, reopen() cost would be in proportion to changed segments not total index size).

Likewise field cache / stored fields / column stride fields could eventually open up an iterator API as well. This API would be useful if eg in a custom HitCollector you wanted to look at a field's value in order to do custom filtering/scoring.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

[Note: my understanding of this area in general, and this patch in particular, is still rather spotty... so please correct my misconceptions in what follows...]

This change is a great improvement, since the cache management would be per-IndexReader, and more public so that you could see what's cached, access the cache via the reader, swap in your own cache management, etc.

But I'm concerned, because this change continues the "materialize massive array for entire index" approach, which is the major remaining cost when (re)opening readers. EG, isMergable()/mergeData() methods build up the whole array from sub readers.

What would it take to never require materializing the full array for the index, for Lucene's internal purposes (external users may continue to do so if they want)? Ie, leave the array bound to the "leaf" IndexReader (ie, SegmentReader). It was briefly touched on here:

https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650964#action_12650964

I realize this is a big change, but I think we need to get there eventually.

EG I can see in this patch that MultiReader & MultiSegmentReader do expose a CacheData that has get and get2 (why do we have get2?) that delegate to child readers, which is good, but it's not good that they return Object (requires casting for every lookup). We don't have per-atomic-type variants? Couldn't we expose eg an IntData class (and all other types) that has int get(docID) abstract method, that delegate to child readers? (I'm also generally confused by why we have the per-atomic-type switching happening in CacheKey subclasses and not CacheData.)

Then... and probably the hardest thing to fix here: for all the comparators we now materialize the full array. I realize we use the full array when sorting during a search of an IndexSearcher(MultiReader(...)), because FieldSortedHitQueue is called for every doc visited and must be able to quickly make its comparison.

However, stepping back, this is poor approach. We should instead be doing what MultiSearcher does, which is gather top results per-sub-reader, and then merge-sort the results. At that point, to do the merge, we only need actual field values for those docs in the top N.

If we could fix field-sorting like that (and I'm hazy on exactly how to do so), I think Lucene internally would then never need the full array?
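The merge-sort approach can be sketched roughly like this (Hit, topN and the data are invented for illustration; this is not the patch's API): each segment produces its own top N using segment-local values, and only the few surviving hits need their real String values for the final merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: collect top N per sub-reader, then merge-sort the per-segment
// winners by actual field value. No index-wide array, no ord merging.
public class PerSegmentSortSketch {

    static class Hit {
        final int doc;
        final String value; // the sort field's value for this hit
        Hit(int doc, String value) { this.doc = doc; this.value = value; }
    }

    // Top n of one segment, sorted by field value (purely segment-local work).
    static List<Hit> topN(List<Hit> segmentHits, int n) {
        List<Hit> sorted = new ArrayList<>(segmentHits);
        sorted.sort(Comparator.comparing((Hit h) -> h.value));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        List<Hit> seg0 = List.of(new Hit(0, "pear"), new Hit(1, "apple"));
        List<Hit> seg1 = List.of(new Hit(2, "banana"), new Hit(3, "zoo"));
        int n = 2;
        // Merge step: combine the per-segment winners, keep the global top n.
        List<Hit> merged = new ArrayList<>();
        merged.addAll(topN(seg0, n));
        merged.addAll(topN(seg1, n));
        merged.sort(Comparator.comparing((Hit h) -> h.value));
        List<Hit> global = merged.subList(0, n);
        System.out.println(global.get(0).value); // apple
        System.out.println(global.get(1).value); // banana
    }
}
```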

This change also adds USE_OA_SORT, which is scary to me because Object overhead per doc can be exceptionally costly. Why do we need to even offer that?

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One more thing here... while random-access lookup of a field's value via MultiReader dispatch requires the binary search to find the right sub-reader, I think in most internal uses, the access could be switched to an iterator instead, in which case the lookup should be far faster.

EG when sorting by field, we could pull say an IntData iterator from the reader, and then access the int values in docID order as we visit the docs.

For norms, which we should eventually switch to FieldCache + column-stride fields, it would be the same story.

Accessing via iterator should go a long ways to reducing the overhead of "using a method" instead of accessing the full array directly.
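A compact sketch of that iterator-style access (all names invented for illustration): a consumer visiting docs in docID order walks each segment's array linearly instead of binary-searching for the owning segment on every call.

```java
// Sketch: linear, docID-order access across per-segment arrays, as a collector
// or norms consumer would use it. No per-call segment lookup.
public class IntIteratorSketch {

    static class IntIterator {
        private final int[][] segments; // one value array per segment
        private int seg = 0, local = -1;
        IntIterator(int[][] segments) { this.segments = segments; }

        // Returns the value for the next docID, advancing linearly.
        int next() {
            local++;
            while (local == segments[seg].length) { seg++; local = 0; }
            return segments[seg][local];
        }
    }

    public static void main(String[] args) {
        IntIterator it = new IntIterator(new int[][]{{10, 11}, {20}});
        System.out.println(it.next()); // 10
        System.out.println(it.next()); // 11
        System.out.println(it.next()); // 20
    }
}
```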

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Ah, the dirty secret of 831 - there is plenty more to do :) I've been pushing it down the path, but I've expected radical changes to be needed before it goes in.

But I'm concerned, because this change continues the "materialize massive array for entire index" approach, which is the major remaining cost when (re)opening readers. EG, isMergable()/mergeData() methods build up the whole array from sub readers.

Originally, 3.0 wasn't so close, so there was more concern with back compatibility than there might be now. I think the method call will be a slight slowdown no matter what as well... even with an iterator approach. Perhaps other "wins" will make up for it though. It's certainly cleaner to support 'one' mode.

What would it take to never require materializing the full array for the index, for Lucene's internal purposes (external users may continue to do so if they want)? Ie, leave the array bound to the "leaf" IndexReader (ie, SegmentReader).

I'm not sure I fully understand yet. If you use the ObjectArray mode, this is what happens right? Each sub array is bound to the IndexReader and MultiReader will distribute the requests to the right subreader. Only if you use the primitive arrays and merging do you get the full arrays (when not using USE_OA_SORT).

I realize this is a big change, but I think we need to get there eventually.

Sounds good to me.

(why do we have get2?)

Because a StringIndex needs to access both the array of Strings and a second array indexing into that. None of the other types need to access two arrays.

Couldn't we expose eg an IntData class (and all other types) that has int get(docID) abstract method, that delegate to child readers?

Yeah, I think this would be possible. If casting does indeed cost so much, this may bring things closer to the primitive array speed.

I'm also generally confused by why we have the per-atomic-type switching happening in CacheKey subclasses and not CacheData.

From Hoss' original design. What are your concerns here? The right key gets you the right data :) I've actually mulled this over some, but it's too early in the morning to remember I suppose. I'll look at it some more.

If we could fix field-sorting like that (and I'm hazy on exactly how to do so), I think Lucene internally would then never need the full array?

That would be cool. Again, I'll try to explore in this direction. It doesn't need the full array when using the ObjectArray stuff now though (well, it kind of does, just split up over the readers).

This change also adds USE_OA_SORT, which is scary to me because Object overhead per doc can be exceptionally costly. Why do we need to even offer that?

All this does at the moment (and I hate system properties, but for the moment, that's what's working) is switch between using the primitive arrays and merging, or using the distributed ObjectArray for internal sorting. It defaults to using the primitive arrays and merging because it's 5-10% faster than using the ObjectArrays. The ObjectArray approach is just an ObjectArray backed by an array for each Reader - a MultiReader distributes requests for a doc field to the right Reader's ObjectArray.

To your second comment... I'm going to have to spend some more time :)

No worries though, this is very much a work in progress. I'd love to have it in by 3.0 though. Glad to see someone else taking more of an interest - very hard for me to find the time to dig into it all that often. I'll work with the code some as I can, thinking more about your comments, and perhaps I can come up with some better responses/ideas.

asfimport commented 16 years ago

Robert Newson (@rnewson) (migrated from JIRA)

This enhancement is particularly interesting to me (and the application my team is building). I'm not sure how much time I can donate but since I'd likely have to enhance this area of Lucene for our app anyway and it would be better to have it in core, I'd like to help out where I can.

The patch applies more or less cleanly against 2.4.0, but not trunk, btw. Is it possible to get the patch committed to a feature branch off of trunk perhaps?

Finally, I'm most interested in the ability to make disk-backed caches. A very quick attempt to put the cache into JDBM failed as the CacheKey classes are not Comparable, which seems necessary for most kinds of disk lookup structures. SimpleMapCache uses a HashMap, which just needs equals/hashcode methods.

The other benefit to this approach is that it allows the data structures needed for sorting to be reused for range filtering. My application needs both, though on numeric field types (dates predominantly).

Finally, this might also be a good time to add first class support for non-String field types. The formatting that NumberTools supplies is incompatible with SortField (the former outputs in base 36, the latter parses with parseLong), so there's clearly been several approaches to the general problem. In my case, I wrote a new Document class with addLong(name, value), etc.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hmmm - not sure what is up. There is already one small conflict for me (trunk is a rapidly changing target :) ), but its a pretty simple conflict.

There are revision number issues (I was connected to an older revision apparently). If that's the problem, try this patch (which also resolves the new simple conflict).

asfimport commented 16 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Maybe every assignee should tag their sorting-related issues as related (or similar) to this one. I am thinking about the latest developments in #2552 and #2555

asfimport commented 16 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Maybe we need two trunks or branches or whatever :-). One for 2.9 and one for 3.0. This is a typical example for that in my opinion.

asfimport commented 16 years ago

Robert Newson (@rnewson) (migrated from JIRA)

The conflict was easy to resolve, it was just an FYI, I appreciate trunk changes rapidly. I was just wondering if a feature branch would make synchronization easier.

Some meta-task to combine or track these things would be great. I hit the problem that #2552 describes. I'd previously indexed numeric values with NumberTools, which gives String-based sorting, the most memory-intensive one.

It seems, with this field cache approach and the recent FieldCacheRangeFilter on trunk, that Lucene has a robust and coherent answer to performing efficient sorting and range filtering for float, double, short, int and long values; perhaps it's time to enhance Document. That might cut down the size of the API, which in turn makes it easier to test and tune. Document could preclude tokenization for such fields; I suspect I'm not the only one to build a type-safe replacement for Document.

For what it's worth, I'm currently indexing longs using String.format("%019d") and treating dates as longs (getTime()), coupled with a long[] version of FieldCacheRangeFilter. It achieves a similar goal to this task: the long[] used for sorting is the same as for range filtering.
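For non-negative longs, that zero-padded encoding works because left-padding to a fixed 19 digits makes lexicographic order match numeric order, so one indexed string serves both sorting and range filtering. A minimal sketch (negative values would need an offset or sign trick, not shown here):

```java
// Sketch of the %019d encoding: fixed-width decimal keeps string order equal
// to numeric order for non-negative longs (Long.MAX_VALUE has 19 digits).
public class PaddedLongSketch {

    static String encode(long v) {
        return String.format("%019d", v); // assumes v >= 0
    }

    public static void main(String[] args) {
        String a = encode(42L), b = encode(1_000L);
        System.out.println(a.length());         // 19
        System.out.println(a.compareTo(b) < 0); // true: 42 < 1000 survives as string order
    }
}
```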

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It seems, with this field cache approach and the recent FieldCacheRangeFilter on trunk, that Lucene has a robust and coherent answer to performing efficient sorting and range filtering for float, double, short, int and long values; perhaps it's time to enhance Document. That might cut down the size of the API, which in turn makes it easier to test and tune. Document could preclude tokenization for such fields; I suspect I'm not the only one to build a type-safe replacement for Document.

This is an interesting idea. Say we create IntField, a subclass of Field. It could directly accept a single int value and not accept tokenization options. It could assert "not null", if the field wanted that. FieldInfo could store that it's an int and expose more strongly typed APIs from IndexReader.document as well. If in the future we enable Term to be things-other-than-String, we could do the right thing with typed fields. Etc....

asfimport commented 16 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This is an interesting idea. Say we create IntField, a subclass of Field. It could directly accept a single int value and not accept tokenization options. It could assert "not null", if the field wanted that. FieldInfo could store that it's an int and expose more strongly typed APIs from IndexReader.document as well. If in the future we enable Term to be things-other-than-String, we could do the right thing with typed fields. Etc....

Maybe this new Document class could also manage the encoding of these fields to the index format. With that it would be possible to extend Document to automatically use my trie-based encoding for storing the raw term values. On the other hand, RangeQuery would be aware of the field encoding (from field metadata) and could switch dynamically to the correct search/sort algorithm. Great!