
FieldCache introspection API [LUCENE-1749] #2823

Closed by asfimport 14 years ago

asfimport commented 14 years ago

FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything "odd"...


Migrated from LUCENE-1749 by Chris M. Hostetter (@hossman), 1 vote, resolved Aug 12 2009.
Attachments: fieldcache-introspection.patch, LUCENE-1749.patch (versions: 16), LUCENE-1749-hossfork.patch

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

The motivation for this issue is all of the changes coming in 2.9 in how Lucene internally uses the FieldCache API – the biggest change being per-segment sorting, but there may be others that are not immediately obvious.

While these changes are backwards compatible from an API and functionality perspective, they could have some pretty serious performance impacts for existing apps that also use the FieldCache directly: after upgrading, such apps may suddenly seem slower to start (because of redundant FieldCache initialization) and require twice as much RAM as they did before. This could lead people to assume Lucene has suddenly become a major memory hog. SOLR-1111 and SOLR-1247 are some quick examples of the types of problems that apps could encounter.

Currently the only way for a user to even notice the problem is to do memory profiling, and the FieldCache data structure isn't the easiest to understand. It would be a lot nicer to have some methods for doing this inspection programmatically, so users could write automated tests for incorrect/redundant usage.
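
For illustration, programmatic inspection could look roughly like this. This is a hedged sketch: FieldCache.DEFAULT.getCacheEntries() and the CacheEntry accessors mirror what the patches on this issue propose, but treat the exact names and signatures as illustrative, not final.

    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.FieldCache.CacheEntry;

    // Dump everything currently held by the default FieldCache.
    // The accessors below are assumptions based on the proposed API.
    public class FieldCacheDump {
      public static void dump() {
        CacheEntry[] entries = FieldCache.DEFAULT.getCacheEntries();
        for (int i = 0; i < entries.length; i++) {
          CacheEntry e = entries[i];
          System.out.println("reader=" + e.getReaderKey()
              + " field=" + e.getFieldName()
              + " type=" + e.getCacheType()
              + " value=" + e.getValue());
        }
      }
    }

An automated test could then assert that the number of entries (or their total estimated size) matches expectations after a search.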

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Here's the start of a patch to provide this functionality – it just provides a new method/data structure for inspecting the cache; the sanity-checking utility methods should be straightforward, assuming people think this is a good idea.

The new method itself is fairly simple, but quite a bit of refactoring of how the caches are managed was necessary to implement it sanely. These changes to the FieldCache internals seem like a good idea from a maintenance standpoint even if people don't like the new method.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Technically this isn't a bug, so I probably shouldn't add it to the 2.9 blocker list, but I really think it would be a good idea to have something like this in the 2.9 release.

At the very least: I'd like to put it on the list until/unless there is consensus that it's not needed.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Nice – it would be great if it could estimate RAM usage as well.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 – this'd be great to get into 2.9.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Looks good as a start; one question about a comment:

What do you mean by this:

I do not understand – do you want to remove the IntCache? How is it different from the other ones?

Uwe

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

:TODO: is the "int" sort type still needed? ... doesn't seem to be used anywhere, code just tests "custom" for SortComparator vs Parser.

Sorry ... badly placed quotes ... that was in reference to Entry.type.

Until I changed getStrings, getStringIndex, and getAuto to construct Entry objects as part of my refactoring, the "type" attribute (and the constructor that takes a type argument) didn't seem to be used anywhere (as far as I could tell).

My guess: maybe some previous change refactored the logic that switched on type up into the SortFields, so the FieldCache no longer needs to care about it?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here is a start towards guessing the FieldCache RAM usage.

It probably works fairly well, though it will be limited by stack space on a very heavily nested object graph.

I've added the size guess for getValue in the introspection output.

It's a start anyway.
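
For readers following along, the reflection-based technique works roughly like this – a minimal sketch, not Mark's actual patch: walk the object graph with reflection, sum approximate primitive/array/reference sizes, and track visited objects so shared references and cycles aren't double counted. The recursion is also why deep graphs are stack-hungry, as noted above. Sizes assume a 32-bit JVM and ignore alignment, so results are only estimates.

    import java.lang.reflect.Array;
    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;
    import java.util.IdentityHashMap;

    // Hypothetical sketch of reflection-based size estimation.
    public class SimpleRamEstimator {
      private final IdentityHashMap<Object, Object> seen = new IdentityHashMap<Object, Object>();

      public long estimate(Object o) {
        if (o == null || seen.containsKey(o)) return 0;
        seen.put(o, o);
        long size = 8; // object header (approximate)
        Class<?> clazz = o.getClass();
        if (clazz.isArray()) {
          int len = Array.getLength(o);
          Class<?> comp = clazz.getComponentType();
          if (comp.isPrimitive()) {
            size += (long) len * primitiveSize(comp);
          } else {
            size += (long) len * 4; // reference slots
            for (int i = 0; i < len; i++) size += estimate(Array.get(o, i));
          }
          return size;
        }
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
          Field[] fields = c.getDeclaredFields();
          for (int i = 0; i < fields.length; i++) {
            Field f = fields[i];
            if (Modifier.isStatic(f.getModifiers())) continue;
            if (f.getType().isPrimitive()) {
              size += primitiveSize(f.getType());
            } else {
              size += 4; // reference
              f.setAccessible(true);
              try { size += estimate(f.get(o)); }
              catch (IllegalAccessException e) { /* skip unreadable field */ }
            }
          }
        }
        return size;
      }

      private static int primitiveSize(Class<?> t) {
        if (t == long.class || t == double.class) return 8;
        if (t == int.class || t == float.class) return 4;
        if (t == short.class || t == char.class) return 2;
        return 1; // byte, boolean
      }
    }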

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

We probably want to provide an alternate toString that includes the RAM guess and a default that skips it – I haven't tested performance, but it might take a while to measure a gigantic string array.

Also, JavaImpl should probably actually be JavaMemoryModel or MemoryModel.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

More progress building on Mark's patch.

Added some sanity checking that reader/fieldname combos aren't reused in odd ways – I made it ignore cases where different parsers ultimately resolve to identical cache objects (ie null vs DEFAULT_LONG_PARSER), and it ignores any CreationPlaceholder objects (not sure about that one). A sketch of the core idea follows.
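
The core of that check is essentially a group-by: key every cache entry on its (reader, field) pair and flag any pair with more than one distinct cached value. A minimal sketch of the idea, again assuming the proposed CacheEntry accessors; identity comparison is what lets different parsers that resolve to the same cache object collapse into one entry:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.IdentityHashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.search.FieldCache.CacheEntry;

    // Hypothetical sketch: flag (reader, field) pairs with more
    // than one distinct cached value object.
    public class DuplicateEntryCheck {
      public static List<String> findSuspects(CacheEntry[] entries) {
        Map<String, IdentityHashMap<Object, Object>> valuesByKey =
            new HashMap<String, IdentityHashMap<Object, Object>>();
        for (int i = 0; i < entries.length; i++) {
          CacheEntry e = entries[i];
          String key = e.getReaderKey() + "+" + e.getFieldName();
          IdentityHashMap<Object, Object> vals = valuesByKey.get(key);
          if (vals == null) {
            vals = new IdentityHashMap<Object, Object>();
            valuesByKey.put(key, vals);
          }
          vals.put(e.getValue(), e); // identical cache objects collapse
        }
        List<String> suspects = new ArrayList<String>();
        for (Map.Entry<String, IdentityHashMap<Object, Object>> kv : valuesByKey.entrySet()) {
          if (kv.getValue().size() > 1) suspects.add(kv.getKey());
        }
        return suspects;
      }
    }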

Some tests were modified to make their pathological behavior more "sane", and hooks were added so that future tests can bypass the sanity testing in the tearDown method if they really need to.

Still need sanity testing of the Reader/subreader variety, plus lots of docs and code cleanup.

BTW: I was focused on test-core ... still waiting on test-contrib to finish running, so I'm not yet sure if I broke anything there.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Note to self: of the contribs, TestRemoteSort had two failed tests (not horribly surprising) and PatternAnalyzerTest generated an Error (?!?!) ...

java.lang.IllegalStateException: termText
    at org.apache.lucene.index.memory.PatternAnalyzerTest.assertEquals(PatternAnalyzerTest.java:213)
    at org.apache.lucene.index.memory.PatternAnalyzerTest.run(PatternAnalyzerTest.java:148)
    at org.apache.lucene.index.memory.PatternAnalyzerTest.testMany(PatternAnalyzerTest.java:87)
asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Minor checkpoint: improved assert messages, and massaged TestRemoteSort so that it appears more sane.

The problem with PatternAnalyzerTest was unrelated (see LUCENE-1756).

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Hmmm... somehow I overlooked the fact that even after I "fixed" TestRemoteSort in the last patch, it's still failing. Here's the assertion failure...

junit.framework.AssertionFailedError: testRemoteCustomSort Comparator: multi FieldCaches for same reader/fieldname with diff types
   at org.apache.lucene.util.LuceneTestCase.assertSaneFieldCaches(LuceneTestCase.java:110)
   at org.apache.lucene.search.TestRemoteSort.testRemoteCustomSort(TestRemoteSort.java:261)
   at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:265)

...and here's the debugging dump of the FieldCache...

*** BEGIN testRemoteCustomSort Comparator: FieldCache Losers ***
'org.apache.lucene.index.DirectoryReader@1108727'=>'custom',interface java.lang.Comparable,9,org.apache.lucene.search.SampleComparable$2@651e95,null=>[Ljava.lang.Comparable;#22056753 size guess:2 KB
'org.apache.lucene.index.DirectoryReader@1108727'=>'custom',interface java.lang.Comparable,9,org.apache.lucene.search.SampleComparable$2@5b78cf,null=>[Ljava.lang.Comparable;#32045680 size guess:2 KB
*** END testRemoteCustomSort Comparator: FieldCache Losers ***
*** BEGIN org.apache.lucene.search.TestRemoteSort.testRemoteCustomSort: FieldCache Losers ***
'org.apache.lucene.index.DirectoryReader@1108727'=>'custom',interface java.lang.Comparable,9,org.apache.lucene.search.SampleComparable$2@651e95,null=>[Ljava.lang.Comparable;#22056753 size guess:2 KB
'org.apache.lucene.index.DirectoryReader@1108727'=>'custom',interface java.lang.Comparable,9,org.apache.lucene.search.SampleComparable$2@5b78cf,null=>[Ljava.lang.Comparable;#32045680 size guess:2 KB
*** END org.apache.lucene.search.TestRemoteSort.testRemoteCustomSort: FieldCache Losers ***

What really confuses me about this is that the same SampleComparable instance is being used with two different queries – once with reverse=true and once with reverse=false – yet two different SampleComparable instances are showing up in cache keys. That probably only happens when SampleComparable is used to get a SortComparator – not when it uses a ComparatorSource earlier in the test.

Is this a real bug in remote sorting?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Is this a real bug in remote sorting?

I think so.

SortField is constructed on the client side and passed as a param to the remote Searchable. It seems you get a new factory every time this happens, not the client factory on the SortField (of course, because it's constructed on the other side of the wire).

So every search call adds a new factory to the mix. If you just make the call once, the test will pass.

It's a nice find.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here is a patch that just updates the RAM usage estimator code.

asfimport commented 14 years ago

Paul Smith (migrated from JIRA)

You know what would be absolute icing on the cake here: some way, during introspection, for code to look for large sort fields that can perhaps be discarded/unloaded as needed (programmatically).

What I'm thinking of here is a use case we've run into where we have had to sort by subject. Well, the number of unique subjects gets pretty large, and while we still need to support the use case, it'd be nice to be able to periodically 'toss' sort fields like this so they don't hog memory permanently while the IndexReader is still in memory. (Sorting by subject is used, just not often, so it's a good candidate for tossing.)

Because we have multiple large IndexReaders open concurrently, it'd be nice to be able to scan periodically and kick out any unneeded ones.

It's nice to be able to inspect and print out these, but even better if one can make changes based on what one finds.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Mark: I have a little time to work on this today ... do you have any updates that you've been working on locally? (I noticed some patch add/retract from you in the history.)

Paul: over in #1906 there was a lot of discussion and work done towards making the entire FieldCache internals pluggable so you could customize the cache behavior in all sorts of ways ... I feel out of the loop on that issue, but my understanding is that it was pushed back to 3.1 at the earliest because it wasn't clear how the APIs should be set up given the work being done with reopen and with moving FieldCache usage down to the subreaders.

For now my goal with this issue (LUCENE-1749) is purely to provide an experimental (ie: no back-compat expectations) API for app developers to use to sanity check that the changes in 2.9 haven't blown their RAM usage sky high.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I do - I removed that last patch because I just realized that it was missing everything but one class.

Go ahead though - I'll merge with what you have.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Uh ... ok ... what kind of updates do you have locally? (No point in merging later if I'm just going to write stuff you've already written.)

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

No worries, the updates are to the RAM estimator and other minor things (eg if something fails the sanity check, the error output comes out twice because of the double check in tearDown) - nothing feature-wise at the moment. I'll see what you add first.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Checkpoint: really hacky implementation of checkFieldCacheSubReaderSanity that tells you when a FieldCache contains entries for the same field in both wrapper and inner readers ... but it doesn't tell you which entries (and unlike checkFieldCacheTypeSanity, it's not obvious just from looking at the toString()).

This patch causes an error in TestStressSort and LOTS of errors in TestCustomScoreQuery, TestFieldScoreQuery, and TestOrdValues.

I'd like to think these errors are just from tests doing abnormal things, and in the case of TestStressSort that may be true (it looks like it has some hinky reuse of readers in a MultiReader), but in the other test cases, where it's a little easier to see at a glance what's going on, there's really nothing odd – simple use of a single IndexSearcher to execute a CustomScoreQuery, ValueSourceQuery, etc. These are each causing multiple FieldCache instances to pop up for a single field (one keyed on the DirectoryReader and another keyed on a CompoundFileReader$CSIndexInput).

I'm trying to refactor the sanity checking methods so they are easier to use (and to write some tests for them to prove they work as expected on both sane and insane caches), but it would be helpful if someone who understands more about how FieldCaches should look (pushed into the subreaders) could try out this patch and let me know if these failures look legit.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Checkpoint: refactored the sanity checking code into a utility class and wrote tests specifically for it to prove it finds insane stuff.

TODO:

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Figure out why previously mentioned tests are breaking (need help with this one ... don't know enough about the code these tests exercise)

Eh - it's yucky. There are parts where the tests are passing the top-level reader (say, to a collector) when they should be using the subreaders. I fixed one :) But there is more - I looked at a couple of more difficult ones that also pass the top-level reader for the test.

And then there is explain - IndexSearcher passes the top-level reader to the weight's explain, and ValueSourceQuery will get a FieldCache based on that reader. I guess that one is a bug.

And there are probably a few other similar things...

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

And then there is explain - IndexSearcher passes the top-level reader to the weight's explain, and ValueSourceQuery will get a FieldCache based on that reader. I guess that one is a bug.

I don't even know what to do about this one. All I can think of is that you pump out an explain for each subreader - but that's pretty unhelpful.

Perhaps the best we can do is javadoc the extra requirements that may be needed when you use explain?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Updates:

tests pass now

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Mark: thanks for looking into the tests.

If the CustomScoreQuery class(es) push the FieldCache usage into the subreaders during scoring, then shouldn't the explain methods do the same thing? It definitely seems like a bug if getting a score explanation from a query causes your memory footprint to double.

Last night I thought over what a more useful API for the sanity checker would look like ... My power is getting turned off for a few hours this afternoon, so I'll work on it then and should have a much cleaner-looking patch to post this evening.

(BTW: random thought that occurred to me last night: wouldn't the simplest way to implement the RamEstimator just be to use vanilla Java serialization to a custom OutputStream that counts the bytes and sends them to /dev/null?)
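
A minimal sketch of that serialization-counting idea, for the record (a hypothetical helper, not part of any patch here): serialize the object to a stream that discards bytes but counts them, and use the count as a rough proxy for heap size.

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.OutputStream;
    import java.io.Serializable;

    // Hypothetical sketch: serialized byte count as a size estimate.
    public class SerializedSizeEstimator {
      public static long serializedSize(Serializable obj) throws IOException {
        final long[] count = new long[1];
        OutputStream devNull = new OutputStream() {
          public void write(int b) { count[0]++; }
          public void write(byte[] b, int off, int len) { count[0] += len; }
        };
        ObjectOutputStream oos = new ObjectOutputStream(devNull);
        oos.writeObject(obj);
        oos.flush();
        return count[0];
      }
    }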

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

(BTW: random thought that occurred to me last night: wouldn't the simplest way to implement the RamEstimator just be to use vanilla Java serialization to a custom OutputStream that counts the bytes and sends them to /dev/null?)

That's one way to go. It's got its own little issues though - some bookkeeping state is not serialized, and extra info about classes and fields is serialized. Transient fields (a niche issue for sure) would also not be serialized. It's definitely another way to get an estimate. I chose a different route after considering both (googled the topic for a bit and looked at some examples before choosing). I'd be open to another route, but I thought this method was fairly fast, accurate, and generic.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

If the CustomScoreQuery class(es) push the FieldCache usage into the subreaders during scoring, then shouldn't the explain methods do the same thing? It definitely seems like a bug if getting a score explanation from a query causes your memory footprint to double.

It should do the same thing - but that's sticky. If you push explain to the subreaders, you will get why it scored as it did for each subreader - not one top-level explain. I won't deny it's kind of a bug - but I'm not sure at the moment what the best way to address it is. I'll look into the possibility of pushing the FieldCache access to the subreaders while leaving everything else at the top reader - I have no thoughts on the feasibility of that at the moment though. I guess it might be doable.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here is a rough draft for an explain fix.

Explain for custom and value-source queries now drops to per-segment readers to retrieve FieldCache values. Resolving this issue will also resolve #2845.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

This issue was a fantastic idea by the way!

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

This is a complete overhaul of the internals of FieldCacheSanityChecker and its API, so that it's a lot cleaner and easier to use programmatically.

This also makes it easier for tests to run an analysis and then ignore the types of errors they "expect" without ignoring whole cache entries (so a test that expects to have subreader problems can ignore those, even if one of those cache entries also fails a sanity check against another entry for a different reason, ie a different parser on the same reader).

And in the long run: this should make it easier for us to add new types of sanity checks (which I'm guessing we'll think of when the internals get overhauled) without changing the API too much.
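
Intended usage ends up looking roughly like this (a sketch: the FieldCacheSanityChecker / Insanity / InsanityType names follow the patch, but the accessor details are illustrative):

    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.util.FieldCacheSanityChecker;
    import org.apache.lucene.util.FieldCacheSanityChecker.Insanity;
    import org.apache.lucene.util.FieldCacheSanityChecker.InsanityType;

    // Sketch: run the checker, then let a test ignore the error
    // types it expects while still failing on anything else.
    public class SanityCheckExample {
      public static void assertSane() {
        Insanity[] problems = FieldCacheSanityChecker.checkSanity(FieldCache.DEFAULT);
        for (int i = 0; i < problems.length; i++) {
          if (InsanityType.SUBREADER.equals(problems[i].getType())) {
            continue; // this hypothetical test expects subreader overlap
          }
          throw new AssertionError("FieldCache insanity: " + problems[i]);
        }
      }
    }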

NOTE: This is based on Miller's attachment #12414960 (29/Jul/09) ... I haven't merged in or looked at any of the changes he made after that. I suspect the only overlap (if any) is how CacheEntry uses the RAM estimation code ... I switched to having estimateSize(RamUsageEstimator) cache the value, and then toString uses it if it's there ... the FieldCacheSanityChecker takes care of calling it if the client code asks for it.

Mark: Viv's got all weekend off, so I'm probably not going to have time to look at this again for 4 or 5 days; if you want to take a stab at merging the patches, that would be seriously awesome.

asfimport commented 14 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Quick responses to some other comments...

I chose a different route after considering both

I trust you to make the right call, just thought I'd point it out in case you hadn't thought of it.

If you push explain to the subreaders, you will get why it scored as it did for each subreader - not one top-level explain

I don't really follow you on this (I need to take a look at your proposed fix) ... I'm not suggesting we push the whole explain down to the subreader, just that when the explain method wants to get the FieldCache value for a doc, it should fetch the FieldCache for the SegmentReader the doc is in – so it gets the exact same value (and same FieldCache entry) as the scoring code did when it scored the document. (Or maybe I'm completely misunderstanding how these classes were reimplemented to use segment-based field caches.)

This issue was a fantastic idea by the way!

Yeah ... I was pretty out of the loop on all the "push sorting down into the segment" discussion, but when I noticed Yonik pointing out all the ways Solr's FieldCache usage was going to explode if we didn't change it, it occurred to me that this would probably be a big problem for anyone doing non-trivial stuff with Lucene, so it would be nice to have a way to troubleshoot it. (I also had very little faith in Lucene-Java's test coverage, since we only have unit tests that verify "correct" behavior when we make changes – but nothing sanity checks how that behavior happened. At least: not until now.)

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I don't really follow you on this (I need to take a look at your proposed fix) ... I'm not suggesting we push the whole explain down to the subreader, just that when the explain method wants to get the FieldCache value for a doc, it should fetch the FieldCache for the SegmentReader the doc is in - so it gets the exact same value (and same FieldCache entry) as the scoring code did when it scored the document. (Or maybe I'm completely misunderstanding how these classes were reimplemented to use segment-based field caches.)

The way the per-segment stuff went, we don't push down to the subreaders for the FieldCache per se - we just search each subreader separately - so a per-reader FieldCache is just kind of a side effect. Then the top-level reader is still used for things like stats and explain.

I switched the explain for the offending stuff (custom/valuesource) to use a DocValues class that does push down to each subreader for the FieldCache though (while everything else still uses the top reader) - it's in the scorer, so I added a switch for whether to push down to subreaders for FieldCache access - only explain pushes down, while regular scoring doesn't (regular scoring will be working per subreader anyway, because they are searched one at a time).

I can merge up the patches.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

the explain method wants to get the FieldCache value for a doc, it should fetch the FieldCache for the SegmentReader the doc is in

One more note to try and be a bit more clear:

First, I wasn't sure how easy this was to do, because I don't know the explain code or the function package very well at all (eg I've never used it). And the explain method itself did not grab values from the field cache; it loaded up a scorer that did so. So I just wasn't sure how doable this fix was. That's why I was saying pushing explain down to the subreader wasn't great, but I wasn't sure what else you could do. The fix turned out to be fairly easy though - the scorer for valuesource just needed two modes - one for normal scoring and one for explain (which breaks up the requests for a FieldCache value per subreader). The explain method would work either way, but there's no reason to try to break down per reader when it's going to score per reader anyway, so I have both. The standard scorer for valuesource works as it did, and explain trips a setting to break out subreaders and distribute FieldCache requests. And then the custom query needed a tweak to work right (flip that setting) with its underlying valuesource queries.

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I believe that ConstantScoreQuery will need its explain() fixed too?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

In the case where it is a caching filter? I hadn't actually looked to see if there are any other FieldCache ones either - just what tripped the tests.

I guess it could be dealt with the same way? A DocIdSet that distributes to sub readers ...

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Already finding some corner cases with the multi docidset stuff - I'll keep working along those lines a bit, then maybe look at some of the code you have been working on and post another patch later this weekend.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

In the insanity check, when you drop into the sequential subreaders - I think it's got to be recursive - you might have a multi at the top with other subs, or any combo thereof. I can add it to the next patch.
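
A sketch of that recursive walk (getSequentialSubReaders returning null for a leaf reader follows the existing IndexReader contract; the class and method names here are illustrative):

    import java.util.List;
    import org.apache.lucene.index.IndexReader;

    // Sketch: readers may nest (e.g. a MultiReader over other
    // MultiReaders), so recurse until a reader reports no
    // sequential subreaders.
    public class SubReaderWalker {
      public static void gatherSubReaders(List<IndexReader> out, IndexReader reader) {
        IndexReader[] subs = reader.getSequentialSubReaders();
        if (subs == null) {
          out.add(reader); // leaf (e.g. a SegmentReader)
        } else {
          for (int i = 0; i < subs.length; i++) {
            gatherSubReaders(out, subs[i]);
          }
        }
      }
    }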

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This was an excellent idea, and it's great that it uncovered some dangerous and very unexpected places where we are passing top-level reader to the FieldCache (eg that explain() could suddenly populate the FieldCache w/ top-level values is quite shocking!).

ReaderUtil.subSearcher is doing the same thing as DirectoryReader.readerIndex.

I love the RamUsageEstimator... we have other places that estimate RAM (eg IndexWriter does so for added & deleted docs) that we should eventually cut over to this new API.

I particularly love the new class named Insanity:

  public static Insanity[] checkSanity(FieldCache cache)

MultiDocIdSet/Iterator makes me a bit nervous, because it's further "propagating" a non-segment-based iterator deeper into Lucene than I think we want to. It's similar to eg using DirectoryReader.MultiTermDocs (what Lucene used to do), instead of stepping through the segments yourself.

Also, shouldn't explain most closely match what was done during searching (ie, run "per segment")? So simply pushing explain down to the sub-reader that has the doc seems appropriate? Ie we want it to share as much of the code path as possible with how searching was in fact done?

EG for ConstantScoreQuery.explain, it seems like we should 1) locate the sub-reader that this doc falls in, and 2) get a scorer against that reader, then 3) build up the explanation from that? Likewise for CustomScoreQuery?

In fact.... maybe we should simply fix IndexSearcher.explain to do this for all queries? Ie, get the top-level weight, locate sub-reader that has the doc, un-base the doc, and then invoke QueryWeight.explain with that sub-reader and un-based doc? Then we don't have to do anything special for each query. I think QueryWeight.scorer() shouldn't be expected to handle a "top level reader" being passed in. Ie, higher up in Lucene we should do that switch, so that we don't have to do it (this "valuesFromSubReaders" arg) for every scorer.
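
In sketch form, the proposed generic fix might look like this (a hypothetical class: subReaders/docStarts stand in for the flattened subreaders and their base docIDs that IndexSearcher would track, and Weight.explain(IndexReader, int) is assumed per the discussion):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Weight;

    // Sketch: resolve the subreader containing the doc, rebase the
    // docID, and explain against that subreader only.
    class PerSegmentExplainSketch {
      private final IndexReader[] subReaders;
      private final int[] docStarts; // base docID of each subreader

      PerSegmentExplainSketch(IndexReader[] subReaders, int[] docStarts) {
        this.subReaders = subReaders;
        this.docStarts = docStarts;
      }

      Explanation explain(Weight weight, int doc) throws IOException {
        int i = subReaderIndex(doc);
        return weight.explain(subReaders[i], doc - docStarts[i]); // un-based doc
      }

      // Linear scan for the subreader whose doc range contains doc.
      private int subReaderIndex(int doc) {
        int i = 0;
        while (i + 1 < docStarts.length && docStarts[i + 1] <= doc) i++;
        return i;
      }
    }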

Hmm: why do we even have explain at both the QueryWeight and Scorer "levels"? It seems like we should pick one level and do it there, consistently. Most queries seem to only implement the QueryWeight one and often simply throw UOE in the Scorer's explain, but eg PhraseQuery implements in both places.

(BTW: I'll be offline for approx the next 36 hours or so!)

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

In fact.... maybe we should simply fix IndexSearcher.explain to do this for all queries?

That was my first thought... but it would probably break more than it helped right now (by exposing more limitations) - for example idf in TermWeight.explain()

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Ie we want it to share as much of the code path as possible with how searching was in fact done

Well of course ;) I was a bit hazy on explain, so for some reason I had it in my head that you would have to combine the explanations from multiple subreaders - but of course it's a doc at a time, so the doc will only come from one subreader, and the sim/weight will be top-level. So easy-peasy fix. That boolean valuesFromSubReaders definitely had some code smell - I just didn't have an alternative at the moment - fix then improve!

I'll leave the 'explain at multiple levels' for another issue - I haven't even started thinking about that issue yet - I prefer to code :) Which is kind of an oxymoron.

I don't have the code in front of me, but I thought I was adding the subreaders to the list it's iterating over, so it will eventually recurse all the way to the bottom.

Ah right, sorry about the false alarm. One of the few times I've seen .size() in a for loop where it's actually needed ;)

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Changes to just go per reader for each doc - and a couple of other unrelated tiny tweaks.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Patch cleanup - more suitable for browsing.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I was a bit hazy on explain, so for some reason I had it in my head that you would have to combine the explanations from multiple subreaders

but it would probably break more than it helped right now (by exposing more limitations) - for example idf in TermWeight.explain()

To be a little more clear - this was originally why I went the direction I did - I assumed the reader was being used for stats that needed to come from the top-level reader. Gut reaction seeing it go into the scorer. I hadn't really checked; at least for these queries, that wasn't the case - they just use it for the filter/FieldCache.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

That was my first thought... but it would probably break more than it helped right now (by exposing more limitations) - for example idf in TermWeight.explain()

Ugh, you're right.

I think it shouldn't be doing that? Ie, a Weight instance should "capture" all stats needed from the top-level searcher on creation, and then when we ask for a scorer or an explain (or other future things that take an IndexReader) we should always pass in a single segment reader. This way we don't have to duplicate the "go find the right sub-reader" logic in many places.

It's interesting that we didn't (I think?) have a similar problem w/ scorer when we switched to passing it the sub-reader.

I'll leave the 'explain at multiple levels' for another issue

It looks like it's up to each query, which level does what. IndexSearcher's explain always calls Weight.explain, but then some Query impls (eg BooleanQuery) do everything in Weight.explain, while others (eg TermQuery, PhraseQuery) do some work in Weight.explain and some in the scorer.

I guess this makes sense: "atomic" Queries (TermQuery, PhraseQuery) will need to fire up a scorer since there's real work to be done to see the specifics of how that doc was matched. Whereas BooleanQuery simply "glues" together other queries so it doesn't need to forward to its [many] scorers.

So the only odd thing is why explain is part of Scorer base class... seems like the method could/should live "privately" to only those queries that need it.

But I agree let's leave this be for now...

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It's interesting that we didn't (I think?) have a similar problem w/ scorer when we switched to passing it the sub-reader.

Right - that code was well tested and exercised via MultiSearcher in the past (all idf values had to come from Weight to avoid getting idfs per sub-searcher). One thing that's missing for explain() is that there is no way to get "df" as opposed to "idf" from the Weight.

So the only odd thing is why explain is part of Scorer base class

Right.. it doesn't belong there. Perhaps deprecate and remove from the Scorer base in 3.0? (since one can't reliably call it now anyway).

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I've got one more draft here with the smallest of tweaks - javadoc spelling errors, and perhaps one or two other tiny things - stuff I would just toss out rather than merge - but are you doing anything here right now, Hoss? I think not at the moment, so if that's the case I'll put up one more patch before you grab the conch back. Otherwise I'll hold off on anything till you put something up.

asfimport commented 14 years ago

Chris Hostetter (Unused) (@hossman) (migrated from JIRA)

I've got one more draft here with the smallest of tweaks - javadoc spelling errors, and perhaps one or two other tiny things - stuff I would just toss out rather than merge - but are you doing anything here right now, Hoss? I think not at the moment, so if that's the case I'll put up one more patch before you grab the conch back. Otherwise I'll hold off on anything till you put something up.

You have the conch ... I haven't worked on anything related to this issue since my last patch.

I'll try to look at it again tomorrow.

-Hoss

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Right - that code was well tested and exercised via MultiSearcher in the past (all idf values had to come from Weight to avoid getting idfs per sub-searcher).

Ahh right.

One thing that's missing for explain() is that there is no way to get "df" as opposed to "idf" from the Weight.

But this only affects the "atomic" queries, right? So eg TermWeight could simply hold onto this value and then use it during explain. Hmm... though TermQuery's ctor doesn't get the df directly, because it calls similarity.idf(term, searcher). I don't really like making a separate additional call to docFreq.

How about, for queries that need to go and look up docFreq, their QueryWeight impls simply hold onto the [top-level] IndexSearcher that had been passed to their ctor, and then do the docFreq call against that, if explain is invoked?

Right.. it doesn't belong there. Perhaps deprecate and remove from the Scorer base in 3.0? (since one can't reliably call it now anyway).

+1

Hoss/Mark, do you want to fold it into the patch here? Or should I open a new issue?