apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Nested Document query support [LUCENE-2454] #3528

Closed asfimport closed 13 years ago

asfimport commented 14 years ago

A facility for querying nested documents in a Lucene index as outlined in http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene


Migrated from LUCENE-2454 by Mark Harwood (@markharwood), 10 votes, resolved Jun 29 2011. Attachments: LUCENE-2454.patch (versions: 2), LuceneNestedDocumentSupport.zip

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Initial attachment is code plus illustrative data/tests. Fuller unit tests, build scripts etc. to follow.

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Robust use of this feature is dependent on careful management of segments i.e. that all compound documents are held in the same segment.

Michael Busch suggested the introduction of a new "FlushPolicy" on IndexWriter to offer the required control. (see http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3C4BE5A14C.6040108@gmail.com%3E ) Sounds sensible to me given that IndexWriter currently manages to muddle 2 alternative policies in the one implementation and it looks like we now need a third.

Is this the place to start the debate on "FlushPolicy"? My guess is this change would involve:

Let me know where it's best to continue the thinking on these IndexWriter changes.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

An alternate approach - there was a discussion on narrowing the indexing API to something stream-like, with Document becoming its default implementation. We can add some flush-boundary signalling methods, or a notion of composite documents, to that new API.

I like this approach more, as control does not spread out across different APIs/instances. You don't have to hold reference to your policy in the indexing code, you don't have to raise/lower flags in some remote code to signal things that are internal to yours.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Both things can be combined for sure. New stream-like indexing API stuffs docs into IW and controls when flushes /can/ happen, while FlushPolicy decides if they actually /do/ happen, when they /can/.

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

  • there was a discussion on narrowing indexing API to something stream-like

Any idea where that discussion was taking place? Happy to move flush-control discussions elsewhere if that is more appropriate.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

I thiiiiink, here - #3385

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Wow, this is absolutely awesome! This is one of the best enhancement requests to Lucene/Solr that I've seen, as it brings a real enhancement that is difficult / impossible to achieve without it.

The leading concern I have with this implementation is the growth in the number of documents in the index, as it affects the size of filters and perhaps other areas that create BitSets. I have a scenario in which the sub-documents number on average over 100 per primary document. These sub-documents are at least very small, and they don't share any fields with the parent. For a large-scale search situation, an index containing 3M Lucene documents now needs to store over 300M, and thus requires 100x the amount of RAM for filter caches that I require now. Thoughts?

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Wow, this is absolutely awesome!

Thanks. I've found that this certainly solves problems I previously couldn't address at all in standard Lucene.

The leading concern I have with this implementation is the growth in the number of documents in the index, as it affects the size of filters

These filters can obviously be cached but you'll need one filter per level you "roll up" to. Assuming a 300M doc index and only rolling up matches to the root, that should only cost 300M docs / 8 bits per byte = 37.5MB of RAM. Index reloads should avoid the cost of completely rebuilding this filter nowadays because filters are cached at segment level and unchanged segments will retain their cached filters. Perhaps a bigger concern is any norms arrays, which are allocated one BYTE (as opposed to one bit) per document in the index.

and they don't share any fields with the parent.

For parents with only 1 child document instance of a given type, these could be safely "rolled up" into the parent and stored in the same document.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

37.5MB of RAM for every filter is a LOT compared to the 375KB I need now (100x). Presumably the facet intersections now take 100x as long too. I cache nearly a thousand of these per index (lots of faceting!) which is by the way just one Solr shard of many. No can do. :-(

I wonder if it's plausible to consider a different implementation strategy employing a parallel index, with the child documents storing the document IDs of their parents in the parent index. I might even assume I need no more than 1000 child documents per parent and thus index blank documents as filler, so that if I am looking at a child document with id 32005 then it is the 6th sub-entity belonging to parent document id 32. I know that document IDs are a bit transient so I know that some care would be needed to maintain this strategy. Thoughts?
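For what it's worth, the arithmetic of that filler scheme is simple; a minimal sketch, assuming a fixed stride of 1000 children per parent (the constant and variable names are illustrative, not from any attached code):

final int CHILDREN_PER_PARENT = 1000; // fixed stride, padded with blank filler docs
int childDocId = 32005;               // a docId in the child index
int parentDocId = childDocId / CHILDREN_PER_PARENT;  // 32 -> the parent's docId in the parent index
int childOrdinal = childDocId % CHILDREN_PER_PARENT; // 5  -> the 6th sub-entity (zero-based)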

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Yep, I can see an app with a thousand cached filters would have a problem with this impl as it stands.

Maintaining parallel indexes always feels a little flaky to me, not least because of the loss of transactional integrity you can get from using a single index.

Is another approach to make your cached filters document-type-specific? I.e. they only hold numbers in the range of zero to number-of-docs-of-this-type. To use a cached doc ID in such a filter you would need to make use of mapping arrays to project the type-specific doc id numbers into global doc-id references and back. Let's imagine an index with a mix of "A", "B" and "C" doc types organised as follows (docIds zero-based):

docId docType
===== =======
0     A
1     B
2     C
3     A
4     C
5     C

The mapping arrays for docType "C" would look as follows (the dense array holds the global docIds of the type C docs):

int [] globalDocIdToTypeCLookUp = {-1,-1,0,-1,1,2}; // sparse, sized 0 -> num docs in overall index
int [] typeCToGlobalDocIdLookUp = {2,4,5};          // dense, sized 0 -> num type C docs in overall index

Your cached filters would be created as follows:

myTypeCBitSet = new OpenBitSet(numberOfTypeCDocs); // this line is hopefully where you save RAM!
// for all matching type C docs...
myTypeCBitSet.set(globalDocIdToTypeCLookUp[realDocId]);

Your filters can then be used by dereferencing the child doc IDs as follows:

int nextRealDocId = typeCToGlobalDocIdLookUp[myTypeCBitSet.nextSetBit(fromIndex)]; // fromIndex = bit to start scanning from

Clearly the mapping arrays come at a cost of 4 bytes * num docs, which is non-trivial. The sparse globalDocIdToTypeCLookUp array shown here could be avoided by reading TermDocs and counting at cached-filter-create time.
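To make the idea concrete, here is a minimal sketch of building both mapping arrays in a single pass at cached-filter-create time (written against the Lucene 3.x TermDocs API; the class name and the "docType" field are illustrative, not part of the attached patch):

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

class TypeCMappings {
  final int[] globalDocIdToTypeCLookUp; // sparse: global docId -> type C ordinal, or -1
  final int[] typeCToGlobalDocIdLookUp; // dense: type C ordinal -> global docId

  TypeCMappings(IndexReader reader) throws IOException {
    globalDocIdToTypeCLookUp = new int[reader.maxDoc()];
    Arrays.fill(globalDocIdToTypeCLookUp, -1);
    int[] dense = new int[reader.maxDoc()]; // oversized scratch buffer
    int numTypeCDocs = 0;
    TermDocs td = reader.termDocs(new Term("docType", "C"));
    try {
      while (td.next()) { // walks the type C docs in docId order
        globalDocIdToTypeCLookUp[td.doc()] = numTypeCDocs;
        dense[numTypeCDocs++] = td.doc();
      }
    } finally {
      td.close();
    }
    typeCToGlobalDocIdLookUp = Arrays.copyOf(dense, numTypeCDocs);
  }
}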

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

That's an interesting strategy. The size of these arrays is no big deal to me since there are only a couple of them. My concern with this strategy is that I wonder if potentially many places in Solr would have to become aware of this scheme, which might make this strategy untenable to implement even though it's theoretically sound.

Another nice thing about the parallel index is that the idf relevancy factor stays clean since it will only consider "real" documents.

I want to investigate these options closer ASAP since this feature you've implemented is something I need. Before I saw this issue, I was going to try something with SpanNearQuery and the masking-field variant.

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I wonder if potentially many places in Solr would have to become aware of this scheme

Supporting multiple doc types in this way can be a big modelling change. Not everyone needs it so I suspect that may be a hard sell to the Solr crowd.

Another nice thing about the parallel index is that the idf relevancy factor stays clean

I have an "IDF compensating" wrapper query that takes care of that. It wraps a child query to then wrap the Similarity class in use and adjusts IDF calculations to be based on the number of documents of the required type. I'll attach it here when I get a chance.

Before I saw this issue, I was going to try something with SpanNearQuery and the masking-field variant.

I went through a similar thought process around using position info to make this stuff work. This child-doc approach used here seems the cleanest by far.

asfimport commented 14 years ago

Amit Kulkarni (migrated from JIRA)

This is an amazing feature!

Can this help in searching over multiple child/nested documents? Is there sample code available that demonstrates how to achieve this?

We have a requirement wherein search results need to carry fields from child documents. Can this be achieved?

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Can this help in searching over multiple child/nested documents?

Yes, a typical use case is to use "NestedDocumentQuery" to fetch the top 10 parents then do a second query to fetch the children using a mandatory clause which lists the primary keys of the selected parents (assuming the children have an indexed field with the parent primary key). The "PerParentLimitedQuery" can be used to limit the number of child docs returned per parent if there are many e.g. pages in a book. Both these classes are in the zipped attachment to this issue.
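For illustration, the second pass could be assembled from stock Lucene query classes along these lines (a sketch against the Lucene 3.x API; the "parentPk" field name and the selectedParentPks/childCriteria variables are assumptions, not taken from the attached code):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Pass 2: fetch children belonging to the top parents found in pass 1.
BooleanQuery parentKeys = new BooleanQuery();
for (String pk : selectedParentPks) {
  parentKeys.add(new TermQuery(new Term("parentPk", pk)), BooleanClause.Occur.SHOULD);
}
BooleanQuery childQuery = new BooleanQuery();
childQuery.add(parentKeys, BooleanClause.Occur.MUST);    // restrict to the chosen parents
childQuery.add(childCriteria, BooleanClause.Occur.MUST); // the child-level search criteria
TopDocs childHits = searcher.search(childQuery, 100);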

asfimport commented 14 years ago

Buddika Gajapala (migrated from JIRA)

I tried this solution and it works perfectly for smaller indexes (either with fewer documents or with small documents). However, for larger indexes that span multiple segments it only matches the parent document accurately for the 1st segment. I think this is due to the way the parent docs are marked using a bit array for the ENTIRE index, while the actual traversal for matching criteria is done by the Scorer segment-by-segment (i.e. in the nextDoc() and advance() methods). Have you considered this situation?

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

The 2nd comment above talks about this and the need for Lucene to offer more control over flush policy.

it only matches the parent document accurately for the 1st segment. I think this is due to the way the parent docs are marked using a bit array for the ENTIRE index

But aren't filters held and evaluated within the context of each sub-reader? Are you sure the issue isn't limited to a parent/child combo that is split across segments?

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Attached JUnit test confirms the issue with multiple segments (thanks, Buddika). Previous tests masked the error. I'm looking into a fix now.

Cheers, Mark

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe we should add an addDocuments call to IW? To add more than one document, "atomically", so that any flush must happen before or after them?

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Updated package with fix for multi-segment issue and improved Junit test

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Maybe we should add an addDocuments call to IW? To add more than one document, "atomically", so that any flush must happen before or after them?

That would be nice. Another way of modelling this would be to introduce Document.add(Document childDoc) but I think that is a more fundamental and wide-reaching change.
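As a rough sketch of what the proposed call might look like from application code (the method name follows Mike's suggestion; the exact signature, and the writer/document variables, are assumptions at this point):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;

// Index one resume and its employment records as an atomic block, so a flush
// can only fall before or after the whole group, never inside it.
List<Document> block = new ArrayList<Document>();
block.add(parentResumeDoc);        // parent first, matching the attached patch's ordering
block.addAll(employmentChildDocs);
writer.addDocuments(block);        // the proposed atomic, same-segment add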

asfimport commented 14 years ago

Buddika Gajapala (migrated from JIRA)

Mark, that was fast :)

BTW another scenario: when there is a lot of data, there is a possibility of the parent document and a matching child document ending up in two different segments, causing some matches to be missed. I made a minor modification to your approach by making it do a "forward scan" instead of a reverse scan for parent docs, and have the parent document inserted AFTER the child docs. In case the parent doc is located outside the scope of the current segment, its docId is preserved at the "Weight object" level and nextDoc() is modified to check for that on the very first nextDoc call in a new segment.

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I made a minor modification to your approach by making it do a "forward scan" instead of a reverse scan

Interesting, but I'm not sure what guarantees Lucene will make about:

That seems like a fragile set of dependencies. Also, don't things get tricky when reporting matches from NestedDocumentQuery and PerParentLimitedQuery back to the collector? During the query process the IndexSearcher resets the docId context (Collector.setNextReader) as it moves from one Scorer/segment to another. If we are delaying the assessment/reporting of matches until we've crossed a segment boundary, it is too late to report a match on a child doc id from a previous segment, as the collector has already changed context. Unfortunately "PerParentLimitedQuery" needs to do this when selecting the "best" children for a single parent.

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

How about an implementation for strict hierarchies that uses two fields per document, in the following way:

The two fields each contain a single (indexed) token that indicates the node in the nesting hierarchy, one field meaning that the document is a child of that node, and the other that the document is the representative of that node. Any number of levels could be allowed, but no cycles of course. These fields are then used by a merge policy to keep the documents ordered postorder, that is the children immediately followed by the representative for each node. Collecting scores at any node in the hierarchy could then be done by using term filters, one for each involved scorer, to provide the representative for the current doc by advancing.

For example, in index order:

userDocId  nodeMemberField  nodeReprField
doc1       nodeA1           .
doc2       nodeA1           .
doc3       nodeA            nodeA1
doc4       nodeA2           .
doc5       nodeA2           .
doc6       nodeA            nodeA2

The node representatives for scoring could then be obtained by a term filter for nodeA.

I think this could work for the scoring part, basically along the lines of the code already posted here.

Could someone with more experience in segment merge policies comment on this? This is quite restrictive for merging as the only freedom that is left in the document order is the order of the children for each node.

For example, adding a leaf document doc7 for nodeA1 could result in the following index order:

userDocId  nodeMemberField  nodeReprField
doc4       nodeA2           .
doc5       nodeA2           .
doc6       nodeA            nodeA2
doc7       nodeA1           .
doc1       nodeA1           .
doc2       nodeA1           .
doc3       nodeA            nodeA1

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Hi Paul, I'm not sure I currently have an issue with merges as they just concatenate established segments without interleaving their documents. This operation should retain the order that is crucial to maintaining the parent/child/grandchild relationships (unless something has changed in merge logic which would certainly be an issue!). My main cause for concern is robust control over flushes so parent/child docs don't end up being separated into different segments at the point of arbitrary flushes.

I think your proposal here is related to a new (to me) use case where clients can add a single new "child" document and the index automagically reorganises to assemble all prior related documents back into a structure where they are grouped as contiguous documents held in the same segment? Please correct me if I am wrong. Previously I have always seen this need for reorganisation as an application's responsibility and a single child document addition required the app to delete the associated parent and all old child docs, then add a new batch of documents representing the parent, old children plus the new child addition. Given the implied deletes and inserts required to maintain relationship integrity that seems like an operation that needs to be done under the control of Lucene's transaction management APIs rather than some form of special MergePolicy which are really intended for background efficiency tidy-ups not integrity maintenance.

As for the fields you outline for merging: generally speaking, in applications using NestedDocumentQuery and PerParentLimitedQuery I have found that for searching purposes I already need to store:

1) A globally unique ID as an indexed "primary key" field on the top-level container document
2) An indexed field with the same unique ID held in a different "foreign key" field on child documents
3) An indexed field indicating the document type e.g. "root" or "resume" and "level1Child" or "employmentRecord"

I could be a little confused about your intentions - maybe we should start with what problem we are trying to solve before addressing how we achieve it?

Cheers Mark

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

I think your proposal here is related to a new (to me) use case where clients can add a single new "child" document and the index automagically reorganises to assemble all prior related documents back into a structure where they are grouped as contiguous documents held in the same segment?

Indeed.

The first two fields match the ones I intended. The third field for the document type would be quite useful for searching, but it may not be necessary to maintain the document order.

The intention is quite simple: allow a set of documents to be used to provide a single score value during query searching. AFAICT that fits most of the use cases described here.

To allow conjunctions inside such a set, it is necessary to advance() a scorer into a set, and for that it might be better to put the set representative before the children. The document order would then be pre-order instead of post-order, which would not really make any difference in the difficulty of keeping the docs in order. With the representative before the children, an extra operation (something like previousDocId()) would be needed on the iterator of the filter.

I don't know about flushes during merging. One operation that would recur during index maintenance is appending a sequence of documents from one segment to another segment, see docs 1, 2 and 3 above. This is indeed what needs to be done when a new child is added, or when an existing one is changed, i.e. deleted and added. I'm not familiar with the merging code, but I would suppose something very close to appending a sequence of documents from an existing segment is already available. Anyway this is costly, but that is the price to pay.

During searching, the term filters used for the node representatives might use some optimizations. Since one term filter is needed for every document scorer involved in searching the query and these term filters are all based on the same term, they could share index information, for example in a filter cache. A bit set is not always optimal for such filters, perhaps a more tree like structure could be more compact and faster. But bit sets could be used to get this going.

The good news so far for me is that this seems to be feasible, thanks.

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

The intention is quite simple: allow a set of documents to be used to provide a single score value during query searching

That's what the existing NestedDocumentQuery code attached to this issue already provides. As far as I am concerned the search side works fine and I have it installed in several live installations (along with a bug fix for "skip" that I must remember to upload here). "Parent" filters as you suggest benefit from caching and I typically use the XMLQueryParser with a <CachedFilter> tag to take care of that (I need to upload the XMLQueryParser extensions for this Nested stuff too).

The new "intention" that I think you added in your last post was more complex and is related to indexing, not searching and introduced the idea that adding a new child doc on its own should somehow trigger some automated repair of the index contents. This repair would involve ensuring that related documents from previous adds would be reorganised such that all related documents still remained physically next to each other in the same segment. I don't think a custom choice of "MergePolicy" is the class to perform this operation - they are simply consulted as an advisor to pick which segments are ripe for a background merge operation conducted elsewhere. The more complex merge task you need to be performed here requires selective deletes of related docs from existing segments and addition of the same documents back into a new segment. This is a task I have always considered something the application code should do rather than relying on Lucene to second-guess what index reorganisation may be required. We could try make core Lucene understand and support parent/child relationships more fully but I'd settle for this existing approach with some added app-control over flushing as a first step.

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

So the missing basic operation is a copy/append of a range of existing index docs. After that operation, the original docs can be deleted, but that is trivial.

I'll have a look at IndexWriter for this over the coming days. Any quick hints?

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I'm not sure the auto-repair is that trivial. Let's say the parent/child docs are resumes and nested docs for employment positions (as in the attached example). An update may not just be adding a new "employment position" doc but editing an existing one, deleting an old one etc. Your auto-updater is going to need to do a lot of figuring out to work out which existing docs need copying over from earlier segments and patching in to the new segment with the updated parts of the resume. This gets worse if we start to consider multiple levels to the hierarchy. It all feels like a lot of work for the IndexWriter to take on?

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Updated version with fix for Scorer.advance issue and added XMLQueryBuilder support

asfimport commented 13 years ago

RynekMedyczny.pl (migrated from JIRA)

Mark, do you have any plans for including this feature into the Lucene trunk? I think that this is a "must have" feature since tree structures are so common! Thank you in advance.

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Mark, do you have any plans for including this feature into the Lucene trunk?

That is my intention in providing it here. I had to work hard to convince my employer to let me release this as open source in the interests of seeing it updated/tested as core Lucene APIs change - and hopefully receiving some improved support in IndexWriter "flush" control. Unfortunately it seems not everyone shares the pain when it comes to modelling richer data structures, and some seem content with the "flat" model we have in Lucene today. Code like this ends up in trunk when there is consensus, so your support is welcome.

While core Lucene adoption is a relatively simple technical task, Solr adoption feels like a much more disruptive change.

asfimport commented 13 years ago

Jamal Natour (migrated from JIRA)

Mark,

For my project this is a must-have feature that could decide the adoption of SOLR. What do you think is the best way to help ensure this gets incorporated into SOLR?

Thank you, Jamal

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Lucene does not dictate a schema and so using this approach to index design/querying is not a problem with base Lucene.

Solr, however, does introduce a schema and much more that assumes a "flat" model. In the opening chapters of the "Solr 1.4 Enterprise Search Server" book the authors take the time to discuss the modelling limitations of this flat model and acknowledge it as an issue. The impact of adopting "nested" documents in Solr at this stage would be very large. There may be ways you can overcome some of your issues without requiring nested documents (using phrase/span queries or combining tokens from multiple fields in Solr) but in my experience these are often poor alternatives if richer structures are important.

Cheers Mark

asfimport commented 13 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Solr, however, does introduce a schema and much more that assumes a "flat" model.

In SOLR-1566 we could add a DocList as a field within a SolrDocument – this would at least allow the output format to return a nested structure.

I have not looked at this patch so this comment may be off base.

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I have not looked at this patch so this comment may be off base.

The slideshare deck gives a good overview: http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

As a simple Lucene-focused addition I'd prefer not to explore all the possible implications for Solr adoption here. The affected areas in Solr are extensive and would include schema definitions, query syntax, facets/filter caching, result-fetching, DIH etc etc. Probably best discussed elsewhere.

asfimport commented 13 years ago

RynekMedyczny.pl (migrated from JIRA)

Code like this ends up in trunk when there is consensus, so your support is welcome.

Of course! How can we help you?

While core Lucene adoption is a relatively simple technical task

We are eagerly waiting for your work to be incorporated into Lucene Core!

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this is a very important addition to Lucene, so let's get this done!

I just opened #4185, to add IW.add/updateDocuments, which would atomically add Documents produced by an iterator and ensure they all wind up in the same segment. I think this is the only core change necessary for this feature? Ie, all else can be built on top of Lucene once #4185 is committed?

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I think this is the only core change necessary for this feature?

Yup. A same-segment indexing guarantee is all that is required.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, pulling out just NestedDocumentQuery from the zip file.

I put a few nocommits in that still need resolving...

My patch doesn't include XMLQueryParser integration; I think we should do this as a separate issue?

I think PerParentLimitedQuery is the same functionality as grouping, right? Except it only supports relevance withinGroup sort, and it operates only w/ a parent filter (whereas our grouping impls so far require a more-RAM-costly DocTermsIndex FC entry of the field you are grouping on). I think we should somehow merge this w/ the single pass grouping collector (#4202)?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One idea: maybe the parent doc should be indexed last, in the block, instead of first? Ie, so that NestedDocumentQuery.getParentDoc could use obs.nextSetBit()?
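A sketch of what that lookup could collapse to if the parent were indexed last in each block (obs being the cached OpenBitSet marking parent documents; names are illustrative):

// With the parent stored after its children, the enclosing parent of any
// child doc is simply the next set bit at or after that doc.
int getParentDoc(org.apache.lucene.util.OpenBitSet obs, int childDoc) {
  return obs.nextSetBit(childDoc); // -1 if no parent follows (malformed block)
}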

asfimport commented 13 years ago

Thomas Guttesen (migrated from JIRA)

Hi.

Great feature... I have some difficulties understanding the semantics/flow of document creation. Do you have to add the parent and child levels in any particular sequence? Or can you insert all parents and then insert all child levels later? The reason I ask is that in my case I'm looking for a one->many relation style insertion. I had hoped that I could add more child levels later.

Cheers

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Thanks for the patch work, Mike. I'll need to check #4202 for equivalence with PerParentLimitedQuery. It's certainly a central part of what I typically deploy for nested queries - pass 1 is usually a NestedDocumentQuery to get the best parents and pass 2 uses PerParentLimitedQuery to get the best children for these best parents. Of course some apps can simply fetch ALL children for the top parents but in some cases summarising children is required (note: this is potentially a great solution for performance issues on highlighting big docs e.g. entire books).

I haven't benchmarked nextSetBit vs the existing "rewind" implementation but I imagine it may be quicker. Parent-followed-by-children seems more natural from a user's point of view, however. I guess you could always keep the parent-then-child insertion order but flip the bitset (then cache) for query execution if that was faster. Benchmarking rewind vs nextSetBit vs flip-then-nextSetBit would reveal all.

Thomas - maintaining a strict order of parent/child docs is important and the recently-committed #4185 should help with this.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'll need to check #4202 for equivalence with PerParentLimitedQuery. It's certainly a central part of what I typically deploy for nested queries - pass 1 is usually a NestedDocumentQuery to get the best parents and pass 2 uses PerParentLimitedQuery to get the best children for these best parents.

Hmm, so I wonder if we could do this in one pass? Ie, like grouping, if you indexed your docs as blocks, you can use the faster single-pass collector; but if you didn't, you can use the more general but slower and more-RAM-consuming two pass collector.

It seems like we should be able to do something similar with joins, somehow... ie Solr's join impl is a start at the "fully general" two-pass solution.

But I agree the "join child to parent" and then "grouping of child docs" go hand in hand for searching...

What do you do for facet counting in these apps...? Post-grouping faceting also ties in here.

Of course some apps can simply fetch ALL children for the top parents but in some cases summarising children is required

Right...

(note: this is potentially a great solution for performance issues on highlighting big docs e.g. entire books).

I think it'd be compelling to index book/articles with each page/section/chapter being a new doc, and then group them under their book/article.

I haven't benchmarked nextSetBit vs the existing "rewind" implementation but I imagine it may be quicker.

I think it should be much faster – obs.nextSetBit looks heavily optimized, since it can operate a word at a time. Though, if the groups are smallish, so that nextSetBit is often maybe 2 or 3 bits away, I'm not sure it'd be faster...

Parent-followed-by-children seems more natural from a user's point of view, however.

But is it really so bad to ask the app to put parent doc last?

I mean, the docs have to be indexed w/ the new doc block APIs in IW anyway, which will often be eg a List<Document>, at which point putting parent last seems a minor imposition?

Since this is an expert API I think it's OK to put [minor] impositions on its usage if this can simplify the impl / make it faster / less risky. That said, I'm not yet sure on the impl (single pass query + collector vs generic two-pass join that solr now has), so it's probably premature to worry about this...

I guess you could always keep the parent-then-child insertion order but flip the bitset (then cache) for query execution if that was faster.

True but this adds some hair into the impl (we must also "flip" coming back from nextSetBit)...

Benchmarking rewind vs nextSetBit vs flip-then-nextSetBit would reveal all.

True, though it'd be best to do this in the context of the actual join impl...

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

I see no test cases for required terms in a nested document. This may be non-trivial in that advance() should advance into the first doc of the nested doc group. For example, assume the parents p1 and p2 are the first docs in their nested doc groups, and that the query requires a and b to be present:

docId
0   p1
1   a
2   b
3   p2
4   b
5   a

In this situation, p2 may be missed when advance() on the required scorer for "b" is given docId 5 (containing "a") as a target. It should be given target docId 3, so that it advances into p2's nested doc group and finds "b" at docId 4.
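In other words, the nested scorer needs to rewind any advance() target to the start of the enclosing doc group before delegating to its child scorers. A naive sketch, assuming parents come first and parentBits marks the parent positions (illustrative only, not the attached implementation):

// Map a raw advance() target back to the first doc of its group, so required
// clauses earlier in the same group are not skipped past.
int blockStart(org.apache.lucene.util.OpenBitSet parentBits, int target) {
  int doc = target;
  while (doc > 0 && !parentBits.get(doc)) {
    doc--; // naive backward scan; a real impl would want a prev-set-bit primitive
  }
  return doc; // e.g. target docId 5 rewinds to blockStart docId 3 in the example above
}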

I quickly read the code here, but I could not easily determine whether this is done correctly or not. Shall I add a test case here, or would it be better to open another issue after this one is closed, or can someone reassure me that this is not an issue?

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Mike - having thought about it, moving the per-parent grouping logic to a collector removes some of the functionality I am used to. I often use multiple PerParentLimitedQuery clauses in the same query to limit different aspects of the structures I query, e.g. for each best parent document I may want the best 5 children of type A and the best 10 children of type B. Type "A" docs are of a different size and shape to type "B" - hence the need for different limits. I realise this can be done with multiple searches using different collectors each time, but I use distributed servers that service all XML-based queries the same way. In these sorts of environments it is desirable to maintain a common query API (e.g. the XML syntax) and not introduce the need for special collectors to service different types of queries.

Paul, not sure on the exact details of your query (a parent with a single child containing A and B, or a parent with a child containing A and a potentially different child containing B?). I suspect a test case would help.

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

Mark, this may occur when one child contains A and another contains B, and at least one scorer (the one for A and/or the one for B) uses advance() to get to the nested doc. I'll add a test case here, but that could take a few days.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I opened #4244 to explore the single-pass approach.

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

I finally had some time to start taking a look at the grouping module and again at the patch here. There is too much code there for me to come up with a test case soon, so please don't wait for me before committing this.

An easy way to test this would be to have a boolean query with a required term and an optional term, with the optional term occurring, within a document group, in a document before (i.e. with a lower docId than) the document in the same group containing the required term.

In case I run into this I'll open a separate issue.

asfimport commented 13 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Below are 2 example tests searching employment resumes - both using the same optional and mandatory clauses but in subtly different ways. Question 1 is "who has Mahout skills and preferably used them at Lucid?" while the other question is "who has Mahout skills and preferably has been employed by Lucid?". The questions and the answers are different. Below is the XML test script I used to illustrate the data/queries used, define expected results and run as an executable test. Hopefully you can make sense of this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<Test description="NestedQuery tests">
    <Data>
        <Index name="ResumeIndex">
            <Analyzers class="org.apache.lucene.analysis.WhitespaceAnalyzer">
            </Analyzers>
            <Shard name="shard1">
                <!--  =============================================================== -->
                <Document pk="1">
                    <Field name="name">grant</Field>
                    <Field name="docType">resume</Field>
                </Document>
                <!--  =============================================================== -->
                        <Document pk="2">
                            <Field name="employer">lucid</Field>
                            <Field name="docType">employment</Field>
                            <Field name="skills">java lucene</Field>
                        </Document>
                <!--  =============================================================== -->
                        <Document pk="3">
                            <Field name="employer">somewhere else</Field>
                            <Field name="docType">employment</Field>
                            <Field name="skills">mahout and more mahout</Field>
                        </Document>
                <!--  =============================================================== -->
                <Document pk="4">
                    <Field name="name">sean</Field>
                    <Field name="docType">resume</Field>
                </Document>
                <!--  =============================================================== -->
                        <Document pk="5">
                            <Field name="employer">foo bar</Field>
                            <Field name="docType">employment</Field>
                            <Field name="skills">java</Field>
                        </Document>
                <!--  =============================================================== -->
                        <Document pk="6">
                            <Field name="employer">some co</Field>
                            <Field name="docType">employment</Field>
                            <Field name="skills">mahout mahout and more mahout</Field>
                        </Document>
            </Shard>
        </Index>
    </Data>
    <Tests>
        <Test description="Who knows Mahout and preferably used it *while employed at Lucid*?">
            <Query>
                <NestedQuery> 
                    <!-- testing properties of individual child employment docs -->
                   <Query>
                      <BooleanQuery>
                            <Clause occurs="must">
                                <TermsQuery fieldName="skills">mahout</TermsQuery>
                            </Clause>
                            <Clause occurs="should">
                                <TermsQuery fieldName="employer">lucid</TermsQuery>
                            </Clause>
                      </BooleanQuery>
                   </Query>
                   <ParentsFilter>  
                        <TermsFilter fieldName="docType">resume</TermsFilter>                        
                   </ParentsFilter> 
                </NestedQuery>
            </Query>
            <ExpectedResults why="Grant's tenure at Lucid is overlooked for scoring purposes 
                                   because it did not involve the required Mahout. Sean has more Mahout experience">
                            <Result fieldName="pk">4</Result>
                            <Result fieldName="pk">1</Result>
            </ExpectedResults>
        </Test>

        <!-- ==================================================================================== -->

        <Test description="Different question - who knows Mahout and preferably has been employed by Lucid?">
            <Query>
                <BooleanQuery>
                        <Clause occurs="must">
                            <NestedQuery> 
                                <!-- testing properties of one child employment docs -->
                               <Query>
                                    <TermsQuery fieldName="skills">mahout</TermsQuery>
                               </Query>
                               <ParentsFilter>  
                                    <TermsFilter fieldName="docType">resume</TermsFilter>                        
                               </ParentsFilter> 
                            </NestedQuery>
                        </Clause>
                        <Clause occurs="should">
                                <!-- Another NestedQuery testing properties of *potentially different* child employment docs -->
                            <NestedQuery> 
                               <Query>
                                    <TermsQuery fieldName="employer">lucid</TermsQuery>
                               </Query>
                               <ParentsFilter>  
                                    <TermsFilter fieldName="docType">resume</TermsFilter>                        
                               </ParentsFilter> 
                            </NestedQuery>
                        </Clause>
                    </BooleanQuery>
            </Query>
            <ExpectedResults why="Grant has the required Mahout skills plus the optional Lucid engagement">
                            <Result fieldName="pk">1</Result>
                            <Result fieldName="pk">4</Result>
            </ExpectedResults>
        </Test>
        <!-- ==================================================================================== -->
    </Tests>
</Test>

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

That is very nicely readable XML.

The problem might occur when a document with an optional term occurs before a document in the same group with a required term. So the second question is the one for which the problem might occur. The score value for Grant's resume should then be higher than the score value for Sean's. Testing only for the set of expected results is not enough for this particular query.

The problem might occur in another disguise when requiring both terms, and then the set of expected results is enough to test; but this is not as easily tested because one does not know beforehand the order in which the terms are going to be advance()d. The case with an optional term is simpler to test because the optional term is certain to be advance()d to compute the score value after the required term determines that there is a match (see ReqOptSumScorer.score()), and then, to be certain of the correct advance() on the optional term, one needs to test the score value.