Open asfimport opened 7 years ago
Adrien Grand (@jpountz) (migrated from JIRA)
This particular error means that there is a problem in the way your index is structured: you had at least one segment whose last document was not a parent doc. This is wrong because block joins work on blocks of documents that contain 0-n children followed by one parent, so the last document in a segment is necessarily a parent document.
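The invariant Adrien describes can be illustrated with a small toy checker (plain Python, not Lucene code; the function name and doc-flag encoding are hypothetical): in a segment modeled as a list of child/parent flags in index order, every block is 0-n children followed by their parent, so the segment's last document must be a parent.

```python
def check_block_structure(docs):
    """Toy model of the block-join invariant: docs is a list of 'c' (child)
    or 'p' (parent) flags in segment index order. Blocks are 0-n children
    followed by one parent, so the last document must be a parent."""
    if docs and docs[-1] != 'p':
        raise ValueError("last document is not a parent: broken block structure")
    return True

print(check_block_structure(['c', 'c', 'p', 'p']))  # True: both blocks end with a parent
```

A segment like `['c', 'p', 'c']` would fail this check, which is the shape of index the error message above is complaining about.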
Tim Underwood (@tpunder) (migrated from JIRA)
Thanks @jpountz! I'm trying to figure out if this is an issue on my side (very possible) or if it's a Solr or Lucene issue.
All my indexing goes through Solr (via SolrJ) and as far as I can tell I'm not attempting to index any child documents without a corresponding parent document. I'm not even sure if Solr or SolrJ would allow me to do that.
Does it make sense that optimizing the index would cause the problem to go away?
I think I was able to snag a copy of the index that was causing problems before the optimized version was able to replicate. Any suggestions/pointers for trying to track down whatever docs are problematic? Will running CheckIndex on it tell me anything useful?
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
@tpunder, it usually happens when a uniqueKey is duplicated, which causes the former parent doc to be deleted.
It can be verified with org.apache.lucene.search.join.CheckJoinIndex, although it doesn't have a main() method.
@jpountz, what if we invoke the CheckJoinIndex logic lazily somewhere in org.apache.lucene.search.join.QueryBitSetProducer.getBitSet(LeafReaderContext)? It won't cost much since it's lazy, and it would provide more predictable behaviour for users.
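Mikhail's suggestion can be sketched outside Lucene (Python sketch; the class and parameter names are hypothetical, not the actual Java API): run the CheckJoinIndex-style validation lazily, the first time a segment's parent bitset is requested, and cache the result so the check is paid at most once per segment.

```python
class CheckingBitSetProducer:
    """Hypothetical sketch of a lazily-validating bitset producer:
    the parent bitset for a segment is computed and validated on first
    request, then served from a per-segment cache on later requests."""

    def __init__(self, parent_filter, validate):
        self.parent_filter = parent_filter  # callable: segment -> parent bitset
        self.validate = validate            # callable: (segment, bitset), raises on a broken index
        self._cache = {}

    def get_bit_set(self, segment_id, segment):
        if segment_id not in self._cache:
            bits = self.parent_filter(segment)
            self.validate(segment, bits)    # fail fast on a broken block structure
            self._cache[segment_id] = bits  # validation cost is paid once per segment
        return self._cache[segment_id]
```

Because the check runs only when a segment's bitset is first needed and the result is cached, repeated queries against the same segment pay nothing extra.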
Tim Underwood (@tpunder) (migrated from JIRA)
Thanks @mkhludnev! Running CheckJoinIndex on my bad index (assuming I got my parentsFilter right) says:
java.lang.IllegalStateException: Parent doc 3324040 of segment _vfo(6.3.0):C28035360/10475131:delGen=86 is deleted but has a live child document 3323449
Running CheckJoinIndex on the optimized version of the index doesn't complain.
So... that leaves me wondering where the bug is. I am frequently (via Solr) re-indexing parent/child documents that duplicate existing documents based on my unique key field, but my understanding is that Solr should automatically delete the old parent and child documents for me. Maybe that's a bad assumption.
It looks like maybe I'm running into one or more of these issues: SOLR-5211, SOLR-5772, SOLR-6096, SOLR-6596, SOLR-6700
Sounds like I should probably just make sure I explicitly delete any old parent/child documents that I'm replacing to be on the safe side.
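The failure mode being discussed can be simulated with a toy model (plain Python, not Solr/Lucene code; names are hypothetical): re-indexing a parent by uniqueKey deletes only the old parent document, leaving its old children live, which is exactly the "Parent doc ... is deleted but has a live child document" state that CheckJoinIndex reports above.

```python
def find_orphan_children(docs, deleted):
    """Toy orphan detector: docs is a list of 'c'/'p' flags in segment
    order, deleted is a set of deleted doc ids. Returns live child doc
    ids whose block's parent has been deleted."""
    orphans, block = [], []
    for doc_id, kind in enumerate(docs):
        if kind == 'c':
            block.append(doc_id)
        else:  # a parent closes the current block
            if doc_id in deleted:
                orphans.extend(c for c in block if c not in deleted)
            block = []
    return orphans

# Re-indexing the parent (doc 2) by uniqueKey deleted only the parent,
# not its children (docs 0 and 1), leaving them orphaned:
print(find_orphan_children(['c', 'c', 'p', 'c', 'p'], deleted={2}))  # [0, 1]
```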
Tim Underwood (@tpunder) (migrated from JIRA)
I also noticed that I have some deleteByQuery calls that target parent documents but not their children (my assumption being that Solr or Lucene would also delete the corresponding child documents). Perhaps that is what is causing the orphan child documents. I'll be sure to explicitly delete those also.
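The safe pattern Tim settles on, deleting the whole block rather than just the parent, can be sketched with the same toy model (hypothetical helper, not a Solr API): expand every parent deletion to cover the children in its block.

```python
def expand_delete_to_block(docs, parents_to_delete):
    """Toy cascade delete: docs is a list of 'c'/'p' flags in segment
    order. Given a set of parent doc ids to delete, return the full set
    of doc ids to delete, including each parent's children."""
    to_delete, block = set(), []
    for doc_id, kind in enumerate(docs):
        if kind == 'c':
            block.append(doc_id)
        else:  # a parent closes the current block
            if doc_id in parents_to_delete:
                to_delete.update(block)   # delete the children too,
                to_delete.add(doc_id)     # not just the parent
            block = []
    return to_delete

# Deleting parent doc 2 also removes its children, docs 0 and 1:
print(sorted(expand_delete_to_block(['c', 'c', 'p', 'c', 'p'], {2})))  # [0, 1, 2]
```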
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
LUCENE-7674.patch introduces CheckingQueryBitSetProducer, which checks the parent segment's bitset before caching, and switches {!parent} and {!child} to use it. It fits in well, except for one interesting case: BJQParserTest.testGrandChildren(). When we have three levels (parent, child, grand-child) and search for children (the 2nd level), the bitset is required to include all ascendant levels (parents). This will break existing queries for those who run blocks of more than two levels. But such explicitly strict behavior solves problems for those who try to retrieve intermediate levels via [child]; I remember a couple of such threads on the list.
What do you think?
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
@tpunder, you've got everything right! Thanks for gathering those pet peeves into a list. Here is one more, SOLR-7606 - it's my favorite one. I need to tackle them sooner or later.
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
@jpountz, @uschindler, what's your opinion about CheckingQueryBitsetProducer and restricting multilevel blocks?
Adrien Grand (@jpountz) (migrated from JIRA)
It feels wrong to me that we enforce these rules at search time when they should be enforced at index time. I think the true fix to all these block join issues would be to make Solr aware of the queries that describe the parent and child spaces, rather than expecting users to provide them at search time. Once it knows that, it could reject update/delete operations that would break the block structure, fail queries that use a parent query that is not one of the expected ones, maybe add a FILTER clause to the child query to restrict it to the child space in case some fields are used at multiple levels, etc.
Uwe Schindler (@uschindler) (migrated from JIRA)
I agree with Adrien. The current block join support in Solr is a disaster, because it was released too early. Just nuke the broken APIs and create a new one, so that Solr internally knows from the schema/mapping how to block join and can also prevent malformed updates. This is worth a backwards compatibility break! Doing expensive runtime checks on every query just to keep a broken API/implementation is not a good idea. Break hard and come back with a better API; the users will still be happier, trust me. I know so many users who f*ck up block joins, as Solr does not enforce them correctly. Do the following:
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
Oh.. I've got your point, guys. Thanks. I'll probably raise a GSoC ticket and try to scratch the backlog.
David Smiley (@dsmiley) (migrated from JIRA)
+1 to Adrien's and Uwe's remarks. It was released too early.
Mikhail Khludnev (@mkhludnev) (migrated from JIRA)
Ok. I've started to scratch out the spec at SOLR-10144. Everybody is welcome. Meanwhile, I tried to reproduce this exact failure to come up with a more informative message. But it seems to be impossible: the recently redesigned BlockJoinQuery ignores children behind the last parent in a segment.
Started seeing this error message on a production Solr 6.3.0 system today making use of parent/child documents:
The "docId=2147483647" part seems suspicious since that corresponds to Integer.MAX_VALUE and my index only has 102,013,289 docs in it. According to the Solr searcher stats page I have:
numDocs: 71,870,998
maxDocs: 102,013,289
deletedDocs: 30,142,291
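The suspicious docId is explainable: Java's Integer.MAX_VALUE is 2^31 - 1 = 2147483647, and Lucene uses that value as the DocIdSetIterator.NO_MORE_DOCS sentinel for an exhausted iterator. A "docId=2147483647" in an error message therefore usually means the sentinel leaked into real doc-id handling, not that a document with that id exists. A quick check of the arithmetic:

```python
# Java's Integer.MAX_VALUE is 2^31 - 1, the same value Lucene uses
# as the DocIdSetIterator.NO_MORE_DOCS sentinel.
INTEGER_MAX_VALUE = 2**31 - 1
print(INTEGER_MAX_VALUE)  # 2147483647
```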
I took the query that was failing and attempted to intersect my parent query with the child query to find any problem docs but that came back with 0 results.
After performing an optimize (via the Solr UI) on the index the problem has gone away and the query that previously triggered this error works as it should.
Migrated from LUCENE-7674 by Tim Underwood (@tpunder), updated Feb 16 2017 Attachments: LUCENE-7674.patch, LUCENE-7674-attempt-to-reproduce.patch Linked issues: