apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Deletion by query of uncommitted docs not working with DV updates [LUCENE-7344] #8398

Open asfimport opened 8 years ago

asfimport commented 8 years ago

When DVs are updated, delete by query doesn't work with the updated DV value.


Migrated from LUCENE-7344 by Ishan Chattopadhyaya (@chatman), updated Dec 22 2020.
Attachments: LUCENE-7344_hoss_experiment.patch, LUCENE-7344.patch (versions: 6)

asfimport commented 8 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Here's a failing test indicating the situation. Can someone please review and verify that this is indeed a bug? Thanks.
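
A minimal sketch of the failing scenario (not the attached test itself), assuming the Lucene 6.x-era APIs used elsewhere in this thread; field names and values are illustrative:

    // Add a doc with an NDV field, update the DV in place, then delete by a query
    // that matches only the *updated* value – all before any commit.
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    Document doc = new Document();
    doc.add(new StringField("id", "doc-1", Field.Store.NO));
    doc.add(new NumericDocValuesField("val", 5L));
    writer.addDocument(doc);

    writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 7L);
    writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 7L, 7L, true, true));
    writer.commit();

    // Expected: doc-1 is deleted. Observed: the DBQ resolves against the
    // pre-update value (5), so doc-1 survives.
    DirectoryReader reader = DirectoryReader.open(dir);
    System.out.println("numDocs=" + reader.numDocs()); // 1 instead of the expected 0
    reader.close();
    writer.close();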

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is indeed a bug, but I can't yet see how we can fix it.

The hairy code that resolves deletes and DV updates (BufferedUpdatesStream.java) has a hard assumption that segments are write-once as far as resolving queries to docIDs is concerned, and updatable DVs break that assumption.

To properly fix this, at a high level, the code would need to do something along the lines of merge-sorting all DV updates and delete-by-query operations in order of their arrival at IndexWriter (docIDUpto proxies this) and then applying them in that order. After any DV update and just before a delete-by-query, we'd need to open a new LeafReader on that segment so it would reflect the DV updates (maybe just calling ReadersAndUpdates.writeFieldUpdates, and then pulling the reopened reader, would suffice for this).

But even if we can fix that, the code would then have an inherently terrible adversarial worst case of alternating DV updates and DBQs.

Rather than fix this bug, I would lean towards requiring/documenting that DBQ must not use queries that rely on updated doc values.

asfimport commented 8 years ago

Shai Erera (@shaie) (migrated from JIRA)

Hmm ... I had to refresh my memory of the DV updates code and I agree with @mikemccand that the fix is hairy (which goes hand-in-hand with the hairy BufferedUpdatesStream). The problem is that deleteByQuery uses the existing LeafReader, but the DV updates themselves were not yet applied so the reader is unaware of the change.

I changed the test to call updateDocument instead of updating the NDV, and the test passes. This is expected, because updating a document deletes the old one and adds a new document. So when the DBQ is processed, a LeafReader is opened on the new segment (it has to work that way because the new document isn't flushed yet), and that segment contains the new document with the updated NDV.
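
For concreteness, a rough sketch of that variant (full document replacement instead of an in-place NDV update), with the same illustrative field names as the other snippets in this thread:

    // updateDocument deletes the old doc and adds a new one, so the subsequent DBQ
    // is resolved against a segment that already contains the new value.
    Document newDoc = new Document();
    newDoc.add(new StringField("id", "doc-1", Field.Store.NO));
    newDoc.add(new NumericDocValuesField("val", 7L));
    writer.updateDocument(new Term("id", "doc-1"), newDoc);

    // This DBQ now sees val=7 and deletes the re-added doc-1, as expected:
    writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, true, true));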

I agree this is a bug only because updating a document followed by DBQ works as expected. The internals of how in-place updates are applied should not concern the user.

I wonder if we need to implement a complex merge-sorting approach as @mikemccand proposes, or whether applying the DV updates before processing any DBQ would be enough (ignoring the adversarial effects that Mike describes; they're real, but I'm setting them aside for the moment). I want to try that.

If that works, then perhaps we can detect whether a DBQ involves an NDV field (or a BDV field for that matter) and refresh the reader only then, or refresh the reader whenever there are DBQs and any DV updates, even if they are unrelated. But first I want to try to make the test pass, before we decide how to properly fix it.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe another option is to somehow make a wrapped LeafReader that exposes the point-in-time view of its delegate, but also exposes the live pending updated DVs when doc values are requested? It would need some way of saying how far (by docID) into those updates the reader should expose, per DBQ that's applied. This should avoid the adversarial cases, I think, or at least greatly minimize their impact.
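
A very rough sketch of the shape such a wrapper could take, assuming the pre-7.0 random-access NumericDocValues API; the class name and the pendingUpdates map are hypothetical, not code from any attached patch:

    // Exposes the delegate's point-in-time view, but answers doc values lookups for
    // one field from the in-memory pending updates.
    class PendingUpdatesLeafReader extends FilterLeafReader {
      private final String field;
      private final Map<Integer, Long> pendingUpdates; // docID -> updated value

      PendingUpdatesLeafReader(LeafReader in, String field, Map<Integer, Long> pendingUpdates) {
        super(in);
        this.field = field;
        this.pendingUpdates = pendingUpdates;
      }

      @Override
      public NumericDocValues getNumericDocValues(String f) throws IOException {
        final NumericDocValues delegate = in.getNumericDocValues(f);
        if (!field.equals(f) || delegate == null) {
          return delegate;
        }
        return new NumericDocValues() {
          @Override
          public long get(int docID) {
            // a real implementation would only expose updates up to the DBQ's docIDUpto;
            // this sketch simply overlays everything currently pending
            Long updated = pendingUpdates.get(docID);
            return updated != null ? updated : delegate.get(docID);
          }
        };
      }
    }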

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe a better option is to remove the doc values based queries. One really shouldn't be querying based on doc values, i.e. the values should be indexed as points if range querying is going to be used.
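
A short sketch of what that alternative looks like at indexing and delete time (the point field name is illustrative):

    // Index the value both as an updatable doc values field and as a point; range
    // deletes then run against the indexed point rather than the doc values.
    Document doc = new Document();
    doc.add(new StringField("id", "doc-1", Field.Store.NO));
    doc.add(new NumericDocValuesField("val", 5L)); // for sorting/faceting, updatable in place
    doc.add(new LongPoint("val_pt", 5L));          // indexed, for range queries
    writer.addDocument(doc);

    writer.deleteDocuments(LongPoint.newRangeQuery("val_pt", 5L, 10L));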

asfimport commented 8 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch applies the DBQ after resolving the DV updates. With this patch, if there were DV updates, then the SegmentState is updated with the up-to-date reader.

Note that this does not mean more work compared to what was done before – if there are no DV updates, writeFieldUpdates isn't called, and no reader is updated. If there were field updates, then writeFieldUpdates was called anyway, refreshing the internal reader.

This patch does not change the behavior, except it also updates the SegmentState.reader if there were DV updates.

@mikemccand what do you think? Our SegmentReader already refreshes only the DV updates; that is, it already maintains a view of the bare segment with the modified DV fields. Also, given what I wrote above, I don't believe that we're making more SR reopens?

asfimport commented 8 years ago

Shai Erera (@shaie) (migrated from JIRA)

After chatting with Mike about this, here's an example for an "interleaving" case that Mike mentioned, where this patch does not work:

    writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 17L);
    writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, true, true));
    writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 7L);

Here, "doc-1" should not be deleted, because the DBQ is submitted before the DV update, but because we resolve all DV updates before DBQ (in this patch), it ends up deleted. This is wrong of course. I'm looking into Mike's other idea of having a LeafReader view with the DV updates up until that document, and then ensuring DV updates / DBQs are applied in the order they were submitted. This starts to get very complicated.

asfimport commented 8 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Shai: I don't understand most of what you're saying, but in the interest of helping out with the easy stuff so you can focus on the hard stuff I've updated the patch with some additional tests...

asfimport commented 8 years ago

Shai Erera (@shaie) (migrated from JIRA)

I don't understand most of what you're saying

To clarify the problem, both for you but also in the interest of writing a detailed plan of the proposed solution: currently when a DBQ is processed, it uses the LeafReader without the NDV updates, and therefore has no knowledge of the updated values. This is relatively easily solved in the patch I uploaded, by applying the DV updates before the DBQ is processed. That way, the DBQ uses a LeafReader which is already aware of the updates and all works well.

However, there is an order to the update operations that occur in IndexWriter, and in our case it can be a mix of DBQs and NDV updates. So if we apply all the DV updates before any of the DBQs, we'll get incorrect results where a DBQ either deletes a document it shouldn't (see the code example above, and also what your testDeleteFollowedByUpdateOfDeletedValue shows), or fails to delete a document that it should.

To properly solve this problem, we need to apply the DV updates and DBQs in the order they were received (as opposed to applying them in bulk, as the current code does). Meaning, if the order of operations is NDVU1, NDVU2, DBQ1, NDVU3, DBQ2, DBQ3, NDVU4, then we need to do the following (see the sketch after the list):

  1. Apply NDVU1 + NDVU2; this will cause a new LeafReader to be created
  2. Apply DBQ1; using the already updated LeafReader
  3. Apply NDVU3; another LeafReader will be created, now reflecting all 3 NDV updates
  4. Apply DBQ2 and DBQ3; using the updated LeafReader from above
  5. Apply NDVU4; this will cause another LeafReader to be created
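
In code form, the idea looks roughly like the following; all names here are hypothetical and only show the shape of the loop, not actual IndexWriter/BufferedUpdatesStream internals:

    // Apply buffered NDV updates and DBQs in arrival order, reopening the segment's
    // LeafReader only when a DBQ follows pending DV updates.
    void applyInOrder(List<BufferedOp> opsInArrivalOrder, SegmentState segState) throws IOException {
      boolean pendingDVUpdates = false;
      for (BufferedOp op : opsInArrivalOrder) {
        if (op.isDVUpdate()) {
          segState.bufferUpdate(op);         // e.g. NDVU1, NDVU2: hold in memory for now
          pendingDVUpdates = true;
        } else {                             // a delete-by-query
          if (pendingDVUpdates) {
            segState.writeFieldUpdates();    // write updates, reopen the LeafReader (steps 1 and 3)
            pendingDVUpdates = false;
          }
          segState.applyDelete(op.query());  // the DBQ now sees the updated values (steps 2 and 4)
        }
      }
      if (pendingDVUpdates) {
        segState.writeFieldUpdates();        // trailing updates, e.g. NDVU4 (step 5)
      }
    }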

The adversarial effect in this case is that we cause 3 LeafReader reopens, each time (due to how NDV updates are currently implemented) writing the full DV field to a new stack. If you have many documents, that's going to be very expensive. And with a longer sequence of interleaved updates and deletes, it gets worse and worse.

And so here comes the optimization that Mike and I discussed above. Since the NDV updates are held in memory until they're applied, we can avoid flushing them to disk by creating a LeafReader which reads the original DV field plus the in-memory DV updates. Note though: not all DV updates, but only the ones that are relevant up until this point. So in the case above, that LeafReader will view only NDVU1 and NDVU2, and later it will be updated to view NDVU3 as well.

This is purely an optimization step and has nothing to do with correctness (of course, that optimization is tricky and needs to be implemented correctly!). Therefore my plan of attack in this case is:

  1. Have enough tests that try different cases before any of this is implemented. For example, Mike proposed above to have the LeafReader + DV field "view" use docIdUpto. I need to check the code again, but I want to make sure that if NDVU2, NDVU3 and NDVU4 (with the interleaving DBQs) all affect the same document, everything still works.
  2. Implement the less-efficient approach, i.e. flush the DV updates to disk before each DBQ is processed. This ensures that we have a proper solution implemented, and we leave the optimization to a later step (either literally a later commit, or just a different patch or whatever). I think this is complicated enough to start with.
  3. Improve the solution to avoid flushing DV updates between the DBQs, as proposed above.

testBiasedMixOfRandomUpdates

I briefly reviewed the test, but not thoroughly (I intend to). However, notice that committing (hard/soft ; commit/NRT) completely avoids the problem because a commit/NRT already means flushing DV updates. So if that's what this test does, I don't think it's going to expose the problem. Perhaps with the explanation I wrote above, you can revisit the test and make it fail though.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The performance of this seems really trappy; I am not sure we should do this?

Maybe we should just document the limitation unless there is a cleaner way.

asfimport commented 8 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Here, "doc-1" should not be deleted, because the DBQ is submitted before the DV update, but because we resolve all DV updates before DBQ (in this patch), it ends up deleted.

For Solr, this can never happen (in SolrCloud mode at least)... All updates are blocked during a DBQ, and then we re-open the reader before releasing the lock. All we need at the Solr level is to resolve DV updates before resolving deletes (since we know that the DBQ will always be last).

asfimport commented 8 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I briefly reviewed the test, but not thoroughly (I intend to). However, notice that committing (hard/soft ; commit/NRT) completely avoids the problem because a commit/NRT already means flushing DV updates. So if that's what this test does, I don't think it's going to expose the problem.

Understood – but I was trying to write a generally robust randomized test that could sometimes commit (in case that uncovered other problems), not just a test targeting this specific problem (we already have that).

Reading through the output, I realized the reason I wasn't seeing any failures even after many, many runs was that the spread of unique values in the DV field (Long.MIN_VALUE to Long.MAX_VALUE) was just too large relative to the size of the ranges I was using for the deletes.

I refactored the code into a helper method that is now called from multiple tests – so testBiasedMixOfRandomUpdates should now be functionally equivalent to what it was before this patch, but now we also have new test methods like testBiasedMixOfRandomUpdatesWithNarrowValuesAndDeletes (-1000L to 1000L), testBiasedMixOfRandomUpdatesWithDeletesAndCommits (using the full spectrum of valid Longs), testBiasedMixOfRandomUpdatesWithCommitsAndLotsOfDeletes (using the full spectrum of valid Longs, but really hammering with lots of deletes), etc.

I also added more checks of the expected values using NRT readers periodically (every 100 ops).

It's now easy to get failures (and AFAICT they are failures due to the bug/patch, not just silly test mistakes).


What this doesn't yet have (because it didn't occur to me, and because it hasn't come up yet in how Solr is trying to use updateNumericDocValue()) is tests that interleave multiple updateNumericDocValue() calls that affect multiple overlapping sets of docs with deleteDocuments() calls that affect subsets of those documents (e.g. delete by a BooleanQuery that wraps a DocValuesRangeQuery and another mandatory clause) ... I'll try tackling that next.

asfimport commented 8 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Updated patch...

What this doesn't yet have (...) is tests that interleave multiple updateNumericDocValue() calls that affect multiple overlapping sets of docs, with deleteDocuments() calls that affect subsets of those documents ...

...patch now includes this by adding "modN" fields containing the value of "doc-id % N", so that in some of the updateNumericDocValue() calls, instead of updating the value for a single doc via new Term("id","doc-42"), we can instead update many docs via new Term("mod7","mod-3"). As expected, only the *MixOf* tests that involve deletes fail with these changes.
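
For anyone following along, a small sketch of what those documents and updates look like (field and value spelling is illustrative):

    // Besides its unique id, each doc gets "modN" keyword fields whose value is the
    // doc's id modulo N, so a single updateNumericDocValue call can touch a whole
    // (overlapping) slice of documents.
    int id = 42;
    Document doc = new Document();
    doc.add(new StringField("id", "doc-" + id, Field.Store.NO));
    doc.add(new StringField("mod7", "mod-" + (id % 7), Field.Store.NO));
    doc.add(new NumericDocValuesField("val", (long) id));
    writer.addDocument(doc);

    // Single-doc update vs. many-docs update:
    writer.updateNumericDocValue(new Term("id", "doc-42"), "val", 1L);
    writer.updateNumericDocValue(new Term("mod7", "mod-3"), "val", 2L);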

(I skipped the "delete by BooleanQuery ..." part of my previous comment, because I realized it shouldn't matter as long as the set of docs matching the delete query directly correlates to the set of documents in any update)


Shai: if you can think of any additional tests you'd like to see I'll be happy to help write them.

asfimport commented 8 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Something Yonik pointed out in SOLR-5944 is that Solr could possibly work around this issue if we always opened an NRT Searcher/Reader prior to doing any deleteByQuery calls – but that ideally we could do that conditionally, based on knowing if/when any updateDocVal calls had happened on the IndexWriter since the last time an NRT Searcher/Reader was opened.

I didn't really understand that comment at the time, but after talking about it offline with shalin/ishan a bit, I realized it's essentially just doing what Shai suggested as the "To properly solve this problem..." solution, but putting the burden on the caller to "force" the IndexWriter to reopen its LeafReaders and return an up-to-date NRT Reader, so that IndexWriter will then use those up-to-date LeafReaders for applying subsequent deletes.
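
The caller-side shape of that workaround is roughly the following sketch (not the patch itself), using the standard NRT DirectoryReader.open(writer):

    // Force IndexWriter to resolve pending DV updates by opening (and immediately
    // closing) an NRT reader before the delete-by-query, so the DBQ is applied
    // against LeafReaders that already reflect the updated values.
    writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 7L);

    DirectoryReader nrt = DirectoryReader.open(writer); // NRT reopen
    nrt.close();

    writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, true, true));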

The LUCENE-7344.patch I'm attaching here includes a proof of concept of this workaround – written largely to convince myself it would actually work....

In this LUCENE-7344.patch, I have:

...with these changes, all tests seem to pass reliably.

Assuming we document that forcing a reader reopen like this is required for users in this situation (i.e. a caveat on the updateDocVals javadocs that doing something like this is required in order for subsequent deleteDocuments(Query) calls to succeed if the delete Query refers to the updated DocValues field), that could get us most of the way towards a low-impact (on the performance/complexity of IndexWriter) solution unless/until Shai can get a more optimized solution working – at which point we could just remove the caveat from the javadocs.

But...

I'm leery of a solution that:

  1. relies on users to read javadocs carefully
  2. relies on callers of deleteDocuments(Query) to know/realize if/when that Query involves DocValues and know if any other callers may have used update(Binary|Numeric)DocValues
    • particularly since the "Query" may have come from upstream code
  3. results in silent unexpected data loss when the rules spelled out in the javadocs aren't followed.

So with that in mind, I started wondering what it would take to:

The first seemed like it would be somewhat straightforward, and not strictly necessary for correctness, so I tried tackling the second one.

This experiment can be seen in the attached LUCENE-7344_hoss_experiment.patch (apply after LUCENE-7344.patch)

The crux of the change is adding new code to BufferedUpdatesStream.applyQueryDeletes to check whether there are any DVUpdates and, if there are, wrap the LeafReader in a FilterLeafReader that throws an exception if LeafReader.get(Binary|Numeric)DocValues(...) is called on any of the field names used in the DVUpdates – so any usage of the DocValues on those fields in the deletion Query will trigger an exception and callers will know something is wrong.
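
For illustration, a stripped-down sketch of such a guard against the 6.x-era FilterLeafReader API; the class name, exception type, and message are illustrative, and the actual experiment patch differs in detail:

    // Wraps the LeafReader used to resolve deletion queries; any doc values access
    // on a field that has pending DV updates fails loudly instead of silently
    // seeing stale values.
    class GuardedLeafReader extends FilterLeafReader {
      private final Set<String> fieldsWithPendingDVUpdates;

      GuardedLeafReader(LeafReader in, Set<String> fieldsWithPendingDVUpdates) {
        super(in);
        this.fieldsWithPendingDVUpdates = fieldsWithPendingDVUpdates;
      }

      @Override
      public NumericDocValues getNumericDocValues(String field) throws IOException {
        checkField(field);
        return in.getNumericDocValues(field);
      }

      @Override
      public BinaryDocValues getBinaryDocValues(String field) throws IOException {
        checkField(field);
        return in.getBinaryDocValues(field);
      }

      private void checkField(String field) {
        if (fieldsWithPendingDVUpdates.contains(field)) {
          throw new IllegalStateException(
              "delete-by-query may not use doc values field [" + field + "], which has pending updates");
        }
      }
    }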

For the most part this experiment seems viable – except that the new Exception also gets thrown on writer.close() (or when opening an NRT reader) when there have been dvUpdates after a deleteDocuments(Query) that uses the same DocValues field – for example:

    writer.addDocument(doc(0));
    writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, true, true));
    writer.updateNumericDocValue(new Term("id", "doc-0"), "val",  7L);

    // ...this would cause my new error checking code to throw CanNotDeleteByQueryOnUpdatedDocValues
    writer.close(); 

I'm not really sure why that's happening, since (afaict) the deleteDocuments(Query) should have already been applied and been taken into account by a (reopened) reader, so it seems like the queriesIter passed to applyQueryDeletes when the writer is closed should be empty? ... but whatever the reason, it can also be worked around by forcing a new reader after any deleteDocuments(Query) that will later be updated by an updateDocVal (or, depending on how you look at it, before the updateDocVal – toe-MAY-toe, toe-MAH-toe) ... see FORCE_NEW_READER_HACK_POST = true


In any case, I'm still hoping Shai's approach at an "internal" fix pans out – but I wanted to toss this out as a strawman solution that seems better than "don't call these methods in these ways or you may silently lose data".

asfimport commented 3 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I know it's been 4 years, but I want to add that a Lucene Query has a Weight with a relatively new method, isCacheable, that is a reasonable approximation for whether Solr DBQ processing could skip the realtimeSearcher reopen. This method must return false if there are any DocValues-dependent clauses in the Query. It can also return false in some other cases (e.g. a BooleanQuery with many clauses), but usually it will not.
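
A sketch of how such a check might look on the caller's side; the method and variable names are illustrative, while Weight.isCacheable(LeafReaderContext) is the API referred to above:

    // Conservative check: if any leaf reports the weight as not cacheable, treat the
    // delete query as possibly doc-values-dependent and reopen the realtime searcher
    // before applying the DBQ. (isCacheable can also be false for other reasons,
    // e.g. a BooleanQuery with many clauses, so this may over-trigger.)
    boolean mayDependOnDocValues(IndexSearcher searcher, Query deleteQuery) throws IOException {
      Query rewritten = searcher.rewrite(deleteQuery);
      Weight weight = searcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
      for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
        if (!weight.isCacheable(ctx)) {
          return true;
        }
      }
      return false;
    }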