apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.58k stars 1.01k forks source link

Handle soft deletes via LiveDocsFormat [LUCENE-10392] #11428

Open asfimport opened 2 years ago

asfimport commented 2 years ago

We have been using doc values to handle soft deletes until now, but this is a bit of a hack as it:  - forces users to reserve a field name for doc values

It would also be more natural to have both hard and soft deletes handled by the same file format?


Migrated from LUCENE-10392 by Adrien Grand (@jpountz), updated May 17 2022

asfimport commented 2 years ago

Nhat Nguyen (@dnhatn) (migrated from JIRA)

+1. That'd be great. We can avoid writing docValues in refresh with soft deletes.

asfimport commented 2 years ago

Nhat Nguyen (@dnhatn) (migrated from JIRA)

Soft-deletes can index already-soft-deleted documents. We need to make sure to support this feature when switching to LiveDocsFormat.

asfimport commented 2 years ago

Rushabh Shah (@shahrs87) (migrated from JIRA)

@jpountz  I am new to Lucene project. This will be 2nd issue. Given that this is a minor task, I would like to create a PR for this. Can you please elaborate the steps needed to tackle this issue ? Also if you can point me to some classes relevant to this patch where I can read more about the existing behavior, that would be helpful. Thank you.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@shahrs87 I set the priority to minor, but in my opinion this is a pretty hard task, so I'm not sure it's a good fit for a 2nd issue unless you're already very familiar with how Lucene handles file formats.

asfimport commented 2 years ago

Rushabh Shah (@shahrs87) (migrated from JIRA)

> unless you're already very familiar with how Lucene handles file formats.

@jpountz Thank you for the reply. I am not at-all familiar with the file formats. Can you suggest some blog/article or some class names where I can learn more about the different file formats ?

zacharymorn commented 1 year ago

Hi @jpountz @dnhatn, I have been looking at this issue and studying the code for soft delete, and have some general understanding of the complexity of this issue. If I understand it correctly, here are the primary major work that needs to be done if we were to migrate soft delete handling to live docs:

  1. Define a new format for liv to encode soft delete (and the naive way to do this is probably to have two bits per doc to indicate hard and soft deletes?).
  2. Update SoftDeletesRetentionMergePolicy to apply / combine soft and hard delete bitsets, so that the upper layers do not get impacted by liv format change.
  3. Update soft delete APIs to remove the need to specify a dedicated dv field.
  4. Update soft delete internal logic to potentially build and maintain a separate bitset from hard delete, and write them both to disc.

However, I do have a few questions but couldn't seems to find answers in the code / comment / original task tickets. I'm wondering if you may have some insights in these?

  1. Was there any historical context / reason of not encoding soft delete into liv doc in the first place? From https://issues.apache.org/jira/browse/LUCENE-8233, it looks like the API design were proposed and accepted to give users some flexibility?
  2. Given now applications can technically write any numeric value into the soft delete field, and may somehow use that value as part of the soft delete retention query, migrating the encoding into a potentially binary liv format may not support that use case any more? Is that something we could deprecate directly? Here's an example of such a retention query:
MergePolicy policy =
        new SoftDeletesRetentionMergePolicy(
            "soft_delete", () -> new DocValuesNumbersQuery("soft_delete", Collections.list(2L, 3L, 100L)), new LogDocMergePolicy());
dnhatn commented 1 year ago

@zacharymorn Thank you for looking into the issue.

The first proposals of the API: https://issues.apache.org/jira/browse/LUCENE-8198 and https://issues.apache.org/jira/browse/LUCENE-8200.

Is that something we could deprecate directly?

Good point, but I don't think we should deprecate it until we figure out how to index already-soft-deleted documents.

zacharymorn commented 1 year ago

Thanks @dnhatn for the pointers! I took some further look but still couldn't seems to find a comparison discussion between doc value approach vs. encoding it into liv doc initially?

@s1monw @rmuir not sure if you may recall any for the following question?

Was there any historical context / reason of not encoding soft delete into liv doc in the first place? From https://issues.apache.org/jira/browse/LUCENE-8233, it looks like the API design were proposed and accepted to give users some flexibility?


Soft-deletes can index already-soft-deleted documents. We need to make sure to support this feature when switching to LiveDocsFormat. I don't think we should deprecate it until we figure out how to index already-soft-deleted documents.

Sorry I had forgotten to ask about this earlier. By "index already-soft-deleted documents", are you referring to the "undelete" feature as mentioned in https://issues.apache.org/jira/browse/LUCENE-8198 ?

dnhatn commented 1 year ago

By "index already-soft-deleted documents", are you referring to the "undelete" feature as mentioned in https://issues.apache.org/jira/browse/LUCENE-8198 ?

They are different things. I meant the capacity to index a tombstone document as in this test: https://github.com/apache/lucene/blob/73e2ae2705ed525a036972f6df089af478f3b534/lucene/core/src/test/org/apache/lucene/index/TestSoftDeletesRetentionMergePolicy.java#L461-L464

rmuir commented 1 year ago

yeah i personally disagree with this issue. I don't think there is any hack. why make livedocsformat more complicated for no good reason?

s1monw commented 1 year ago

The tomb-stoning ability is likely one of the major reasons why I used docvalues for this feature. The other was that it didn't require to change an existing format and it's API for it. We have always been conservative with changing stable and public APIs especially on that level unless there was a good reason for it. I can see the benefit of having a dedicated soft-live format if we are willing to loose the tombstone ability (with a payload) since as @dnhatn pointed out correctly we would not have to write the DV on refresh for the soft deletes field. Yet, I doubt it makes a significant difference.

zacharymorn commented 1 year ago

Thanks @dnhatn @rmuir @s1monw for the additional information! Yeah I can see now how changing it to use liv doc and not relying on an explicit field, will potentially require changes to the softUpdateDocument API to differentiate between "regular" doc vs. tombstone doc, and also make liv doc format itself and its usage more complicated (indeed nothing can beat bitset in terms of simplicity!). I'll pause exploring on this from my end then, but will be happy to work on it further if there's any preference change down the road.