Open asfimport opened 2 years ago
Nhat Nguyen (@dnhatn) (migrated from JIRA)
+1. That'd be great. We can avoid writing docValues in refresh with soft deletes.
Nhat Nguyen (@dnhatn) (migrated from JIRA)
Soft-deletes can index already-soft-deleted documents. We need to make sure to support this feature when switching to LiveDocsFormat.
Rushabh Shah (@shahrs87) (migrated from JIRA)
@jpountz I am new to Lucene project. This will be 2nd issue. Given that this is a minor task, I would like to create a PR for this. Can you please elaborate the steps needed to tackle this issue ? Also if you can point me to some classes relevant to this patch where I can read more about the existing behavior, that would be helpful. Thank you.
Adrien Grand (@jpountz) (migrated from JIRA)
@shahrs87 I set the priority to minor, but in my opinion this is a pretty hard task, so I'm not sure it's a good fit for a 2nd issue unless you're already very familiar with how Lucene handles file formats.
Rushabh Shah (@shahrs87) (migrated from JIRA)
> unless you're already very familiar with how Lucene handles file formats.
@jpountz Thank you for the reply. I am not at-all familiar with the file formats. Can you suggest some blog/article or some class names where I can learn more about the different file formats ?
Hi @jpountz @dnhatn, I have been looking at this issue and studying the code for soft delete, and have some general understanding of the complexity of this issue. If I understand it correctly, here are the primary major work that needs to be done if we were to migrate soft delete handling to live docs:
SoftDeletesRetentionMergePolicy
to apply / combine soft and hard delete bitsets, so that the upper layers do not get impacted by liv format change.However, I do have a few questions but couldn't seems to find answers in the code / comment / original task tickets. I'm wondering if you may have some insights in these?
MergePolicy policy =
new SoftDeletesRetentionMergePolicy(
"soft_delete", () -> new DocValuesNumbersQuery("soft_delete", Collections.list(2L, 3L, 100L)), new LogDocMergePolicy());
@zacharymorn Thank you for looking into the issue.
The first proposals of the API: https://issues.apache.org/jira/browse/LUCENE-8198 and https://issues.apache.org/jira/browse/LUCENE-8200.
Is that something we could deprecate directly?
Good point, but I don't think we should deprecate it until we figure out how to index already-soft-deleted documents.
Thanks @dnhatn for the pointers! I took some further look but still couldn't seems to find a comparison discussion between doc value approach vs. encoding it into liv doc initially?
@s1monw @rmuir not sure if you may recall any for the following question?
Was there any historical context / reason of not encoding soft delete into liv doc in the first place? From https://issues.apache.org/jira/browse/LUCENE-8233, it looks like the API design were proposed and accepted to give users some flexibility?
Soft-deletes can index already-soft-deleted documents. We need to make sure to support this feature when switching to LiveDocsFormat. I don't think we should deprecate it until we figure out how to index already-soft-deleted documents.
Sorry I had forgotten to ask about this earlier. By "index already-soft-deleted documents", are you referring to the "undelete" feature as mentioned in https://issues.apache.org/jira/browse/LUCENE-8198 ?
By "index already-soft-deleted documents", are you referring to the "undelete" feature as mentioned in https://issues.apache.org/jira/browse/LUCENE-8198 ?
They are different things. I meant the capacity to index a tombstone document as in this test: https://github.com/apache/lucene/blob/73e2ae2705ed525a036972f6df089af478f3b534/lucene/core/src/test/org/apache/lucene/index/TestSoftDeletesRetentionMergePolicy.java#L461-L464
yeah i personally disagree with this issue. I don't think there is any hack. why make livedocsformat more complicated for no good reason?
The tomb-stoning ability is likely one of the major reasons why I used docvalues for this feature. The other was that it didn't require to change an existing format and it's API for it. We have always been conservative with changing stable and public APIs especially on that level unless there was a good reason for it. I can see the benefit of having a dedicated soft-live format if we are willing to loose the tombstone ability (with a payload) since as @dnhatn pointed out correctly we would not have to write the DV on refresh for the soft deletes field. Yet, I doubt it makes a significant difference.
Thanks @dnhatn @rmuir @s1monw for the additional information! Yeah I can see now how changing it to use liv doc and not relying on an explicit field, will potentially require changes to the softUpdateDocument
API to differentiate between "regular" doc vs. tombstone doc, and also make liv doc format itself and its usage more complicated (indeed nothing can beat bitset in terms of simplicity!). I'll pause exploring on this from my end then, but will be happy to work on it further if there's any preference change down the road.
We have been using doc values to handle soft deletes until now, but this is a bit of a hack as it: - forces users to reserve a field name for doc values
It would also be more natural to have both hard and soft deletes handled by the same file format?
Migrated from LUCENE-10392 by Adrien Grand (@jpountz), updated May 17 2022