acryldata / datahub-helm

Repository of helm charts for deploying DataHub on a Kubernetes cluster
Apache License 2.0
160 stars 239 forks source link

fix(restore-indices): always clear search and graph service before restoring indices #475

Closed Masterchen09 closed 2 weeks ago

Masterchen09 commented 3 months ago

I recently played around a bit with the restore-indices job and noticed that under some circumstances the restoration might have an inconsistent result. If we say that the database is the single-point-of-truth of everything and especially when restoring the indices, then the search index should only contain entities which are also present in the database after restoring indices.

Let's assume that the search index and the database are getting out of sync and entities are deleted - for whatever reason the deletion is not processed by the mae-consumer, which means that the entities were not deleted in the search index, then there is currently no possibility with the existing restore-indices job to remove these entities in the search index - the current restore-indices job will only add missing entities, but will not delete non-existent entities.

There is the possibility to clean-up the indices before restoring the indices, however this is not enabled by default in the job template - I think to have consistent results after restoring indices the indices should always be cleaned-up before.

One drawback of this would be, that when the restore-indices job is started DataHub would not show anything, because the indices are cleaned-up...alternatively, instead of adding the "-a clean" option to the existing job template, there could be a second job template which includes this option (e.g., "datahub-clean-restore-indices-job-template"). Please suggest what's the best approach here - I am happy to change the PR accordingly. 😊

Checklist

github-actions[bot] commented 2 months ago

This PR is stale. We will close it in 30 days if there is no comment or activity. If you want feedback but not able to get it on github please head to #contribute channel in slack at https://slack.datahubproject.io.

github-actions[bot] commented 1 month ago

This PR is stale. We will close it in 30 days if there is no comment or activity. If you want feedback but not able to get it on github please head to #contribute channel in slack at https://slack.datahubproject.io.

Masterchen09 commented 1 month ago

To make sure that this PR is not automatically closed - here’s a comment.

darnaut commented 1 month ago

I think to have consistent results after restoring indices the indices should always be cleaned-up before.

@Masterchen09 This is nice in theory but it effectively creates problems for anyone that has to restart/resume the job for whatever reason, or if just trying the job for the first time. If this is a default that makes sense for your setup, please feel free to add it behind a flag. Otherwise, such behavior that leads to data deletion should not be enabled by default.

Masterchen09 commented 1 month ago

@darnaut What about a second job template where the option is enabled? Using the datahubUpgrade.restoreIndices.image.args option it is already possible to adjust the existing job template, but you do not have both options available and changing the existing job „afterwards“ (or on demand) would make a redeployment of the chart necessary…

darnaut commented 1 month ago

What about a second job template where the option is enabled?

So the tradeoff here is yet another duplicated template that needs to be maintained, or using the existing args option for your use case. Unless there are other users expressing the need for this particular flavor of the restore-indices template, I would strongly suggest using the available options.

Masterchen09 commented 2 weeks ago

@darnaut Then we should at least document it a bit more clearly that, without cleaning-up the indices, there could be inconsistencies between the local database and the indices. I can live with modifying the arguments for the restore indices job, I just wanted to make sure that it is also consistent for other users, because I suspect that others are not aware of the potential inconsistencies...what is a data catalog without consistency? 😉

See here for a proposal regarding the documentation: https://github.com/datahub-project/datahub/pull/11380

Masterchen09 commented 2 weeks ago

As the PR with the enhancements of the documentation was already merged, I will close this PR here. ;-)