data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Re-index Open search catalog #1078

Closed TejasRGitHub closed 4 months ago

TejasRGitHub commented 8 months ago

Is your idea related to a problem? Please describe. In data.all, whenever a dataset changes, the indexer runs and automatically updates the OpenSearch index.

There can be situations in which a data.all admin manually edits, updates, or deletes records in RDS that pertain to datasets. In this case, the catalog index is not updated.

Moreover, when a dataset is being mutated and an unhandled error occurs, some things on the dataset might get updated while others don't. This might happen during testing or if certain bugs are present. In this case, the data.all admin has to update the OpenSearch index manually.

Describe the solution you'd like A way in data.all for the data.all admin to easily re-sync/update the OpenSearch index. This could be a UI button that is only displayed to data.all admins, or a config in cdk.json so that the OpenSearch index is updated and re-synced during deployment.


noah-paige commented 8 months ago

Hi @TejasRGitHub - could this idea be an enhancement to the logic already in place in the ECS Scheduled Catalog Indexer Task which runs every 6 hours (/backend/dataall/modules/catalog/tasks/catalog_indexer_task.py)?

This ECS task should be able to handle updates to data objects. Maybe we can extend the logic to also include deletes if some data objects no longer exist, or add similar logic if required?

TejasRGitHub commented 8 months ago

Hi @noah-paige, thanks for pointing that out. I think we can use the existing Catalog Indexer ECS task and extend it to dataset objects. Currently I see that tables, folders and dashboards are indexed, and we could potentially just extend that to dataset objects.

Although this by itself would solve the problem of indexing dataset objects and re-indexing the catalog, I was also wondering whether it would be helpful to be able to start the re-indexing manually from the UI. The button would only be visible to data.all admins, who could start the indexer when needed.

@noah-paige, @zsaltys, @anushka-singh, @rbernotas any thoughts on the above?

dlpzx commented 7 months ago

Hi @TejasRGitHub, I agree with using the ECS catalog indexer task and extending it to datasets. A data.all admin can trigger the ECS task on demand directly in ECS (with ECS API commands). Do you think that is enough? Or should we add UI functionality? Curious to hear other people's thoughts.
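
For reference, triggering the task on demand through the ECS API could look roughly like the sketch below. It relies on the `run_task` call of a boto3 ECS client; the function names are made up for illustration, the Fargate launch type and private networking are assumptions, and the cluster, task definition, subnet and security group values have to come from the actual deployment.

```python
from typing import Any, Dict, List


def build_run_task_request(
    cluster: str,
    task_definition: str,
    subnets: List[str],
    security_groups: List[str],
) -> Dict[str, Any]:
    """Build the keyword arguments for the ECS RunTask API call."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",  # assumption: the indexer task runs on Fargate
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",  # assumption: private subnets
            }
        },
    }


def trigger_catalog_indexer(ecs_client: Any, **request: Any) -> str:
    """Start one task and return its ARN; raise if ECS reports a failure."""
    response = ecs_client.run_task(**request)
    if response.get("failures") or not response.get("tasks"):
        raise RuntimeError(f"ECS did not start the task: {response.get('failures')}")
    return response["tasks"][0]["taskArn"]
```

With boto3 installed and admin credentials configured, `ecs_client` would be `boto3.client("ecs")`. Passing the client in keeps the request-building logic easy to check without touching AWS.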

TejasRGitHub commented 6 months ago

Hi @dlpzx, I think we should have a UI for triggering this functionality on the fly. A separate UI to delete and update indexes would also be good to have. Currently, deleting an index on serverless OpenSearch is a tedious process that involves setting up an EC2 instance to reach the OpenSearch cluster. A UI that is only visible to admins would be very helpful here, I think.

noah-paige commented 4 months ago

This feature, which allows admins to re-index the data.all Catalog, has been implemented in PR #1365.

It allows admins to run re-index catalog tasks on demand to sync catalog objects with the data.all DB and optionally delete any orphaned resources.

Closing this issue by EOD today. Please let us know if there are any additional follow-ups or concerns.
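
The sync-and-optionally-delete behaviour described above can be illustrated with a small, self-contained sketch. This is not the code from PR #1365: the function names are hypothetical, and plain dicts stand in for both the data.all DB and the OpenSearch index so that only the reconciliation logic is shown.

```python
from typing import Dict, Iterable, Set, Tuple


def plan_reindex(
    db_docs: Dict[str, dict], indexed_ids: Iterable[str]
) -> Tuple[Dict[str, dict], Set[str]]:
    """Return (documents to upsert, orphaned ids that exist only in the index)."""
    indexed = set(indexed_ids)
    # Re-write every object found in the DB so stale fields get corrected too.
    upserts = dict(db_docs)
    orphans = indexed - set(db_docs)
    return upserts, orphans


def reindex(
    db_docs: Dict[str, dict],
    index: Dict[str, dict],
    delete_orphans: bool = False,
) -> Dict[str, int]:
    """Sync the in-memory stand-in index with the DB and report what changed."""
    upserts, orphans = plan_reindex(db_docs, index.keys())
    index.update(upserts)
    deleted = 0
    if delete_orphans:
        for doc_id in orphans:
            del index[doc_id]
            deleted += 1
    return {"upserted": len(upserts), "deleted": deleted}
```

Keeping orphan deletion behind a flag means a plain re-index never removes anything, and clean-up only happens when an admin asks for it explicitly.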