Closed TejasRGitHub closed 4 months ago
Hi @TejasRGitHub - could this idea be an enhancement to the logic already in place in the ECS Scheduled Catalog Indexer Task, which runs every 6 hours (/backend/dataall/modules/catalog/tasks/catalog_indexer_task.py)?
This ECS task should be able to handle updates to data objects. Maybe we can extend the logic to also include deletes if some data objects no longer exist, or a similar type of logic if required?
Hi @noah-paige, thanks for pointing that out. I think we can use the existing Catalog Indexer and its ECS task and extend them to dataset objects. Currently I see that tables, folders and dashboards are indexed, and we could potentially just extend this to the dataset objects.
Although this by itself would solve the problem of indexing the dataset objects and re-indexing the Catalog, I was also wondering whether it would be helpful to be able to start the re-indexing process manually from the UI. The button would only be visible to data.all admins, who could start the indexer if needed.
@noah-paige, @zsaltys, @anushka-singh, @rbernotas any thoughts on the above?
Hi @TejasRGitHub, I agree with using the ECS catalog indexer task and extending it to datasets. A data.all admin can trigger the ECS task on demand directly in ECS (with ECS API commands). Do you think that is enough? Or should we add UI functionality? Curious to hear other people's thoughts.
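For reference, triggering the task on demand through the ECS API could look roughly like the sketch below. It is not taken from the data.all codebase: the helper names `build_run_task_request` and `trigger_catalog_indexer` are hypothetical, and it assumes the indexer runs as a Fargate task in private subnets. The only AWS call used is the ECS `run_task` operation, and the client is passed in so the function works with a `boto3.client("ecs")` instance or a stub.

```python
from typing import Any, Dict, List, Sequence


def build_run_task_request(
    cluster: str,
    task_definition: str,
    subnets: Sequence[str],
    security_groups: Sequence[str],
) -> Dict[str, Any]:
    """Build keyword arguments for the ECS run_task call.

    Assumption: the catalog indexer is a Fargate task using awsvpc
    networking without a public IP. Adjust for the real deployment.
    """
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
    }


def trigger_catalog_indexer(ecs_client: Any, **kwargs: Any) -> List[str]:
    """Start one run of the indexer task and return the started task ARNs.

    ecs_client is expected to behave like boto3.client("ecs").
    """
    response = ecs_client.run_task(**build_run_task_request(**kwargs))
    return [task["taskArn"] for task in response.get("tasks", [])]
```

With real credentials this would be called as `trigger_catalog_indexer(boto3.client("ecs"), cluster=..., task_definition=..., subnets=[...], security_groups=[...])`, where the cluster and task definition names are those of the actual deployment. This still requires AWS access for the admin, which is the gap a UI button would close.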
Hi @dlpzx, I think we should have a UI for triggering this functionality on the fly. A separate UI to delete and update indexes would also be good to have. Currently, deleting an index on serverless OpenSearch is a tedious process that requires setting up an EC2 instance to reach the OpenSearch cluster. A UI that is only visible to admins would be very helpful here, I think.
This feature to allow Admins to re-index the data.all Catalog has been implemented in PR #1365
It allows Admins to run re-index catalog tasks to sync catalog objects with data.all DB and optionally delete any orphaned resources on-demand
Closing this issue by EOD today - please do let us know if there are any additional follow-ups or concerns
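To illustrate what "sync with the DB and optionally delete orphaned resources" means, here is a minimal sketch of the reconciliation step. It is not the code from the PR: `plan_reindex` and its parameters are hypothetical names, and it assumes catalog documents and DB records share the same identifiers.

```python
from typing import Iterable, Set, Tuple


def plan_reindex(
    db_ids: Iterable[str],
    indexed_ids: Iterable[str],
    delete_orphans: bool = False,
) -> Tuple[Set[str], Set[str]]:
    """Return (ids to upsert into the index, ids to delete from the index).

    Every object still present in the DB is re-indexed, so stale documents
    are refreshed as well as missing ones being added. Documents whose id no
    longer exists in the DB are orphans and are deleted only when
    delete_orphans is True.
    """
    db = set(db_ids)
    indexed = set(indexed_ids)
    to_upsert = db
    to_delete = indexed - db if delete_orphans else set()
    return to_upsert, to_delete


# Example: "c" was removed from the DB by hand but is still in the index.
upserts, deletes = plan_reindex(["a", "b"], ["b", "c"], delete_orphans=True)
```

Keeping the orphan deletion behind a flag matches the "optionally" in the description above: a plain re-index never removes documents, so an admin has to opt in to the destructive part.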
Is your idea related to a problem? Please describe. In data.all, whenever a change happens on a dataset, the indexer is run and it automatically updates the OpenSearch index.
There can be situations in which a data.all admin manually edits, updates or deletes records in RDS that pertain to datasets. In this case, the catalog index is not updated.
Moreover, when a dataset is being mutated and an unhandled error occurs, some things on the dataset might get updated while others don't. This might happen during testing or if certain bugs are present. In this case, the data.all admin has to manually update the OpenSearch index.
Describe the solution you'd like A way in data.all for the data.all admin to easily re-sync/update the OpenSearch index. This could be a UI button that is only displayed to data.all admins, OR a config in cdk.json so that the OpenSearch index is updated and re-synced during deployment.