data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Automatic migration for custom features like custom confidentiality mapping, etc. #1077

Closed TejasRGitHub closed 3 months ago

TejasRGitHub commented 8 months ago

Is your idea related to a problem? Please describe. Feature https://github.com/data-dot-all/dataall/pull/1049 introduces custom mapping for confidentiality levels. When a user switches to this custom mapping, the previously assigned confidentiality levels remain unchanged, so the user has to go to each dataset and edit it to reflect the new confidentiality level. Another option would be to update the RDS database with the new confidentiality levels via a script, but then the Catalog index won't get updated, and the process is error-prone.

Describe the solution you'd like For such custom resources, migration could happen on the dataset stack, i.e. when the stacks updater runs, we could check whether a custom mapping exists, update the existing records, and also update the catalog.

This could be extended by introducing a cdk.json config flag such as update_stacks_with_custom_resources: when set to True, every place where custom features are used and enabled would update the database (and any other affected stores) accordingly.


noah-paige commented 8 months ago

Please correct my understanding if I am mistaken:

The proposed solution would then be to add some additional logic based on config.json values to update data.all resources. In the scenario above related to PR #1049, the stacks updater would:

I am thinking through whether a feature as described above would be generally useful. If this is a one-time migration script tied only to PR #1049, it will likely just introduce technical debt later on. However, if it is designed to be used generally and we see additional applications for the above logic, then it could be a nice enhancement.

...curious to hear some other thoughts as well @dlpzx @petrkalos @SofiaSazonova @TejasRGitHub

TejasRGitHub commented 8 months ago

Hi @noah-paige, thanks for your comments.

Yes, you are right. The config could be introduced in cdk.json as update_stacks_with_custom_resources, and based on that config the custom updating step would run. This step would then read config.json to fetch the custom mapping and perform the update accordingly.

I understand your concern that it might introduce technical debt. I think it could be helpful in the future whenever custom features are added and need migration (for example, if a custom list of topics were introduced, this mechanism would help migrate existing topics). However, although the migration is only needed once, the step would run every time the stacks updater runs (unless update_stacks_with_custom_resources is set back to False in cdk.json and the deployment is updated), which is not a good thing.
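One common pattern for the re-run concern (not discussed in the thread, offered here only as an illustration) is to record a marker for each completed migration so the step becomes idempotent, similar to how alembic tracks applied revisions. A minimal sketch, with all names hypothetical:

```python
def run_once(migration_id, applied, migration_fn):
    """Run migration_fn only if migration_id has not been applied yet.

    migration_id: a unique string for this migration, e.g. a PR number.
    applied: a set of already-applied ids (in practice this would be a
    small tracking table in RDS, not an in-memory set).
    Returns True if the migration ran, False if it was skipped.
    """
    if migration_id in applied:
        return False
    migration_fn()
    applied.add(migration_id)
    return True
```

With this guard, the stacks updater could keep the flag enabled permanently and still execute each custom migration exactly once.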

I would like to hear if there are better ways of doing this and also if this is something that needs to be incorporated in data.all in the first place.

dlpzx commented 7 months ago

Hello @TejasRGitHub, thanks for opening an issue. In my opinion, an automation should only be considered if customers need to migrate catalog-indexed metadata often (RDS migrations are already solved through alembic backfilling) and it is not a one-time operation. In addition, it is a very custom feature: every customer might have their own migration behavior. For example, one customer might merge two labels into one, while another just renames or deletes a label, which adds complexity to the feature.

I would approach this issue from a different angle. The problem is not that it is impossible to migrate the labels; the problem is that we have to do it manually. I think we could use the CLI/SDK feature to provide runbooks and examples for automating these manual tasks, maybe by opening a repo "dataall-sdk-cli-samples". What do you think?
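As a rough illustration of what such a runbook script might contain, here is a sketch that builds a request body for a label update against the data.all backend API. The mutation name, variable names, and input fields below are assumptions for illustration only; a real sample would use the actual operations exposed by the data.all API:

```python
import json


def build_update_payload(dataset_uri, new_confidentiality):
    """Build a GraphQL-style request body that remaps one dataset's
    confidentiality label. The 'updateDataset' operation and its input
    shape are hypothetical placeholders, not confirmed data.all APIs.
    """
    mutation = (
        "mutation updateDataset($datasetUri: String!, $input: ModifyDatasetInput) { "
        "updateDataset(datasetUri: $datasetUri, input: $input) { datasetUri } }"
    )
    return json.dumps(
        {
            "query": mutation,
            "variables": {
                "datasetUri": dataset_uri,
                "input": {"confidentiality": new_confidentiality},
            },
        }
    )
```

A runbook in a samples repo could loop over the datasets returned by a list query, call a builder like this for each, and POST the payloads to the backend, replacing the manual per-dataset editing described above.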

TejasRGitHub commented 7 months ago

Hi @dlpzx, I can see how complex it might get.

Your approach of using the CLI/SDK for data.all to run the upgrade would work. A "dataall-sdk-cli-samples" repo would be a great place to keep this, although we would have to add a README with instructions on using the code in that repo to migrate, which is fine.

Could you please share the link to the SDK/CLI code?

dlpzx commented 3 months ago

This issue will be closed soon due to inactivity. Referencing #950 to keep it in mind when designing runbooks and samples

TejasRGitHub commented 3 months ago

Yes, this can be closed. We were able to do a one-time migration, but I absolutely think such tasks would be handled more easily with the data.all CLI and SDK.