data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business users, analysts, and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Minimize the number of unique Glue DBs created when creating shares #746

Closed · zsaltys closed this issue 7 months ago

zsaltys commented 1 year ago

Is your idea related to a problem? Please describe.

A typical data pipeline on AWS will use tools like AWS EMR (big data computation clusters) and AWS MWAA (job orchestration). Both EMR and MWAA run with their own IAM execution roles, and these roles need access to resources like S3 buckets or Glue tables. For example, say I want to run a Spark job every hour that reads from a table shared with me via data.all. MWAA runs with its own IAM role and needs to query Glue to check whether a partition is available before creating an EMR cluster, and the EMR cluster itself runs with an IAM role that needs S3 access to read the table. We now have two consumer roles that need access, and data.all currently requires a separate share request for each consumer role. This works but creates a problem: every share creates its own unique LakeFormation resource link, which effectively means that in our application configuration we have to specify that MWAA must use database name X while EMR must use database name Y. The problem compounds with every additional IAM role.
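To make the pain concrete, here is a minimal sketch of the orchestrator side when each consumer role gets its own resource-link database. All names below (databases, table, partition values) are hypothetical, invented purely for illustration:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical resource-link DB created for the MWAA role's share request:
MWAA_DB = "sales_dataset_shared_a1b2c3d4"
# Hypothetical resource-link DB created for the EMR role's share request,
# even though it points at the very same source table:
EMR_DB = "sales_dataset_shared_e5f6g7h8"

# MWAA must be configured with *its* database name to check whether the
# partition has landed before it launches the EMR cluster...
response = glue.get_partitions(DatabaseName=MWAA_DB, TableName="orders")
partition_ready = any(
    p["Values"] == ["2023-11-01"] for p in response["Partitions"]
)
# ...while the Spark job on EMR must be configured with EMR_DB instead.
```

Two names in two config files for one logical table, multiplied by every dataset and every consumer role.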

Describe the solution you'd like

One option would be to submit a share request for multiple consumer roles at the same time, and to be able to modify an existing share request to add or remove consumer roles, so that only one resource link is created.

dlpzx commented 1 year ago

Hi @zsaltys, I like the solution that you propose. Instead of changing the LakeFormation/Glue sharing, we would add more requesters to the share request. This alternative is far easier to implement 🚀 Do you plan to take on this task?

zsaltys commented 10 months ago

@dlpzx I'm glad you approve of the idea. In theory I could pick this up, but I'm currently working on a few more important things. It would be nice if this could be considered for the 2.2 release.

@noah-paige what are your thoughts on this feature?

dlpzx commented 10 months ago

Hi @zsaltys, definitely an interesting feature. We have added it as a stretch goal for v2.2; if we cannot include it there, it will be prioritized for v2.3. We are open to contributions and can even co-implement it together. Whatever suits you best, but we will try to include it as soon as possible.

zsaltys commented 9 months ago

@dlpzx @noah-paige I've been thinking more about this, and I no longer think that supporting multiple roles on a share is the best way to solve this problem. The REAL problem I'm trying to solve is that every share creates a new unique DB identifier. This becomes very annoying because a single application consuming 5 tables that belong to 5 other datasets ends up with 5 different DB names, and if you need them for 2 roles you have 10 DB names in total. Keeping track of all these DB names and copying them out of data.all is inconvenient, and it is very easy to make a mistake because you cannot tell from a config file whether the DB name you are using is actually the RIGHT one.

Also, I do like the fact that every role gets a separate share request, because that documents who requested the share, when, why, and so on. All of that is important.

And even if we support multiple roles on a share, it won't fully solve our current problem: instead of 10 unique DB names I would have 5. In our case, a data producer team has 5 buckets holding their datasets but only 1 database for all of them. This works because each dataset filters the database and only shows the tables that match the dataset's bucket name.
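For illustration, here is a rough sketch of that bucket-based filtering. This is an assumed behavior written against the plain Glue API, not data.all's actual code:

```python
import boto3

glue = boto3.client("glue")

def tables_for_dataset(database: str, dataset_bucket: str) -> list[str]:
    """List the tables in `database` whose data lives under `dataset_bucket`."""
    prefix = f"s3://{dataset_bucket}/"
    matching = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        for table in page["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "")
            if location.startswith(prefix):
                matching.append(table["Name"])
    return matching
```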

What developers really want when making share requests is to NOT have N different unique DB identifiers that the config alone cannot prove are pointing at the right thing! So this is the REAL problem we want to solve and find a solution for, and I don't think that multiple roles per share will really solve it.

One option I'm thinking about is to reuse the actual DB name that the dataset uses. Say I published a dataset and my Glue DB name is foo_bar. If envA asks for a share to it for roleA, that would create a DB with the exact same name foo_bar on envA, and if roleB then asks for a share, the DB name stays foo_bar. Basically, the shares would reuse the same DB! It would work like this: if the DB does not exist, we create it on the environment; if it already exists, we only create the resource link table; and if that also exists, we only update the permissions in LF! When removing a share, we remove only the resource link (or the whole DB if we are the last known user of this share); otherwise only the permissions get removed!
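A minimal sketch of that create-or-reuse step, assuming a plain boto3 Glue client in the target account; the function name is invented here and this is not data.all's actual sharing code:

```python
import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")  # client in the target environment's account

def ensure_shared_database(db_name: str) -> None:
    """Create the shared DB only if it does not already exist; otherwise reuse it."""
    try:
        glue.get_database(Name=db_name)  # DB exists: reuse it as-is
    except ClientError as error:
        if error.response["Error"]["Code"] != "EntityNotFoundException":
            raise
        glue.create_database(DatabaseInput={"Name": db_name})
```

The same check-then-create pattern would repeat one level down for the resource link table, and the LF permission grants would run on every request.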

This approach of course makes the implementation a bit messier, but it would fully solve the real problem. We could also go a different way: what if, during a share, we allowed users to choose an existing database in which to create the resource links? We could even let users create new databases directly in data.all. That way, if there are 5 different shares using 5 different databases, the developer could bundle all of them into a single DB on his environment for his particular application. The advantage of this approach would perhaps be a cleaner implementation.

Let me know what you think :)

dlpzx commented 8 months ago

Hi @zsaltys, I completely understand the pain point and the proposal of having a single database in the target account and a single resource link per shared table. If I remember correctly, @tvancasteren and his team also found it challenging to manage so many different databases in the target account.

By listing advantages and disadvantages I do not intend to discourage the implementation of this feature. My intention is to minimize the disadvantages in our design.

Advantages

Disadvantages

Other alternatives

The actual PAIN POINT root cause is that:

We already discussed having multi-bucket datasets. The main issue is that we would need to handle multiple bucket policies from a single dataset.

The other route I thought we could take is, instead of grouping more things under the Dataset umbrella, to do the opposite and break the Dataset down into smaller components. We would define 2 types of datasets: "S3-Dataset" and "Glue dataset".

dlpzx commented 8 months ago

⚠️ ⚠️ [WIP]

Cross-account Design

Table approve share

We would need to modify the `backend.dataall.modules.dataset_sharing.services.share_processors.lf_process_cross_account_share.ProcessLFCrossAccountShare.process_approved_shares` function to perform the following:

1. NO CHANGES: grant ALL permissions to the pivotRole for the source database in the source account.
2. Get the share principals (requester IAM role and QS groups) and build the shared DB name: a name like `<source_databasename>-shared`, adding the `shared` suffix to indicate that it is shared, but without the unique identifier.
3. Create the shared database in the target account if it doesn't exist. We could split creating the database from granting permissions to the principals: creation happens once, while granting permissions happens on every request.
4. For each shared table (the resource-link step is sketched after this list):
   - a) NO CHANGES: update its status to SHARE_IN_PROGRESS with Action Start.
   - b) NO CHANGES: check that the share item exists in the Glue catalog; raise an error if not and flag the share item status as failed.
   - c) ** NO CHANGES: grant the external (target) account access to the table: create the RAM invitation and revoke_iamallowedgroups_super_permission_from_table.
   - d) *** NO CHANGES: accept the pending RAM invitation.
   - e) Create the resource link for the table in the target account: we need to change this so it is created only if it does not already exist.
   - f) NO CHANGES: grant permission on the table to the requester team IAM role in the source account.
   - g) Grant permission on the resource link table to the requester team IAM role in the target account: in principle NO CHANGES, but we need to review.
   - h) NO CHANGES: update the share item status to SHARE_SUCCESSFUL with Action Success.
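A sketch of step 4e under the same assumptions (plain boto3 against the target account, invented function name): the resource link is created only when it is missing, which keeps repeated share requests idempotent:

```python
import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")  # client in the target account

def ensure_resource_link(db: str, table: str,
                         source_account: str, source_db: str) -> None:
    """Create a Glue resource link to the shared source table, exactly once."""
    try:
        glue.get_table(DatabaseName=db, Name=table)  # link already exists
        return
    except ClientError as error:
        if error.response["Error"]["Code"] != "EntityNotFoundException":
            raise
    glue.create_table(
        DatabaseName=db,
        TableInput={
            "Name": table,
            # A TableInput with TargetTable creates a resource link
            # pointing at the table shared from the source account.
            "TargetTable": {
                "CatalogId": source_account,
                "DatabaseName": source_db,
                "Name": table,
            },
        },
    )
```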

Table revoke share

For each revoked request item (table):

- a) NO CHANGES: update its status to REVOKE_IN_PROGRESS with Action Start.
- b) NO CHANGES: check that the item exists in the Glue catalog; raise an error if not and flag the item status as failed.
- c) * Revoke the table resource link: undo the grant of permission on the resource link table to the team role in the target account. In principle NO CHANGES, but we need to review.
- d) Revoke source table access: undo the grant of permission on the table to the team role in the source account (and for the QS group if no other shares exist for the table). Make this behavior conditional (`if not other_table_shares_in_env:`): revoke source table access only when there are no more shares on this table. It is a clean-up action, and we need to be careful with concurrent shares.
- e) Delete the resource link table: as in d), make it conditional (a sketch of this conditional clean-up follows the footnotes below).
- f) ** revoke_external_account_access_on_source_account: in principle NO CHANGES because it already checks for existing shares, but we need to review.
- g) NO CHANGES: update the share item status to REVOKE_SUCCESSFUL with Action Success.

** 💡 Even if no changes are needed, this is our chance to use the latest features of LF v3 that allow sharing directly with external principals instead of sharing with the root account.

*** 💡 Even if no changes are needed, this is our chance to take advantage of the AWS Organizations integration with LF that simplifies RAM workflows.
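Here is a rough sketch of the conditional clean-up in revoke steps d) and e). The count of other active shares would come from data.all's share metadata; it is stubbed here as a plain argument, and the function name is invented:

```python
import boto3

glue = boto3.client("glue")  # client in the target account

def revoke_table_share(db: str, table: str, other_active_shares: int) -> None:
    """Clean up one revoked item; skip deletion while other shares remain."""
    if other_active_shares > 0:
        # Another consumer still uses this table: remove only this share's
        # Lake Formation grants (not shown) and keep the resource link.
        return
    glue.delete_table(DatabaseName=db, Name=table)  # drop the resource link
    # If that was the last resource link, drop the now-empty shared DB too
    # (one page of results is enough for this sketch).
    if not glue.get_tables(DatabaseName=db)["TableList"]:
        glue.delete_database(Name=db)
```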


Same-account Design

If we leave it as it is, we would create a "-shared" database in the same account and grant permissions on that database. Another option is to grant permissions on the original database and avoid creating the "-shared" database.

The advantage of having a separate DB is that it is easier to clean up shares, and it protects the original database. ❓ What do you think?

Decision

We have discussed it internally and we are inclined towards implementing a -shared database. In addition to the advantages mentioned above, a shared database makes same-account and cross-account sharing almost identical, which simplifies code management.


Backwards compatibility

1. Compatibility of existing shares and new shares (even on the same database)

Scenario: a new share from the same target and source accounts is filed for the same database but for other principals.

Theoretically, this should not present any issue, as each share works on a different target DB. We just need to ensure that the grant/revoke functions for external account (target account) table access, which send the RAM invitations, are unaffected. This opens the door to additional enhancements in error handling in the sharing managers.

2. Compatibility of existing shares with new added items/revoked items to the share request

Scenario: Items are approved or revoked for an existing share after the code changes have happened.

We have different migration options [at this moment I am listing every possibility regardless of its implications]:

1. Forced migration on all share requests: when we upgrade the code, we update all share requests. We revoke everything and re-create it with the new sharing code. Downstream applications need to update the database name they use and point at the new database.
   - PROS: all shares are updated and handled the same way at once.
   - CONS: it can cause issues and frustration, because downstream application owners might only find out about the change when their applications start failing.

2. Forced migration on update (lazy migration): when new items need to be processed, data.all checks whether the share is an old one. If so, it revokes all tables from that share and re-creates the share with the new sharing design (a detection sketch follows this list).
   - PROS: it migrates to the new sharing mechanism only when needed; untouched shares keep working, while new and updated ones always use the new code.
   - CONS: it forces users to upgrade, which means downstream application owners need to update their code.

3. Revoke and re-create (manual migration): the data.all code ensures that old shares keep the old database-shared-uri even when new items are added or revoked. Users who want their existing consumption applications to use database-shared must revoke and delete the share and create a new one.
   - PROS: existing consumption applications can remain unchanged, and users decide whether or not to adopt the new sharing mechanism.
   - CONS: we maintain 2 versions of sharing (more code, more errors), and some shares might stay on the old path forever, which means we can never remove the duplicated code.
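As a sketch of how option 2 could tell old shares from new ones: the naming rule below is purely an assumption for illustration (old per-share databases end in a unique share identifier, new ones end in the plain shared suffix), and the real detection would live in data.all's sharing service:

```python
import re

# Assumed convention: old DBs look like "<db>_shared_<uid>", new ones "<db>-shared".
OLD_SHARED_DB = re.compile(r"shared_[0-9a-f]{8}$")

def needs_migration(shared_db_name: str) -> bool:
    """True when a share still points at an old per-share database."""
    return bool(OLD_SHARED_DB.search(shared_db_name))

# An old-style name triggers the revoke-and-recreate path; a new one does not.
assert needs_migration("sales_dataset_shared_a1b2c3d4")
assert not needs_migration("sales_dataset-shared")
```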

❓ ❓ Can you think of other ways of handling existing shares? From the options above, what is your opinion?

Decision

After our internal discussions, we favor option 3 because it has the lightest impact on customer operations. We picture the migration process as follows:

One cool idea that could be reused to communicate new features or maintenance actions would be to add a maintenance UI view or window where we can post messages and info about the current status of the platform, etc.

[Update]

TejasRGitHub commented 7 months ago

@dlpzx, for the "Same-account Design", I agree that having a "-shared" DB would be helpful, as it would consolidate all shares (cross-account and same-account) and give a unified way of dealing with shares.

Regarding the "Compatibility of existing shares with new added items/revoked items to the share request" -> I think option 3 makes sense as it cause minimal impact. but as you said this will make us maintain two types of tables sharing. I also like option 2 with an addition of providing a warning or info text on the share UI , saying that the database name will be changing so please do the update in your downstream applications after the sharing is complete. Thus any new shares will be with the new logic and they can keep using the old shared db names till the time they decide to modify/create new share items.

I am curious to know more about the maintenance UI view/window and where we could add it.

noah-paige commented 7 months ago

Added a new issue to track the maintenance UI view: #998. It is out of scope for this issue/PR, but a good-to-have for an upcoming release.