data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

tag-based data sharing #186

Open dlpzx opened 1 year ago

dlpzx commented 1 year ago

Is your feature request related to a problem? Please describe. I am a data consumer who requires access to a number of databases and tables tagged as "confidentiality = open". Currently I need to open a share request for each database and request access to its tables (using LF named-resource sharing under the hood).

Describe the solution you'd like I would like an easier way to request access to all those "tagged" resources at once.

Describe alternatives you've considered There are 3 main points to consider: who is responsible for managing the tags? How do we define which tags are granted to a group? How do we define which tags grant access to a dataset?

  1. Creation and ownership of tags. Options: --> Tenants or organization admins: a controlled and limited set of LF-Tags, managed centrally and created in all environments. Easier to implement and manage. --> In the environment or in the glossaries: more flexible, but more complex to implement and manage.
  2. Granting groups permission to a tag. Options: --> When inviting a team, we assign LF-Tags to it. --> Teams request access to an LF-Tag from the data.all catalog.
  3. Granting access to datasets with tags. --> The dataset owner team is responsible for tagging the dataset with LF-Tags, but they should have visibility of the teams that have access to those tags.

Data.all LF-tags initial design

It is based on the blog post: https://aws.amazon.com/blogs/big-data/securely-share-your-data-across-aws-accounts-using-aws-lake-formation/

SETUP

  1. Tenants define the list of LF-Tags in the tenant window. LF-Tag information is stored in a new RDS table, “lftags”.
  2. When a new environment is created, all LF-Tags are created in it. When a new tag is added, ALL environment stacks are updated to create the new tag.
  3. When a new environment is created, the LF-Tags of all pre-existing environments are shared with the new environment account.
  4. Conversely, the tags created in the new environment are shared with all pre-existing environments.
  5. When a new environment is created, the DATA THROUGH LF-Tags of all pre-existing environments is shared with the new environment account (grant data permissions to the consumer account).
  6. Conversely, the DATA THROUGH LF-Tags created in the new environment is shared with all pre-existing environments (grant data permissions to the consumer account).
  7. The Glue Catalog settings policy must be updated in all environments to include all environment accounts.
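As a rough illustration of setup steps 3 to 6, the sketch below computes which cross-account shares would be needed when a new environment is onboarded. It is plain Python with no AWS calls; `LFTag`, `TagShare` and `plan_new_environment` are hypothetical names that do not exist in data.all, and the real grants would still have to be issued through Lake Formation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LFTag:
    """One tenant-defined LF-Tag (hypothetical model of a row in "lftags")."""
    key: str
    values: Tuple[str, ...]


@dataclass(frozen=True)
class TagShare:
    """A planned share of one tag from an owner account to a consumer account."""
    owner_account: str
    consumer_account: str
    tag_key: str


def plan_new_environment(
    new_account: str,
    existing_accounts: Sequence[str],
    tags: Sequence[LFTag],
) -> List[TagShare]:
    """Return the shares needed in both directions for a new environment."""
    shares: List[TagShare] = []
    for account in existing_accounts:
        for tag in tags:
            # Pre-existing environment shares its tag with the new account.
            shares.append(TagShare(account, new_account, tag.key))
            # And the other way around.
            shares.append(TagShare(new_account, account, tag.key))
    return shares
```

With N pre-existing environments and T tags this plans 2 x N x T shares, which is one reason a centrally managed, limited tag list is easier to operate.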

image

Grant Access to Team

[Image: image.png]

From data.all, a particular TEAM gets access to an LF-Tag. How is it granted, and by whom? Alternatives:

This information is stored in the RDS table “lftagspermissions”, with columns: group/environment/awsaccount/tag/value. After the TEAM is granted access to an LF-tag=value through one of the above alternatives:

Tag Dataset

[Image: image.png]

The datasets are tagged in data.all by the DATASET OWNERS (ownership and responsibility stay with the owners). *Important: they need to know who has access to that LF-tag=value. Here we need to list the groups granted that LF-tag=value (accessible by querying the table “lftagspermissions”).
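To make the “lftagspermissions” lookup concrete, here is a minimal sketch of the table and of the query a dataset owner would need before tagging. It uses an in-memory SQLite database purely for illustration (the draft places the table in RDS); the column names come from the draft above, while the primary key and the helper names `create_permissions_table`, `grant_tag` and `groups_with_access` are assumptions.

```python
import sqlite3


def create_permissions_table(conn: sqlite3.Connection) -> None:
    # "group" is an SQL keyword, so it is quoted as an identifier.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS lftagspermissions (
            "group"     TEXT NOT NULL,
            environment TEXT NOT NULL,
            awsaccount  TEXT NOT NULL,
            tag         TEXT NOT NULL,
            value       TEXT NOT NULL,
            PRIMARY KEY ("group", environment, tag, value)
        )
        """
    )


def grant_tag(conn, group, environment, awsaccount, tag, value) -> None:
    # INSERT OR IGNORE keeps repeated grants of the same tag=value idempotent.
    conn.execute(
        "INSERT OR IGNORE INTO lftagspermissions VALUES (?, ?, ?, ?, ?)",
        (group, environment, awsaccount, tag, value),
    )


def groups_with_access(conn, tag, value):
    """List the groups that already hold the given LF-tag=value."""
    rows = conn.execute(
        'SELECT DISTINCT "group" FROM lftagspermissions '
        'WHERE tag = ? AND value = ? ORDER BY "group"',
        (tag, value),
    )
    return [row[0] for row in rows]
```

A dataset owner about to tag a dataset with confidentiality=open could call `groups_with_access(conn, "confidentiality", "open")` to see which teams would gain access as a result.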

Once the dataset is tagged:

image

Additional context This is only a draft, please feel free to comment

dlpzx commented 1 year ago

Check out the latest LF announcements: https://aws.amazon.com/about-aws/whats-new/2022/11/cross-account-sharing-direct-iam-principals-sharing-organization-units-lf-tbac-lake-formation/

anmolsgandhi commented 1 month ago

We have been seeing demand signals for this feature. As a result, we are scoping it as part of v2.7. The scope for this release will involve initiating the ideation process, adjusting the design to incorporate the latest LF capabilities with respect to tag-based sharing, and potentially starting the implementation if time permits. The description and problem statement may evolve as an outcome of our discovery as well. However, it's important to note that the actual release of this feature will likely not be part of the v2.7 timeframe, but rather targeted for a future release @SofiaSazonova @dlpzx

anmolsgandhi commented 2 weeks ago

Bumping this feature request to the v2.7 roadmap. We've received some strong demand signals from customers for fine-grained access control functionality. The plan for v2.7 is to re-evaluate the work that's been done so far as part of this existing issue and finalize the design. Additionally, we'll be opening a separate issue specifically for data filters, covering column-level and row-level data access control which will be part of v2.7 as well.

zsaltys commented 2 weeks ago

@anmolsgandhi @dlpzx let's update the design as it seems the original ticket was created back in 2022. I'd really like to understand what and how we plan to build here. It basically sounds to me more like a feature to request access in bulk. I think perhaps I would suggest a different direction here...

Let's figure out how we can allow data.all users to request access to multiple datasets at once. For example, each dataset/table in the catalog UI could have a checkbox to select multiple things. There can also be a button to select ALL. Then, if we make LF tags part of the OpenSearch index, I could search datasets by a TAG, click select all and then do a bulk submit to request access all at once... We obviously also want to make sure that it's easy for the user and that all the requests are immediately submitted and don't become drafts.
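A minimal sketch of that flow, assuming an in-memory list in place of the OpenSearch index: filter catalog items by tag, select all matches, and create one request per item directly in a submitted state. `CatalogItem`, `ShareRequest`, `search_by_tag`, `bulk_request` and the status string are illustrative names only, not data.all APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogItem:
    uri: str
    name: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class ShareRequest:
    item_uri: str
    requester_group: str
    status: str


def search_by_tag(catalog: List[CatalogItem], key: str, value: str) -> List[CatalogItem]:
    """Stand-in for a tag search against the catalog index."""
    return [item for item in catalog if item.tags.get(key) == value]


def bulk_request(items: List[CatalogItem], group: str) -> List[ShareRequest]:
    """Create one request per selected item, submitted immediately (no drafts)."""
    return [ShareRequest(item.uri, group, "Submitted") for item in items]
```

"Select all" then amounts to passing the full result of `search_by_tag` to `bulk_request`.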

anmolsgandhi commented 2 weeks ago

@zsaltys The goal for v2.7 of this feature is to begin investigating the latest developments with AWS Lake Formation's tag-based access controls. This will likely involve updating the design and understanding what that means for data.all. The plan is to align and finalize the design in v2.7 for implementation in future releases.

A separate issue will be opened to track the enhancement and extension of the current sharing functionality to include column-level and row-level access control in Lake Formation. Based on conversations, this has been one of the top requests from potential customers. We will likely prioritize that as a feature enhancement for completion in v2.7 as well.

Thanks for the suggestions on potential functionality and user experience. We will circle back when we pick this up. cc: @SofiaSazonova @dlpzx @noah-paige