data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0
220 stars 77 forks source link

Redshift Data Sharing #955

Open anmolsgandhi opened 6 months ago

anmolsgandhi commented 6 months ago

Description:

Enable seamless data integration with Redshift as a new data source in ‘data.all’. This feature enhances collaboration by allowing users to easily publish, discover and share Redshift data within the data.all platform. Users can securely configure Redshift instance, streamlining the process of making Redshift datasets accessible.

Details:

Adding Redshift Instance and Publishing Tables

Tables Available for Discovery

Self-service Share Process for Redshift Data Sharing

Benefits:

@dlpzx

dlpzx commented 3 months ago

Design

Assumptions

HLD and User experience

image

Following the numeration above:

  1. Outside of data.all, Database Admin Teams manage Redshift cluster users.
    1. For data producers - They create a Redshift user (IAM:user) for their data producers that allows IAM federation or store the credentials of a Redshift database user in Secrets Manager
    2. For data consumers - They can create any type of user
  2. Outside of data.all, Database Admin Teams in the data producer and in the data consumer clusters create an IAM:user in Redshift for the data.all IAM pivot role
  3. Outside of data.all, Data producers work in Redshift and create tables
  4. In data.all UI, Data producers create a data.all Connection
    1. When creating a connection, users need to introduce:
      1. The Redshift user (IAM:user) IAM role or SecretArn created by their db admins
      2. Environment where the cluster is
      3. Namespace/cluster id
      4. A data.all Team that owns the connection. Only members of the Team can use it. (similar to consumption IAM roles)
    2. Connections are going to be used to AUTHORIZE the import of data and maybe in next steps to open Redshift QueryEditorV2. There are different types of Redshift users:
      1. Federated users (the IAM role is stored). The role created has permissions to be used as federated user in Redshift by data.all.
      2. AWS Secrets Manager (the secretArn is stored). Customers will need to tag the secret in order for data.all to be able to access it.
      3. NEXT STEPS - IAM Identity Center - it cannot be used at the moment for the publication of data.
      4. NEVER - username and password. From data.all we want to avoid securing passwords in transit.
  5. In data.all UI, Data producers import a Redshift dataset in data.all specifying:
    1. Select the Environment and the Connection to use for import
    2. The Team that owns the Connection also will own the Dataset
      1. Introduce pattern of tables to be imported - for example, import all tables whose name starts with view-* → We can implement this feature when we add table to the datashare.
  6. Under-the-hood, when a dataset is imported, data.all creates a datashare between Redshift and the Glue Catalog using the authorization of the Connection.
  7. In data.all UI, Data producers can click on “Sync tables” in the imported dataset as we do with S3/Glue datasets. Tables appear in data.all and are indexed in the central catalog. Users can ListDatasets, which lists S3 and Redshift datasets. Filters allow to select the type
  8. Under-the-hood, when the data producer clicks sync-tables, data.all reads from the glue database created as part of the datashare from Redshift to Glue Catalog
  9. In data.all UI, data consumers can discover RS tables and datasets in Catalog
  10. In data.all UI, data consumers create a data.all Redshift Consumption Role that stores:
    1. Redshift role name
    2. Namespace/cluster that the role belongs to
    3. Environment where the cluster is
    4. Team that owns the Redshift role
  11. In data.all UI, data consumers can create a share request by selecting the dataset or tables. They submit the request
    1. The principal of the share request will be a Redshift consumption Role
  12. In data.all UI, data producers approve the request
  13. Under-the-hood, data.all creates a datashare in the data producers cluster/namespace
  14. Under-the-hood, data.all associates the datashare to the data consumers cluster and grants permissions to the redshift role
  15. Data consumers will access the data through:
dlpzx commented 3 months ago

UPDATED BASED ON COMMENTS To implement the design I will open multiple pull requests (list might vary)

zsaltys commented 3 months ago

@dlpzx I've read through the design and watched your video as well (it was very helpful as it answered some of my questions).

Overall I don't see any big problems but I do have some concerns.

1) Addition of a new UI "Warehouses" to manage Redshift connections.I find this UI a bit awkward. My first instinct that this should be a TAB under an environment and not a separate UI outside an environment. Especially because you cannot have a connection that is not part of an environment. I think this would also simplify creating connections because then the environment is already pre-defined and you can also make the connection be owned by the same team that is creating the connection.

I would also want to make sure that there's a consistent user experience when registering consumer roles or redshift consumer connections. Even today I find it weird that we register consumer roles in "Teams" tab under environments. I don't think that's intuitive. Perhaps with the addition of redshift connections we can instead add a new tab on the environment "Consumer Connections" or smth similar where you can manage your consumer IAM roles and redshift consumer connections etc..

Also I don't really feel that this new type "Warehouses" is actually going to be reusable for anything else other than Redshift so I think it's misleading.

I would like to hear your arguments why you think it would be much better to put this as a new UI on the left main bar vs making it a new tab on the environment.

2) For sure make Redshift modular so that it can be fully disabled as for example we don't use redshift at all and don't want our users to be confused.

3) We need to check security. Absolutely make sure to scan all infrastructure with checkov and that the permissions are as tight as possible.

4) I'd really like to see part 2 of your video to understand better how Redshift consumer connections should work.

Thank you!

anushka-singh commented 3 months ago

I really like how descriptive the design is. Answered most of my questions too! I have a few pending though:

  1. Will a dataset be able to have s3, glue and redshift data? Will I be able to create such a dataset?
  2. Will the share UI be the same as the one being used today?
  3. Will all the other modules like QS, Sagemaker, Worksheets be available to use for Redshift too?
  4. Why are we calling it "Warehouses"? How is it any different from a data store like Glue or S3?
  5. Can you provide more information on how data consumers will interact with Redshift data using BI tools and SQL clients? Will consumers have to set up anything extra on their end to be able to use these tools?
dlpzx commented 3 months ago

Thanks @zsaltys and @anushka-singh for the input, you went straight to the tricky points.

DESIGN UPDATED WITH THE FEEDBACK!