Description:

Enable seamless data integration with Redshift as a new data source in ‘data.all’. This feature enhances collaboration by allowing users to easily publish, discover and share Redshift data within the data.all platform. Users can securely configure a Redshift instance, streamlining the process of making Redshift datasets accessible.

Details:

Adding Redshift Instance and Publishing Tables

Users initiate the process by selecting “Create Dataset” and choosing Redshift from the dropdown menu.
The interface guides users through a secure credential input, ensuring a streamlined and secure configuration process.
Once configured, the dataset owners can select specific tables to publish to the ‘data.all’ catalog, ensuring a controlled inclusion of Redshift data.

Tables Available for Discovery

Cataloged Redshift tables automatically become part of the ‘data.all’ catalog, visible to users exploring datasets within the platform.
The catalog provides detailed metadata for each table, facilitating a comprehensive understanding of available data.
Users can navigate the ‘data.all’ UI to effortlessly discover and explore Redshift tables.
Dataset owners can edit metadata for each table, such as its description and tags.

Self-service Share Process for Redshift Data Sharing

Consumers interested in specific Redshift tables initiate the share process by selecting the desired dataset.
Owners of the shared Redshift tables within data.all Datasets receive access requests, with an easy-to-use interface for managing permissions and approvals.
Upon approval, the shared Redshift data becomes dynamically accessible to consumers, maintaining a consistent and user-friendly experience.

Benefits:

Additional Data Source Integration: The added capability of Redshift as a new data source enhances flexibility, enabling users to integrate diverse data sources beyond S3, expanding the platform’s utility.
User-Friendly Configuration: A guided process helps users connect Redshift instances using secure credentials.
Efficient Discovery: Automated cataloging promotes effortless exploration of Redshift tables within the ‘data.all’ catalog.
Streamlined Sharing Workflow: The self-service share process maintains simplicity and consistency across different types of data, allowing users to request and access Redshift data seamlessly as they do with S3 data.

@dlpzx

Design

Assumptions

Redshift clusters/namespaces are created and maintained by DevOps teams outside of data.all
Database admin teams manage users in their clusters/namespaces outside of data.all
Data producers and consumers can access their clusters/namespaces with the access provided by the database admin teams.
Data producers create tables in Redshift outside of data.all
Data.all requires a Redshift user, either of the type IAM:user or a database user with credentials stored in AWS Secrets Manager, for the data producers that are going to publish data (in the diagram this is the basis for Authorization 1). Data.all needs permissions to use the IAM role or to access the secret. This user needs permissions to create datashares.
Data.all requires a Redshift user of the type IAM:user for the data.all PivotRole in all accounts with a Redshift cluster. This user needs permissions to create datashares. In the diagram this is the basis for Authorization 2 and 3.
The principal of a data.all share request will be a Redshift role.
Data consumers register their Redshift roles in data.all as Redshift Consumption Roles. Database admins control which roles are created in Redshift and which roles are attached to which user/group. To isolate data.all access grants from other access grants, we recommend that database admins create dedicated Redshift roles. For example, suppose that for projectXYZ a group of Redshift users needs permissions to data in another cluster. The database admin should create a Redshift role DAProjectXYZ and attach it to the relevant roles/users/groups in Redshift. Data consumers should then register the role in data.all and request access to the data they need.
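As a rough illustration of the dedicated-role recommendation above, the sketch below builds the Redshift SQL a database admin might run for the projectXYZ example. The function name, the identifier check, and the user names are assumptions made for illustration; they are not part of data.all. Redshift users of the type IAM:user have names that need quoting, which this simplified check does not handle.

```python
from __future__ import annotations

import re

# Conservative pattern for unquoted identifiers (an assumption of this sketch;
# names such as "IAM:user" would need quoting and are rejected here).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name: str) -> str:
    """Reject anything that is not a plain identifier before building SQL."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe Redshift identifier: {name!r}")
    return name


def dedicated_role_statements(role_name: str, users: list[str]) -> list[str]:
    """Return SQL that creates a dedicated role and grants it to each user."""
    role = _check_identifier(role_name)
    statements = [f"CREATE ROLE {role};"]
    for user in users:
        statements.append(f"GRANT ROLE {role} TO {_check_identifier(user)};")
    return statements


if __name__ == "__main__":
    # Hypothetical users that need access for projectXYZ.
    for statement in dedicated_role_statements("DAProjectXYZ", ["alice", "bob"]):
        print(statement)
```

The role name is what the data consumer would later register in data.all as a Redshift Consumption Role; everything in the sketch happens outside of data.all.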

HLD and User experience

Following the numbering above:

Outside of data.all, Database Admin Teams manage Redshift cluster users.
1. For data producers - They create a Redshift user (IAM:user) that allows IAM federation for their data producers, or they store the credentials of a Redshift database user in Secrets Manager
2. For data consumers - They can create any type of user
Outside of data.all, Database Admin Teams in the data producer and in the data consumer clusters create an IAM:user in Redshift for the data.all IAM pivot role
Outside of data.all, Data producers work in Redshift and create tables
In data.all UI, Data producers create a data.all Connection
1. When creating a connection, users need to provide:
  1. The Redshift user (IAM:user) IAM role or SecretArn created by their db admins
  2. Environment where the cluster is
  3. Namespace/cluster id
  4. A data.all Team that owns the connection. Only members of the Team can use it. (similar to consumption IAM roles)
2. Connections are going to be used to AUTHORIZE the import of data and maybe in next steps to open Redshift QueryEditorV2. There are different types of Redshift users:
  1. Federated users (the IAM role is stored). The role created has permissions to be used as federated user in Redshift by data.all.
  2. AWS Secrets Manager (the secretArn is stored). Customers will need to tag the secret in order for data.all to be able to access it.
  3. NEXT STEPS - IAM Identity Center - it cannot be used at the moment for the publication of data.
  4. NEVER - username and password. From data.all we want to avoid securing passwords in transit.
In data.all UI, Data producers import a Redshift dataset in data.all specifying:
1. Select the Environment and the Connection to use for import
2. The Team that owns the Connection also will own the Dataset
  1. Provide a pattern for the tables to be imported - for example, import all tables whose names start with view-* → We can implement this feature when we add tables to the datashare.
Under-the-hood, when a dataset is imported, data.all creates a datashare between Redshift and the Glue Catalog using the authorization of the Connection.
In data.all UI, Data producers can click on “Sync tables” in the imported dataset as we do with S3/Glue datasets. Tables appear in data.all and are indexed in the central catalog. Users can ListDatasets, which lists S3 and Redshift datasets. Filters allow selecting the dataset type.
Under-the-hood, when the data producer clicks sync-tables, data.all reads from the Glue database created as part of the datashare from Redshift to Glue Catalog.
In data.all UI, data consumers can discover RS tables and datasets in Catalog
In data.all UI, data consumers create a data.all Redshift Consumption Role that stores:
1. Redshift role name
2. Namespace/cluster that the role belongs to
3. Environment where the cluster is
4. Team that owns the Redshift role
In data.all UI, data consumers can create a share request by selecting the dataset or tables, and then submit the request.
1. The principal of the share request will be a Redshift consumption Role
In data.all UI, data producers approve the request
Under-the-hood, data.all creates a datashare in the data producer's cluster/namespace
Under-the-hood, data.all associates the datashare with the data consumer's cluster and grants permissions to the Redshift role
Data consumers will access the data through:

BI tools: QuickSight, Tableau, Power BI, Qlik (JDBC/ODBC connections)
SQL clients: DBeaver, SQL Workbench (JDBC/ODBC connections)
ETL workloads in Redshift
Ad-hoc queries in Redshift Query Editor
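
The under-the-hood sharing steps above (create a datashare in the producer namespace, associate it with the consumer namespace, grant access to the Redshift role) can be sketched as the SQL statements that would be issued on each side. This is a minimal sketch, not the data.all implementation: the `ShareRequest` class, its field names, and the example values are hypothetical, it assumes both namespaces are in the same AWS account (cross-account sharing needs an additional authorization step), and it does not cover the Redshift-to-Glue Catalog datashare used at import time. It also shows how a table pattern such as view-* could be applied when tables are added to the datashare.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ShareRequest:
    """Hypothetical container for the inputs of one approved share request."""
    datashare: str           # name of the datashare to create
    schema: str              # schema that holds the shared tables
    table_pattern: str       # glob pattern, e.g. "view-*"
    producer_namespace: str  # namespace id of the producer
    consumer_namespace: str  # namespace id of the consumer
    consumer_database: str   # local database created from the datashare
    consumer_role: str       # Redshift role registered as consumption role


def select_tables(tables: list[str], pattern: str) -> list[str]:
    """Keep only the tables whose names match the glob pattern."""
    return sorted(t for t in tables if fnmatchcase(t, pattern))


def producer_statements(req: ShareRequest, tables: list[str]) -> list[str]:
    """SQL run in the producer namespace: create and populate the datashare."""
    stmts = [
        f"CREATE DATASHARE {req.datashare};",
        f"ALTER DATASHARE {req.datashare} ADD SCHEMA {req.schema};",
    ]
    for table in select_tables(tables, req.table_pattern):
        # Table names are quoted because names such as view-sales contain a hyphen.
        stmts.append(
            f'ALTER DATASHARE {req.datashare} ADD TABLE {req.schema}."{table}";'
        )
    stmts.append(
        f"GRANT USAGE ON DATASHARE {req.datashare} "
        f"TO NAMESPACE '{req.consumer_namespace}';"
    )
    return stmts


def consumer_statements(req: ShareRequest) -> list[str]:
    """SQL run in the consumer namespace: mount the datashare, grant the role."""
    return [
        f"CREATE DATABASE {req.consumer_database} FROM DATASHARE {req.datashare} "
        f"OF NAMESPACE '{req.producer_namespace}';",
        f"GRANT USAGE ON DATABASE {req.consumer_database} "
        f"TO ROLE {req.consumer_role};",
    ]


if __name__ == "__main__":
    request = ShareRequest(
        datashare="projectxyz_share",
        schema="public",
        table_pattern="view-*",
        producer_namespace="producer-namespace-id",
        consumer_namespace="consumer-namespace-id",
        consumer_database="projectxyz_db",
        consumer_role="DAProjectXYZ",
    )
    available = ["view-sales", "orders", "view-costs"]
    for sql in producer_statements(request, available) + consumer_statements(request):
        print(sql)
```

Keeping the producer-side and consumer-side statements in separate functions mirrors the two authorizations in the design: each set would be executed with the data.all user of the corresponding cluster/namespace.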

UPDATED BASED ON COMMENTS To implement the design I will open multiple pull requests (list might vary)

[X] Done Pre-reqs: Refactor current datasets into S3 Datasets and Base datasets (#1123)
[x] In-Progress Pre-reqs: Refactor current dataset sharing into S3 sharing and base sharing (#1283)
[ ] In-Progress New Redshift Dataset module using Base datasets + publish to catalog logic. Introduce Redshift Connections
[ ] Not Started New Redshift data sharing module using base sharing

@dlpzx I've read through the design and watched your video as well (it was very helpful as it answered some of my questions).

Overall I don't see any big problems but I do have some concerns.

1) Addition of a new UI "Warehouses" to manage Redshift connections. I find this UI a bit awkward. My first instinct is that this should be a TAB under an environment and not a separate UI outside an environment. Especially because you cannot have a connection that is not part of an environment. I think this would also simplify creating connections because then the environment is already pre-defined and you can also make the connection be owned by the same team that is creating the connection.

I would also want to make sure that there's a consistent user experience when registering consumer roles or redshift consumer connections. Even today I find it weird that we register consumer roles in "Teams" tab under environments. I don't think that's intuitive. Perhaps with the addition of redshift connections we can instead add a new tab on the environment "Consumer Connections" or smth similar where you can manage your consumer IAM roles and redshift consumer connections etc..

Also I don't really feel that this new type "Warehouses" is actually going to be reusable for anything else other than Redshift so I think it's misleading.

I would like to hear your arguments why you think it would be much better to put this as a new UI on the left main bar vs making it a new tab on the environment.

2) For sure make Redshift modular so that it can be fully disabled as for example we don't use redshift at all and don't want our users to be confused.

3) We need to check security. Absolutely make sure to scan all infrastructure with checkov and that the permissions are as tight as possible.

4) I'd really like to see part 2 of your video to understand better how Redshift consumer connections should work.

Thank you!

I really like how descriptive the design is. Answered most of my questions too! I have a few pending though:

Will a dataset be able to have s3, glue and redshift data? Will I be able to create such a dataset?
Will the share UI be the same as the one being used today?
Will all the other modules like QS, Sagemaker, Worksheets be available to use for Redshift too?
Why are we calling it "Warehouses"? How is it any different from a data store like Glue or S3?
Can you provide more information on how data consumers will interact with Redshift data using BI tools and SQL clients? Will consumers have to set up anything extra on their end to be able to use these tools?

Thanks @zsaltys and @anushka-singh for the input, you went straight to the tricky points.

@zsaltys Regarding point 1, initially I placed it inside environments, but then I questioned if we even needed to place a warehouse inside an environment - let's say you are using Snowflake and it is not linked to an AWS account. What we can do is to place it inside environments, because I agree that the user experience is nicer that way. But then if we need to link other Warehouses with non-AWS links, we can work on creating non-AWS-data.all Environments (something that opens the door to multi-cloud....). In short, happy to change it. 2 - absolutely, 3 - let's prioritize for 2.5, 4 - I have not recorded it yet, I have been focusing on #1123 for the last week. Please have a look
@anushka-singh thanks for the questions! I think you need to have a look at #1123 for the questions 1 and 2. The idea is to have a generic Dataset model and specific Dataset classes that inherit this model. Instead of adding functionalities to the existing Dataset module, we have opted to make it extensible. For question 2 - yes, very similar, but we need to check the details
For question 3, we would need to check case-by-case what the integration is: for QuickSight, how the data sharing works; for SageMaker, if there is any library to connect with a Redshift user or with IAM:role federation, then they can access the data. Worksheets depends on the Athena connectors; in this last case we would need to see whether it is worth it or whether we can open the RS Query Editor instead
I called it Warehouses with the idea of making it abstract to other warehousing technologies (also outside AWS)
For 5, most probably. I will add more details

DESIGN UPDATED WITH THE FEEDBACK!

data-dot-all / dataall

Redshift Data Sharing #955