Can we leverage any Data Source for Glue Tables behind the scenes?

pascalwhoop commented 2 years ago

Is it possible that we can leverage any of the sources defined by Glue? https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html

I.e. can we make an RDS DB Table available in the catalog through Glue? If not, what needs to happen to tap into this Glue abstraction in Data.All?

dlpzx commented 2 years ago

Hi @pascalwhoop, at the moment data must be stored in S3. A data.all Dataset consists of both an S3 bucket and a Glue database. We use S3 because it is the centerpiece of AWS data lakes, connecting multiple data sources.

For the specific case of RDS, you could implement something like step 3 of this blog post: https://aws.amazon.com/blogs/big-data/integrating-aws-lake-formation-with-amazon-rds-for-sql-server/. The other steps would be handled by data.all when creating or importing datasets.

That said, we are open to direct storage in other data sources; it is definitely something that raises interest. So, to the question: "what needs to happen to tap into this Glue abstraction in Data.All?"

Assumptions:

Data is already stored in an RDS DB in one of the environment AWS accounts
WE DO NOT HANDLE TABLE SHARING

User experience: As a data.all user with access to the RDS environment account and the RDS DB

I click on import dataset, specify the storage type as RDS, and give my RDS database as input
When the data.all object has been created, I can click on crawl dataset and sync tables. They are then available in the catalog

Implementation:

Modification of "Import dataset form" UI to add new input parameters (type of storage)
Modification of input of API and API import dataset code
Modification of RDS metadata database and the table for datasets
Modification of dataset stack (in cdkproxy)
Modification of crawler to allow RDS: https://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html
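
To make the API and crawler steps above a bit more concrete, here is a minimal sketch, not actual data.all code. `ImportDatasetInput`, `storage_type`, `glue_connection_name`, and `build_crawler_targets` are hypothetical names. Only the `S3Targets` / `JdbcTargets` shape follows the Glue `CreateCrawler` API, and the RDS branch assumes a Glue JDBC connection to the database already exists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for the new "type of storage" form field.
SUPPORTED_STORAGE_TYPES = ("S3", "RDS")


@dataclass
class ImportDatasetInput:
    """Hypothetical import-dataset API input extended with a storage type."""
    label: str
    storage_type: str
    # S3: bucket name. RDS: a JDBC include path such as "<database>/<schema>/%".
    location: str
    # Only needed for RDS: name of an existing Glue JDBC connection (assumption).
    glue_connection_name: Optional[str] = None


def build_crawler_targets(data: ImportDatasetInput) -> dict:
    """Build the Targets structure of a Glue crawler for the given storage type."""
    if data.storage_type not in SUPPORTED_STORAGE_TYPES:
        raise ValueError(f"Unsupported storage type: {data.storage_type}")
    if data.storage_type == "S3":
        return {"S3Targets": [{"Path": f"s3://{data.location}"}]}
    if not data.glue_connection_name:
        raise ValueError("RDS datasets need a Glue connection name")
    return {
        "JdbcTargets": [
            {"ConnectionName": data.glue_connection_name, "Path": data.location}
        ]
    }
```

The returned dictionary would be passed as the `Targets` argument of boto3's `glue.create_crawler(...)`; how that fits into the dataset stack in cdkproxy is left open here.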

Keep in mind that these are draft steps and could, of course, change. Maybe you can define the assumptions and user experience, and we can refine the implementation steps.

pascalwhoop commented 2 years ago

Hi @dlpzx thx for paying attention to this. I think your user experience is pretty spot-on.

> As a use case owner with an RDS hosted in my account, I want to expose some of my tables as data products to other teams easily

Particularly the experience would be

create new dataset in the catalog
select my environment
select my database
create dataset
wait for objects to be spun up
ability to select schema / tables to expose
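
The last step (selecting schemas / tables to expose) could map onto Glue crawler include paths. A minimal sketch, assuming an engine with schemas such as PostgreSQL or SQL Server, where a JDBC include path has the form `<database>/<schema>/<table>` and `%` acts as a wildcard. `jdbc_include_paths` and the shape of `selection` are hypothetical, not part of data.all.

```python
from typing import Dict, List


def jdbc_include_paths(database: str, selection: Dict[str, List[str]]) -> List[str]:
    """Turn a schema -> tables selection into Glue JDBC include paths.

    An empty table list means "expose every table in that schema".
    """
    paths: List[str] = []
    for schema, tables in sorted(selection.items()):
        if not tables:
            # "%" is the wildcard Glue accepts in place of a table name.
            paths.append(f"{database}/{schema}/%")
        else:
            paths.extend(f"{database}/{schema}/{table}" for table in sorted(tables))
    return paths


paths = jdbc_include_paths("salesdb", {"public": ["orders", "customers"], "audit": []})
# paths == ["salesdb/audit/%", "salesdb/public/customers", "salesdb/public/orders"]
```

Each path would become one entry in the crawler's JDBC targets. For MySQL, which has no schema level in the path, the format would be `<database>/<table>` instead.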

This is a separate topic, but I think it would also be good UX to show a "waiting screen" within the same journey while the data.all objects are created, instead of breaking the journey into 1) create dataset and 2) add data to it.

Assumptions-wise, I am not deep enough in the tech to understand how the RDS instance needs to be configured for the Glue crawler to be able to access the DB. I suppose there is some form of credentials handling that needs to happen?
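
On the credentials question: as far as the Glue API goes, a crawler reaches a JDBC source through a Glue connection, which carries the JDBC URL, the credentials (for example a Secrets Manager secret), and the VPC placement. A minimal sketch of such a connection definition follows. The function name and all argument values are hypothetical, and nothing here reflects how data.all would actually wire this up.

```python
from typing import List


def build_jdbc_connection_input(
    name: str,
    jdbc_url: str,
    secret_id: str,
    subnet_id: str,
    security_group_ids: List[str],
    availability_zone: str,
) -> dict:
    """Build the ConnectionInput structure for a Glue JDBC connection."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # e.g. "jdbc:postgresql://<host>:5432/<database>"
            "JDBC_CONNECTION_URL": jdbc_url,
            # Secrets Manager secret holding the DB username and password.
            "SECRET_ID": secret_id,
        },
        # The crawler runs inside this subnet so it can reach the RDS instance.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }
```

The result would be passed to boto3 as `glue.create_connection(ConnectionInput=...)`. Presumably the crawler's IAM role would also need permission to read the secret, and the security groups would have to allow traffic between Glue and the database.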

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

data-dot-all / dataall

Can we leverage any Data Source for Glue Tables behind the scenes? #103
