data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Can we leverage any Data Source for Glue Tables behind the scenes? #103

Closed pascalwhoop closed 2 years ago

pascalwhoop commented 2 years ago

Is it possible that we can leverage any of the sources defined by Glue? https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html

I.e. can we make an RDS DB Table available in the catalog through Glue? If not, what needs to happen to tap into this Glue abstraction in Data.All?

dlpzx commented 2 years ago

Hi @pascalwhoop, at the moment data must be stored in S3. A data.all Dataset consists of an S3 bucket plus a Glue database. We use S3 because it is the centerpiece of AWS data lakes, connecting multiple data sources:

image

For the specific case of RDS, you could implement something like step 3 of this blog post: https://aws.amazon.com/blogs/big-data/integrating-aws-lake-formation-with-amazon-rds-for-sql-server/. The other steps would be handled by the data.all creation or import of datasets.

That said, we are open to direct storage in other data sources; it is definitely something that raises interest. So, to the question: "what needs to happen to tap into this Glue abstraction in Data.All?"

Assumptions:

User experience: As a data.all user with access to the RDS environment account and the RDS DB

  1. I click on import dataset and specify the storage type as RDS and I give my RDS database as input
  2. When the data.all object has been created I can click on crawl dataset and sync tables. They are now available in the catalog
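To make the "crawl dataset" step more concrete, here is a minimal sketch of what crawling an RDS database through Glue could look like with boto3. It is not how data.all does it today (data.all currently crawls S3), and it assumes a Glue JDBC connection to the RDS instance already exists. The function names and all argument values are hypothetical; only `create_crawler`, `start_crawler` and the `JdbcTargets` structure come from the Glue API.

```python
def build_jdbc_crawler_request(crawler_name, role_arn, catalog_database,
                               connection_name, include_path):
    """Build the keyword arguments for glue.create_crawler with a JDBC target.

    include_path follows the Glue JDBC convention, for example
    "mydb/%" or "mydb/myschema/%", where % is a wildcard.
    """
    return {
        "Name": crawler_name,
        "Role": role_arn,                  # IAM role the crawler assumes
        "DatabaseName": catalog_database,  # Glue database that receives the tables
        "Targets": {
            "JdbcTargets": [
                {
                    "ConnectionName": connection_name,  # existing Glue connection (assumed)
                    "Path": include_path,
                }
            ]
        },
    }


def create_and_start_crawler(request):
    """Create the crawler and start it. Needs AWS credentials and boto3."""
    import boto3  # imported lazily so the request builder works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

Once the crawler run finishes, the RDS tables would show up as Glue tables in `catalog_database`, which is what a "sync tables" action could then pick up.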

Implementation:

Keep in mind that these are draft steps and could, of course, change. Maybe you can define the assumptions and user experience, and we can refine the implementation steps.

pascalwhoop commented 2 years ago

Hi @dlpzx thx for paying attention to this. I think your user experience is pretty spot-on.

> As a use case owner with an RDS database hosted in my account, I want to expose some of my tables as data products to other teams easily.

Particularly the experience would be

  1. create new dataset in the catalog
  2. select my environment
  3. select my database
  4. create dataset
  5. wait for objects to be spun up
  6. ability to select schema / tables to expose

This is a separate topic, but I think it would also be good UX to show a "waiting screen" within the same journey, so that I can wait for the data.all objects to be created instead of having the journey split into 1) create dataset and 2) add data to it.

Assumptions-wise, I am not deep enough in the tech to understand how the RDS instance needs to be configured for the Glue crawler to be able to access the DB. I suppose there is some form of credentials handling that needs to happen?
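On the credentials question, a rough sketch of the Glue side, assuming the usual setup: a Glue crawler reaches a JDBC source through a Glue connection, which carries the JDBC URL, the credentials (here a reference to a Secrets Manager secret) and the VPC placement needed to reach the database. The function names and all values are hypothetical; `create_connection`, `ConnectionInput` and the property keys are part of the Glue API.

```python
def build_jdbc_connection_input(name, jdbc_url, secret_id, subnet_id,
                                security_group_ids, availability_zone):
    """Build the ConnectionInput for glue.create_connection (JDBC type)."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            # Reference a Secrets Manager secret instead of inlining
            # USERNAME / PASSWORD in the connection itself.
            "SECRET_ID": secret_id,
        },
        # Network placement so Glue can reach the RDS instance in its VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(connection_input):
    """Register the connection in Glue. Needs AWS credentials and boto3."""
    import boto3  # imported lazily so the builder works without it

    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

On the RDS side this would presumably mean a database user for Glue, a security group that allows traffic from the Glue connection's security group, and an IAM role for the crawler that is allowed to read the secret. These are assumptions about a typical setup, not confirmed data.all requirements.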
