airbytehq / connector-contest

Contribute a connector to open-source Airbyte and win prizes!

Destination: S3 #144

Closed by blarghmatey 1 year ago

blarghmatey commented 1 year ago

The S3 destination is very useful as a component in building a data lake, but in order to make it usable in a lakehouse style architecture it is missing integration with the metastore for whichever query engine(s) power the SQL interface. For AWS that is likely to be the Glue data catalog, but the other targets would include the Apache Hive Metastore or Nessie. It would be massively helpful to integrate with the metastore as part of the S3 write/reset operations so that the table schema data in the metastore is kept synchronized with the schema that is written by Airbyte.
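To make the idea concrete, here is a rough sketch of what that synchronization step might look like: mapping an Airbyte-style JSON schema to a Glue `TableInput` that could be upserted after each sync. The function name, type mapping, and Parquet serde choice are illustrative assumptions, not the connector's actual code:

```python
# Hypothetical sketch: map an Airbyte-style JSON schema to a Glue TableInput
# so the catalog can be updated alongside the S3 write. Names and the type
# mapping are assumptions for illustration, not existing connector code.
JSON_TO_HIVE = {
    "string": "string",
    "integer": "bigint",
    "number": "double",
    "boolean": "boolean",
}

def build_table_input(stream_name, json_schema, s3_location):
    # Translate each top-level JSON schema property to a Glue column.
    columns = [
        {"Name": name, "Type": JSON_TO_HIVE.get(prop.get("type"), "string")}
        for name, prop in json_schema.get("properties", {}).items()
    ]
    return {
        "Name": stream_name,
        "StorageDescriptor": {
            "Columns": columns,
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "TableType": "EXTERNAL_TABLE",
    }

# With boto3, the destination could then upsert the definition after a sync:
#   glue = boto3.client("glue")
#   try:
#       glue.update_table(DatabaseName=db, TableInput=table_input)
#   except glue.exceptions.EntityNotFoundException:
#       glue.create_table(DatabaseName=db, TableInput=table_input)
```

This way the catalog schema would always reflect exactly what Airbyte wrote, rather than waiting on an inference job.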

itaseskii commented 1 year ago

Hey @blarghmatey, is an explicit integration from the connector necessary for updating the metastore? From an initial reading, it seems that AWS Glue has schema-inferring jobs that can be scheduled to crawl the S3 data and update the catalog, after which the data can be read, parsed, and queried.

blarghmatey commented 1 year ago

Thank you for the question. While Glue does have the option to use a crawler to populate that schema information, there are a couple of primary issues that come out of that.

While my personal interest lies in integration with Glue, I think it would be worthwhile to build this around an interface that allows plugging in an implementation for an arbitrary metastore. That way, anyone who isn't using Glue, or isn't even on AWS, can still take advantage of the workflow where data synced via Airbyte is immediately queryable with whichever engine they rely on. The most obvious next target is the Hive Metastore, but project Nessie is another system being built with that use case in mind, and there are likely to be other implementations in that category of utility.
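One way to picture that pluggable design (a hedged sketch; the interface name, method signatures, and the in-memory stand-in are invented here for illustration, not an Airbyte API):

```python
# Hypothetical sketch of a pluggable metastore interface that the S3 (or
# GCS/Azure) destination could call during write/reset operations.
from abc import ABC, abstractmethod

class Metastore(ABC):
    """Keeps an external catalog in sync with what Airbyte writes."""

    @abstractmethod
    def register_table(self, database: str, table: str,
                       schema: dict, location: str) -> None:
        """Create or update the table definition after a sync."""

    @abstractmethod
    def drop_table(self, database: str, table: str) -> None:
        """Remove the table definition on reset."""

class InMemoryMetastore(Metastore):
    """Trivial stand-in; real implementations would target Glue, the Hive
    Metastore, or Nessie behind the same interface."""

    def __init__(self):
        self.tables = {}

    def register_table(self, database, table, schema, location):
        self.tables[(database, table)] = {"schema": schema, "location": location}

    def drop_table(self, database, table):
        self.tables.pop((database, table), None)
```

The destination would only ever talk to the `Metastore` interface, so supporting a new catalog means adding one implementation rather than touching the write path.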

Feel free to add any further/clarifying questions if my answer is unclear or if there is a more holistic way that we might approach this. (It's likely that users of the Azure and Google object storage destinations would benefit from similar functionality)

RealChrisSean commented 1 year ago

Hi @blarghmatey did you want me to assign this ticket to you?

blarghmatey commented 1 year ago

Thanks for the question @RealChrisSean. I don't have the time or context at the moment to take on this work, so I'm hoping that someone will be able to help with the implementation. I'm happy to help work through the details and test the implementation.

itaseskii commented 1 year ago

After some discussion with @blarghmatey on the scope of this task, I will assign it to myself and start working on it ASAP, since it is a relatively urgent task.

RealChrisSean commented 1 year ago

Hi @itaseskii I have assigned this ticket to you. Good luck. :)