airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
15.45k stars 3.98k forks source link

New Destination: Azure Synapse #2149

Closed freesmith closed 7 months ago

freesmith commented 3 years ago

Tell us about the new integration you’d like to have

Which source and which destination? Which frequency? Azure Datawarehouse (synapse) destination Anything from 30'-24h

Describe the context around this new integration

Which team in your company wants this integration, what for? This helps us understand the use case. Data team works with Azure therefore it is interesting for them have this as destination.

Describe the alternative you are considering or using

What are you considering doing if you don’t have this integration through Airbyte? We will use BigQuery although all the infrastructure is in Azure

Related issues

Related PR that had to be reverted

┆Issue is synchronized with this Asana task by Unito

michel-tricot commented 3 years ago

Thank you @freesmith

We are actually about to double down on our destination connectors. Out of curiosity, do you know if they have a JDBC driver available? That would make getting an MVP fairly quick for us.

freesmith commented 3 years ago

Apologies as I am no expert. I tried looking for answers but was not able to find a clear one. Sorry. I'd say yes based on this (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/connect-overview) but again, not clear.

michel-tricot commented 3 years ago

No problem! let us take a look at it

eduardgruy commented 3 years ago

+1. We see Synapse getting traction. Please note that there are couple of solutions under the Azure Synapse workspace.

1st there is Synapse dedicated SQL pool this is comparable to AWS Redshift and you can also use a similar loading strategy. That means JDBC drivers, COPY statement and ADLS Gen2 as staging layer.

2nd solution is Synapse serverless pool which is comparable to AWS Athena. Basically it's a serverless query engine to which you can register external tables. To integrate you would need to provide data in Parquet/CSV,.. on ADLS gen2 and create an external table over it. In ideal scenario you would create 2 Delta lake tables (history & merged) in ADLS gen2 and register both of them as external tables on Synapse serverless pool, but:

3rd solution is Apache spark pool. This is similar to Databricks, but I think it lacks some Delta lake related features ( SQL API for deltalake like merge/delete/..? ). I think you could connect with JDBC/Apache Livy and use it to create, manipulate parquet/delta lake/CSV tables on ADLS Gen2.

Hope this will be helpful to someone.

sherifnada commented 3 years ago

@eduardgruy thanks a ton for the amazing context. This will be super helpful. We are bandwidth constrained at the moment but this is top of the list of priorities for us.

hammadhasandogar commented 2 years ago

+1 Synapse is used in enterprises using Azure stack for their data management, surprise to see that Synapse and ADLS are not part of hundreds of connectors available in Airbyte. @sherifnada - do you have timeline defined for Azure connectors? i.e. Synapse and ADLS.

sherifnada commented 2 years ago

@hammadhasandogar we bumped back Azure synapse and currently have no defined ETA. The fastest way to get this destination into Airbyte is via PR contribution at the moment!

hammadhasandogar commented 2 years ago

@sherifnada thanks for your reply, will hopefully share progress on this, since my team already started looking into it and very much interested on introducing Synapse and ADLS connectors.

jordandakota commented 2 years ago

+1. We see Synapse getting traction. Please note that there are couple of solutions under the Azure Synapse workspace.

1st there is Synapse dedicated SQL pool this is comparable to AWS Redshift and you can also use a similar loading strategy. That means JDBC drivers, COPY statement and ADLS Gen2 as staging layer.

2nd solution is Synapse serverless pool which is comparable to AWS Athena. Basically it's a serverless query engine to which you can register external tables. To integrate you would need to provide data in Parquet/CSV,.. on ADLS gen2 and create an external table over it. In ideal scenario you would create 2 Delta lake tables (history & merged) in ADLS gen2 and register both of them as external tables on Synapse serverless pool, but:

  • AFAIK Delta lake support is not yet done
  • you can't use DML statements in Synapse serverless so you would need to use different processing engine to move data from the history table to the merged one.

3rd solution is Apache spark pool. This is similar to Databricks, but I think it lacks some Delta lake related features ( SQL API for deltalake like merge/delete/..? ). I think you could connect with JDBC/Apache Livy and use it to create, manipulate parquet/delta lake/CSV tables on ADLS Gen2.

Hope this will be helpful to someone.

Since this was posted, the SQL Serverless implementation in Synapse has gone general availability / full support of Delta lake tables:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand#delta-lake

You can't however use SQL Serverless to create or update delta tables, must use spark.

SQL Dedicated supports various drivers:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/connect-overview#supported-drivers-and-connection-strings https://docs.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server?toc=/azure/synapse-analytics/sql-data-warehouse/toc.json&bc=/azure/synapse-analytics/sql-data-warehouse/breadcrumb/toc.json&view=azure-sqldw-latest&preserve-view=true

ryjabe commented 1 year ago

@evantahler curious if this is on your radar, or still in the backlog?

evantahler commented 1 year ago

Thanks for the ping! We are currently focused on rebuilding some of the core architecture to allow for more flexibility & speed for how our destinations operate. Once that work is completed we will revisit adding additional DB sources.

RobLucchi commented 10 months ago

@evantahler Hello,

any temporary solution for the issue of Not enforcing the primary key?

mentioned here https://github.com/airbytehq/airbyte/issues/18408 or https://github.com/airbytehq/airbyte/issues/18297

lfigueira commented 7 months ago

Hi -

CC @evantahler @marcosmarxm

I am facing this issue when connecting to Synapse using the MSSQL connector. From 2149 it appears there is consensus that a new Synapse connector needs to be built. Am I correct? Has there been any progress on this?

evantahler commented 7 months ago

At this time, there are no plans to build a synapse connector. That said, we are about to release a big update for the general-case MSSQL connector which may help - stay tuned!

cc @rodireich