airbytehq / write-for-the-community

Contribute and collaborate on educational content for the Airbyte Community.
MIT License
42 stars 8 forks source link

Replicate Microsoft SQL Server Database To Postgres Using CDC #173

Open blacksheepanalytics113 opened 1 year ago

blacksheepanalytics113 commented 1 year ago

Replicating data from Microsoft SQL Server (MSSQL) to PostgreSQL with Airbyte provides faster analytical queries. Data analytics need dedicated compute resources. When processing terabytes to petabytes of data for the purpose of analytics, SQL server may be slow and expensive due to its per socket, CPU-based billing model. Moving to Redshift significantly reduces processing time and running costs.

Airbyte Cloud allows you to seamlessly move data between any data source and destination, including popular databases, data warehouses, and business applications. Airbyte's database replication to datawarehouses uses change data capture (CDC) with checkpointing capabilities and scheduling to simply pick up from where you left off.

This tutorial will take you through the critical steps to set up Airbyte Cloud and replicate data from your SQL Server instance running in PostgreSQL.

CDC pipeline to move data from Microsoft SQL to PostgreSQL

Airbyte’s Microsoft SQL Server CDC source extracts the orders in Microsoft SQL Server using Change Data Capture (CDC).

A relational database like Microsoft SQL Server maintains a transaction log to record every state-changing operation such as inserts, updates, and deletes. CDC mechanism tails this transaction log to detect the database entities that have been changed and streams them as change events. That way, we can obtain incrementally updated orders to avoid costly full table extractions.

Airbyte runs this extraction at a scheduled interval, for example, every hour, day, week, or so. Let’s schedule it for every five minutes to get more fresh data. Extracted orders are written to a PostgreSQL Database.

Prerequisites Below are the prerequisite tools you’ll need to get started on replicating up your SQL Server data to PostgreSQL.

  1. You’ll need to get an Airbyte Cloud account to replicate the data. You can sign up for Airbyte Cloud here.
  2. You will need an instance of SQL Server that you can connect to remotely. You can get a hosted SQL Server instance through
  3. Digital Ocean Droplet at the link here. You will also need an instance of PostgreSQL that you can connect to remotely. You can get started with PostgreSQL by creating an Digital Ocean account at the link here.

Step 1: Set up SQL Server In Digital Ocean Droplet

In this example, we will configure a new SQL Server instance hosted on Digital Ocean as our Airbyte Cloud Source. Log into Digital Ocean Account and go to Droplet > Launch Console Screenshot (142)

The article assumes you have Docker Compose installed on your machine. For better performance, it is recommended to have at least 4GB of RAM and adequate disk space.

Screenshot (144)

Install MSSQL On Digital Ocean Using Docker & Docker Compose

  1. Pull the SQL Server container image from the Docker Hub by running the following command: docker pull mcr.microsoft.com/mssql/server

  2. Run a container from the SQL Server image by executing the following command: docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=YourPassword' -p 1433:1433 --name sql_server_container -d mcr.microsoft.com/mssql/server

    Replace YourPassword with a strong password for the SQL Server system administrator (sa) account. The command will create and start a container named sql_server_container using the SQL Server image.

  3. Wait for a few moments to allow the container to initialize and start the SQL Server service.

  4. Verify that the container is running by executing the following command: docker ps

You should see the sql_server_container listed along with its details. Congratulations! You have successfully installed SQL Server using Docker on Ubuntu. The SQL Server instance is accessible on port 1433, and you can connect to it using SQL Server Management Studio (SSMS) or any other SQL client tool by providing the appropriate server name or IP address, username, and the password you specified during the container creation.

Screenshot (145)

Setup Airbyte We will run Airbyte on your local machine as a separate Docker Compose project. You can follow these instructions to get it up and running.

You can use any database management tool you prefer. Now that you have your SQL Server instance set up to allow connections Remotely, you can begin configuring the Airbyte Source. Login and create a new connection and select Microsoft SQL Server as the source type and give it a name. You can find more information about the SQL Server Airbyte connector at the link here. Enter the host, the port, the database name (airbyte in this case), and the user and password you used when setting up SQL Server. Once configured, click on set up the destination.

Screenshot (147) Make sure to select CDC as the replication method. We are going to use SSH for this example. You can select any one of the SSH method To Connect to server using SSH - username@IpAddress Screenshot (148)

Provide the following values in the UI. I’m connecting to my Remote SQL Server installation, which runs in the default port. Feel free to adjust the values based on your environment.

Step 2: Set up PostgreSQL as the Airbyte destination To set up PostgreSQL as your Airbyte destination you will have to allow connections from Airbyte Cloud to your cluster. Login to Digital Ocean and go to Database> Create Database > Select PostgreSQL as Your database engine.

Screenshot (146)

After successfully creating database check for connection details Screenshot (149)

To set up your destination, select PostgreSQL as your destination type and give it a name. You can find more information about the PostgreSQL Airbyte connector at the link here. Enter the host, the port, and the database name (dev in this case), and also enter the user and password you used when creating your postgres and click on set up Destination.

Screenshot (151)

Step 3: Set up a SQL Server to PostgreSQL connection Once the source and destination are configured, you can access your connection settings. You can set the Replication frequency depending on how often you want Airbyte to replicate your data. 6294bbb8f7f39bf3af47c15e_37uH8thVYkzcJRPlUCbBkRRNql6tOm_UPXhi4T05O9OlEZ7ksvQ3IEwrzz31T82IYEnbfre4ZwhsNNWXzGzFsx9KgMc1F0AqBkugU6jYwnlHkwafNCt0EKEBlWqt7Cv62xGrLCJOWEOsigbjWw

Next, you can choose which tables to sync and set the sync frequency and the sync mode for each table individually. This example will select the customers table and set the Sync more to Incremental | Append.

https://assets-global.website-files.com/6064b31ff49a2d31e0493af1/6294bbb82dcd4ab1010f8ff4_3G0qRMYSRzzI9VdJ4oEFJdmriG3pVpo6n6DXQpLGL8qrvM6DxzJXmyYIlxtns9h-gmB0rkprKPxjreeUT7kH3ua9-s9iqcrP1d_VM5XnbXwPmToHJUIiFLMWtrrpX42FXsyDLbYhcRQ-URB3Ig.png

You can also choose between using Raw Data or Basic Normalization. We will select Basic Normalization to set up the connection in this example. You can also choose to apply custom data transformations, but we will keep it simple by skipping the data transformation part in this example.

Once configured, save the connection and select Sync now to run your first sync once configured. Once the sync is complete, you should see how many rows were replicated (849 in this case). https://assets-global.website-files.com/6064b31ff49a2d31e0493af1/6294bbb9ff6182b2e6624b71_LeBkf3ads67reBn5tI4kssBTx6M7CU1KOyH6-EA0aQ3eXsnQy63O3pi3DlTxJqAuv3N9CZDyveWBzQ4IAc5q9h90vKD0yUvOqdk-5EPzvO8W3imLLHlrgXl6LTc_2UIOlBlki8pHbk_GNGBQiQ.png

To view the replicated data, go to the Postgres (PgAdmin4) select your database to view the tables created by Airbyte Cloud. Screenshot (153)

To test out the incremental sync, you can add some more rows to your SQL Server table. In this example, 10 more rows were added. Once you add some more data you can run another sync. The 10 newly added rows are replicated up to PostgreSQL. https://assets-global.website-files.com/6064b31ff49a2d31e0493af1/6294bbbad86e9fbb625cd628_jSAWLE1jeq9PBseKJfCh379Llek-qBpxTp0dwHHmGeQTumw75Q0ila1sZq9LWPl52cOEGgqpR4S-JNoCpkWarOZoLs__J2gVX1P4WfCeQa_B5J29XjeIF3VhsM2zg2nnqZlKFmb0ieqjLX_-fQ.png

Conclusion To summarize, we look at how we can replicate data from SQL Server to PostgreSQL using Airbyte Cloud by:

Configuring a SQL Server Airbyte Cloud source. Configuring a PostgreSQL Airbyte Cloud destination. Creating an Airbyte connection that automatically replicates data from SQL Server to Redshift. Incrementally syncing SQL Server data to PostgreSQL. We know that development and operations teams working on fast-moving projects with tight timelines need quick answers to their questions from developers actively developing Airbyte. They also want to share their learnings with experienced community members who have "been there and done that." Join the conversation at Airbyte's community Slack Channel to share your ideas with over 1000 data engineers and help make everyone's project successful.

About The Author Ajiye is a Data Engineer and a technical Writer . You can anticipate some of his articles coming up on Medium

Linkedin

Sabareh commented 1 year ago

Hey @adunajiye was this published?