Azure / azure-sqldb-spark

This project provides a client library that allows Azure SQL DB or SQL Server to act as an input source or output sink for Spark jobs.
MIT License

Question: how to improve performance of bulkCopyToSqlDB #53

Closed ghost closed 4 years ago

ghost commented 4 years ago

I have a DataFrame with around 5 million rows and I'm looking to bulk upload it to Azure SQL (in an elastic pool - Business Critical: Gen5, 4 vCores - where only this database exists).

I tried scaling up the cluster but didn't see much improvement; even with E8s_v3 and 8 worker nodes it took an hour to insert only 300,000 records.

I am unsure what's missing. For testing, I read the data from the source database into a DataFrame and simply bulk load that DataFrame into another Azure SQL database with the exact same structure (created from the DACPAC of the source database).

Below is a snippet of my code:

```scala
val queryConfig = Config(Map(
  "url"            -> TargetDatabase.ServerName,
  "databaseName"   -> TargetDatabase.DatabaseName,
  "dbTable"        -> target,
  "user"           -> TargetDatabase.ServerUser,
  "password"       -> TargetDatabase.ServerPassword,
  "connectTimeout" -> "0",
  "queryTimeout"   -> "0"
))

val df = spark.read.sqlDB(queryConfig)
```
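The write side isn't shown above; for reference, a minimal sketch of the bulk copy call using this library's `bulkCopyToSqlDB`, reusing the same (assumed) config names as the read snippet:

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Write-side config; the TargetDatabase fields and `target` table name
// are placeholders taken from the read config above.
val writeConfig = Config(Map(
  "url"          -> TargetDatabase.ServerName,
  "databaseName" -> TargetDatabase.DatabaseName,
  "dbTable"      -> target,
  "user"         -> TargetDatabase.ServerUser,
  "password"     -> TargetDatabase.ServerPassword
))

// Bulk copy the DataFrame into the target table.
df.bulkCopyToSqlDB(writeConfig)
```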

david-nguyen commented 4 years ago

You may want to look into using a DW instead of SQL DB; from there you can use PolyBase to load the data into the DW very quickly.
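For illustration, one way to do that from Spark is the Azure SQL DW (Synapse) connector, which stages data in blob storage and loads it through PolyBase. This is a rough sketch, not the approach the commenter spelled out; the server, table, and storage paths are placeholders:

```scala
// Hypothetical PolyBase-backed load into a SQL DW via the spark-sqldw connector.
df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=MyDW;user=...;password=...")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "dbo.TargetTable")
  .option("tempDir", "wasbs://container@myaccount.blob.core.windows.net/tempdir")
  .save()
```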

Illuminae commented 4 years ago

There are a couple of tuning suggestions in this article. Whether or not to use TABLOCK for parallelism depends on the type of table you're using (heap vs. clustered columnstore index), and try increasing the batch size to 1048576. Ordering the data prior to the insert may be another lever.
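As a sketch of those levers with this library's bulk-copy options (`bulkCopyBatchSize`, `bulkCopyTableLock`, `bulkCopyTimeout`), assuming the same config names as the snippet above and a placeholder sort column:

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Tuned bulk-copy config; credential/table names are assumptions from the earlier snippet.
val tunedConfig = Config(Map(
  "url"               -> TargetDatabase.ServerName,
  "databaseName"      -> TargetDatabase.DatabaseName,
  "dbTable"           -> target,
  "user"              -> TargetDatabase.ServerUser,
  "password"          -> TargetDatabase.ServerPassword,
  "bulkCopyBatchSize" -> "1048576", // the CCI rowgroup size suggested above
  "bulkCopyTableLock" -> "true",    // TABLOCK; generally only appropriate for heap targets
  "bulkCopyTimeout"   -> "0"        // no timeout for long-running loads
))

// Optionally order by the clustered key (placeholder column name) before the insert.
df.sort("someClusteredKeyColumn").bulkCopyToSqlDB(tunedConfig)
```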

ghost commented 4 years ago

> There are a couple of tuning suggestions in this article. Whether or not to use TABLOCK for parallelism depends on the type of table you're using (heap vs. clustered columnstore index), and try increasing the batch size to 1048576. Ordering the data prior to the insert may be another lever.

That was a helpful article. I changed the largest table in my database into a heap, and the bulk insert then completed within 25 minutes for 180+ million records. With indexes, the bulk insert took more than 80 hours with the database in an 8 vCore elastic pool.