You may want to look into using a DW (dedicated SQL data warehouse) instead of Azure SQL Database; from there you can use PolyBase to load the data very quickly into the DW.
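If you go the DW route, the Databricks Azure Synapse connector can drive the PolyBase load for you. A minimal sketch, assuming an existing DataFrame `df`, the built-in `com.databricks.spark.sqldw` connector, and a Blob storage container for staging; every server, account and table name below is a placeholder:

```scala
// Sketch only: <dw-server>, <dw-db>, <user>, <password>, <container>,
// <storage-account> and dbo.TargetTable are all placeholders.
df.write
  .format("com.databricks.spark.sqldw")
  .option("url",
    "jdbc:sqlserver://<dw-server>.database.windows.net:1433;database=<dw-db>;user=<user>;password=<password>")
  // Staging location the connector writes to before the data is pulled into the DW.
  .option("tempDir", "wasbs://<container>@<storage-account>.blob.core.windows.net/tempdir")
  // Forward the cluster's storage credentials to the DW for the external load.
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "dbo.TargetTable")
  .mode("append")
  .save()
```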
There are a couple of tuning suggestions in this article. Whether or not to use TABLOCK for parallelism depends on the type of table you're using (heap vs. clustered columnstore), and try increasing the batch size to 1048576. Ordering the data prior to the insert may be another lever.
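For reference, a rough sketch of how those levers map onto the azure-sqldb-spark connector's bulk copy options; the server, credentials, table and sort column are placeholders, and the right values depend on your table:

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Placeholder connection details for the target Azure SQL database.
val bulkCopyConfig = Config(Map(
  "url"               -> "<server>.database.windows.net",
  "databaseName"      -> "<database>",
  "dbTable"           -> "dbo.TargetTable",
  "user"              -> "<user>",
  "password"          -> "<password>",
  "bulkCopyBatchSize" -> "1048576", // larger batches, as suggested above
  "bulkCopyTableLock" -> "true",    // TABLOCK; mainly useful for parallel loads into a heap
  "bulkCopyTimeout"   -> "0"        // 0 = no timeout for long-running loads
))

// `df` is assumed to be the DataFrame being loaded; optionally pre-order it
// before the insert ("<SortColumn>" is a placeholder).
df.sortWithinPartitions("<SortColumn>")
  .bulkCopyToSqlDB(bulkCopyConfig)
```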
That was a helpful article. I changed the largest table in my db into a heap, and the bulk insert then finished within 25 mins for 180+ million records. With indexes, the bulk insert took more than 80 hours with the db in an 8-vCore elastic pool.
I have a data frame with around 5 million rows and am looking to bulk upload it to Azure SQL (in a pool - BusinessCritical: Gen5, 4 vCores - where only this db exists).
I tried increasing the cluster size but still didn't see a big improvement; with E8s_v3 and 8 worker nodes it still took an hour to insert only 300,000 records.
I am unsure what's missing. For testing, I download the data from the source db into a data frame and just bulk load that data frame into another Azure SQL db that has the exact same structure (created with the DACPAC of the source db).
Below is a snippet of my code:

```scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val queryConfig = Config(Map(
  "url"            -> TargetDatabase.ServerName,
  "databaseName"   -> TargetDatabase.DatabaseName,
  "dbTable"        -> target,
  "user"           -> TargetDatabase.ServerUser,
  "password"       -> TargetDatabase.ServerPassword,
  "connectTimeout" -> "0",
  "queryTimeout"   -> "0"
))

val df = spark.read.sqlDB(queryConfig)
```