Table is not responding in bulkCopyToSqlDB

demoV commented 4 years ago

Hi,

We are processing ~400M records. when we are trying to bulk insert records the Spark job is failing with query time out. And also the table (Destination to insert) is not responding for some time.

our config is

{ "url": "<URL>", "databaseName": "<DB_NAME>", "dbTable": "<DB_TABLE>" "bulkCopyBatchSize": 100000, "bulkCopyTimeOut": 600, "bulkCopyTableLock": "true", "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver" } We have tried with more than 1000 partitions as well.

Please help to tackle this issue.

MironAtHome commented 4 years ago

Have you tried to explore what exactly causes timeout? Is it possible table was in use and query, as per above, could not secure possession of exclusive lock on underlying table? Regarding "the table (Destination to insert) is not responding for some time." that would probably be expected, since you are using table lock. If you need to load data and use table simultaneously I recommend removing option "bulkCopyTableLock": "true" and reducing size of batch. This will lengthen loading time, but will be more friendly to what you are trying to achieve. If your requirements more stringent, please consider partitioning table loaded from Spark.

ghost commented 4 years ago

Does your destination table have indexes? I had performance issue for large tables with indexes.

Droping indexes before bulk insert and re-creating indexes at the end helped me improve performance significantly. I was able to bulk insert 180 million records within 23 mins to a heap. Azure SQL elastic pool with general purpose 8 v cores and cluster with 8 work nodes of DS12v2

Link below would be useful: https://techcommunity.microsoft.com/t5/datacat/turbo-boost-data-loads-from-spark-using-sql-spark-connector/ba-p/305523#

arvindshmicrosoft commented 4 years ago

@demoV as your question is quite environment-specific it is difficult to give prescriptive guidance. The general tips I would mention are:

Ensure that the principles and tips mentioned in the data loading guidance are followed
For larger amounts of data which are eventually to be used for analytical or reporting workloads, we do recommend loading data directly into Clustered Columnstore Indexes as mentioned in Columnstore Indexes Data Loading Guidance
If you are loading data into Azure SQL Database, please note that the recovery model is full. That means that the transaction log can quickly become a bottleneck, as log I/O is governed and has limits in Azure SQL DB. However, using columnstore indexes as mentioned previously, greatly reduces the amount of log I/O and consequently can have benefits in Azure SQL DB cases.
The previous comments on this thread are also very important to troubleshoot the issue

Please let me know if there are any other questions, else I would like to close this thread by July 3, 2020.

arvindshmicrosoft commented 4 years ago

Closing per previous comment and subsequent inactivity.

Azure / azure-sqldb-spark

Table is not responding in bulkCopyToSqlDB #59