Azure / azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB
MIT License
202 stars 121 forks source link

Cosmos Changefeed Spark Streaming stops suddenly #468

Closed 0xbadidea closed 2 years ago

0xbadidea commented 2 years ago

I have a Spark streaming job which reads Cosmos Changefeed data as below, running in a Databricks cluster with DBR 8.2.

cosmos_config = {
  "spark.cosmos.accountEndpoint": cosmos_endpoint,
  "spark.cosmos.accountKey": cosmos_key,
  "spark.cosmos.database": cosmos_database,
  "spark.cosmos.container": collection,
  "spark.cosmos.read.partitioning.strategy": "Default",
  "spark.cosmos.read.inferSchema.enabled" : "false",
  "spark.cosmos.changeFeed.startFrom" : "Now",
  "spark.cosmos.changeFeed.mode" : "Incremental"
}

df_ read = (spark.readStream
                 .format("cosmos.oltp.changeFeed")
                 .options(**cosmos_config)
                 .schema(cosmos_schema)
                 .load())

df_write = (df_ read.withColumn("partition_date",current_date())
                    .writeStream
                    .partitionBy("partition_date")
                    .format('delta')
                    .option("path", master_path)
                    .option("checkpointLocation", f"{master_path}_checkpointLocation")
                    .queryName("cosmosStream")
                    .trigger(processingTime='10 seconds')
                    .start()
            )       

While the job works well ordinarily, occasionally, the streaming stops all of a sudden and the below appears in a loop in the log4j output. Restarting the job processes all the data in the 'backlog'. What could be causing this?

22/02/27 00:57:58 INFO HiveMetaStore: 1: get_database: default
22/02/27 00:57:58 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_database: default   
22/02/27 00:57:58 INFO DriverCorral: Metastore health check ok
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Starting...
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Start completed.
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Shutdown initiated...
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Shutdown completed.
22/02/27 00:58:07 INFO MetastoreMonitor: Metastore healthcheck successful (connection duration = 88 milliseconds)
22/02/27 00:58:50 INFO RxDocumentClientImpl: Getting database account endpoint from https://<cosmosdb_endpoint>.documents.azure.com:443