Azure / azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB
MIT License

Streaming insert onto Cosmos db is not working #409

Open deepaksekaranz opened 4 years ago

deepaksekaranz commented 4 years ago

I am trying to insert a streaming DataFrame into Cosmos DB from Databricks, but the streaming job is not starting (it has been in the initializing stage for hours).

Jar : azure_cosmosdb_spark_2_4_0_2_11_3_3_0_uber.jar

Cluster spark version - 2.4.5

Cluster scala version - 2.11

[Screenshot: streaming job stuck in the initializing stage]
revinjchalil commented 4 years ago

Do the executor logs show progress? Below is a snippet using the same v3.3.0 jar on Azure Databricks (ADB) to ingest a streaming DataFrame into a Cosmos DB container. Please follow this and reach out directly if needed.

[Screenshot: streaming write snippet]

[Screenshot: streaming query progress]
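Since the connector snippet above survives only as screenshots, here is a minimal sketch of the same pattern: a Structured Streaming write into Cosmos DB through the azure-cosmosdb-spark sink (`CosmosDBSinkProvider`). The `rate` source, endpoint, key, database, and container values are placeholders/assumptions to adapt, not the exact code from the screenshot; the Spark calls assume the connector jar is attached to the cluster.

```python
# Hedged sketch of a streaming write to Cosmos DB with azure-cosmosdb-spark.
# All <angle-bracket> values are placeholders; verify config keys against the
# connector version you are running.

write_config = {
    "Endpoint": "https://<cosmos-account>.documents.azure.com:443/",  # placeholder
    "Masterkey": "<primary-key>",                                     # placeholder
    "Database": "<database>",                                         # placeholder
    "Collection": "<container>",                                      # placeholder
    "Upsert": "true",
    "checkpointLocation": "/tmp/cosmos-stream-checkpoint",            # assumption
}

def start_stream_to_cosmos(spark, rows_per_second=50):
    """Ingest Spark's built-in rate source into Cosmos DB.

    Requires a running SparkSession with the azure-cosmosdb-spark uber jar
    on the classpath; defined (not called) here so the sketch stays inert.
    """
    df = (spark.readStream
          .format("rate")                               # built-in test source
          .option("rowsPerSecond", rows_per_second)
          .load()
          # Cosmos DB documents need a string 'id' field
          .selectExpr("CAST(value AS STRING) AS id", "timestamp"))
    return (df.writeStream
            .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")
            .outputMode("append")
            .options(**write_config)
            .start())
```

A stuck "initializing" stage often means the query never produced a first micro-batch; the executor logs (and the checkpoint location) are the first places to look.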

Thanks, Revin


deepaksekaranz commented 4 years ago

It doesn't work if I increase the rate to anything greater than 5 rows/sec. Could you try the same with 50 rows/sec?

revinjchalil commented 4 years ago

It depends on the throughput (RU/s) provisioned on the target Cosmos DB container, and also on the number of cores available on the Spark cluster. If the target container does not have enough RU/s, ingestion will be throttled; that is most likely what is happening here, so increasing the RU/s should help.
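As a rough sanity check on the RU/s point above, the commonly cited Cosmos DB estimate is about 5 RU to insert a ~1 KB document with default indexing (the actual cost depends on document size and indexing policy, so treat the constant as an assumption):

```python
# Back-of-envelope check: does an ingestion rate fit the provisioned RU/s?
RU_PER_INSERT = 5.0   # assumption: ~1 KB document, default indexing policy

def required_ru_per_sec(rows_per_second, ru_per_insert=RU_PER_INSERT):
    """Approximate steady-state RU/s consumed by an insert-only stream."""
    return rows_per_second * ru_per_insert

# 50 rows/sec needs roughly 250 RU/s, comfortably under a 3,000 RU/s
# container, so sustained throttling at that rate suggests something else:
# bursty micro-batches, larger documents, or many concurrent writer tasks.
print(required_ru_per_sec(50))   # 250.0
```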

Thanks, Revin


deepaksekaranz commented 4 years ago

I did try that as well! It didn't work.

[Screenshots: streaming query setup and progress after increasing the rate]
revinjchalil commented 4 years ago

The streaming ingestion in the screenshot below uses the same config as yours, i.e. 3,000 RU/s and a 50 rows/sec ingestion rate. As you can see, the micro-batches complete in under 1 second. I used coalesce(10) on the streaming DataFrame to limit throttling, but it should work without this as well.

[Screenshot: streaming write micro-batch progress]

Please feel free to set up a screen-share session if you would like to troubleshoot online.

Thanks, Revin rechalil@microsoft.com

anuthereaper commented 3 years ago

Hi @revinjchalil,

Can you explain a bit more what "coalesce" does? I could not find the documentation for it. In our implementation, we are writing streaming data into Cosmos. The problem is that there are occasional spikes in the transaction rate, which result in throttling in Cosmos. We have set the throughput to 1K-10K and adjusted the indexing appropriately. Our micro-batch runs at 2-second intervals. I would like to look at options for limiting this throttling, or at least retrying the writes to Cosmos when they are throttled. Can you please advise?
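For readers landing here: `coalesce(n)` is the standard Spark DataFrame method that reduces the partition count without a shuffle, so at most `n` tasks write to Cosmos concurrently, which smooths bursts against the container's RU/s budget. Combined with the connector's throttling-retry options, a hedged sketch looks like the following (the retry key names and values are assumptions to verify against your connector version, and the connection values are placeholders):

```python
# Limiting throttling on a streaming write to Cosmos DB: cap writer
# parallelism with coalesce() and let the connector retry on 429s.
# Verify the retry key names/values against your azure-cosmosdb-spark version.

throttle_tolerant_config = {
    "Endpoint": "https://<cosmos-account>.documents.azure.com:443/",  # placeholder
    "Masterkey": "<primary-key>",                                     # placeholder
    "Database": "<database>",                                         # placeholder
    "Collection": "<container>",                                      # placeholder
    "Upsert": "true",
    "WritingBatchSize": "100",                          # smaller batches smooth spikes
    "query_maxretryattemptsonthrottledrequests": "10",  # retries on throttled (429) requests
    "query_maxretrywaittimeinseconds": "30",            # total back-off budget per request
    "checkpointLocation": "/tmp/cosmos-stream-checkpoint",
}

def write_with_limited_parallelism(stream_df, max_writer_partitions=10):
    """Coalesce the streaming DataFrame before handing it to the Cosmos sink.

    coalesce() narrows existing partitions without a shuffle, so fewer
    tasks hit the container at once. Defined (not called) here; requires
    the connector jar at runtime.
    """
    return (stream_df.coalesce(max_writer_partitions)   # cap concurrent writers
            .writeStream
            .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")
            .outputMode("append")
            .options(**throttle_tolerant_config)
            .start())
```

The trade-off: fewer writer partitions means less burstiness but also lower peak ingest throughput, so `max_writer_partitions` is worth tuning against the provisioned RU/s range.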

Regards, Anupam