Azure / azure-kusto-spark

Apache Spark Connector for Azure Kusto

Ingest without spark temp tables #336

Closed: jainaashish closed this issue 10 months ago

jainaashish commented 1 year ago

I'm using the code below to ingest data from Spark into Kusto, but each ingestion creates hidden temp tables in Kusto that are never cleaned up. Is there a way to tell the connector not to create these temp tables, or to clean them up afterwards?

# Build ingestion properties via the connector's JVM class; arguments are positional.
sp = sc._jvm.com.microsoft.kusto.spark.datasink.SparkIngestionProperties(True, None, None, None, None, extentsCreationTime, None, None)
finalResult.write \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("spark.synapse.linkedService", linked_service) \
    .option("kustoDatabase", database) \
    .option("sparkIngestionPropertiesJson", sp.toString()) \
    .option("kustoTable", table) \
    .option("tableCreateOptions", "CreateIfNotExist") \
    .mode("Append") \
    .save()
ag-ramachandran commented 1 year ago

Hello @jainaashish, the following is the lifecycle of these temp tables:

Normal lifecycle

a) Each Spark worker creates these temp tables so that ingestion can happen independently. When the tables are created, they are marked hidden, and an auto_delete policy is applied to make them disappear in 7 days (this is a fallback, an insurance of sorts; a sketch of this policy follows the list)

b) When the Spark job runs and completes, the tables are dropped at the end of the run
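
For reference, the fallback policy applied at table creation looks roughly like the following (a sketch: the exact ExpiryDate is computed at creation time, and <TempTable> is a placeholder):

.alter table <TempTable> policy auto_delete @'{"ExpiryDate": "<creation time + 7d>", "DeleteIfNotEmpty": true}'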

When do tables get left behind

Tables are left behind when a job aborts without an opportunity to call the cleanup hook. This can happen due to user failures, unhandled errors, or node failures on the worker. This is where the 7-day policy comes in: the tables left behind are cleaned up after 7 days.

You can verify this by running:

.show table <Table> policy auto_delete
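
If you want to audit a database for leftover temp tables, the following is a minimal sketch using the azure-kusto-data Python SDK. The "sparkTempTable_" prefix is an assumption about the connector's temp-table naming (check your cluster for the actual names), and the cluster and database values are placeholders:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://<cluster>.kusto.windows.net"  # placeholder
database = "<database>"                          # placeholder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# List tables matching the assumed temp-table prefix.
tables = client.execute_mgmt(
    database, ".show tables | where TableName startswith 'sparkTempTable_'")
for row in tables.primary_results[0]:
    name = row["TableName"]
    # Check whether the auto_delete fallback policy is present on each one.
    policy = client.execute_mgmt(database, f".show table {name} policy auto_delete")
    print(name, policy.primary_results[0])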

Known issue - Rare scenario

There is one rare case where the cleanup is aborted. IIRC, the following sequence causes it:

a) Temp tables are created and the policy is applied
b) Ingests are queued
c) The job aborts and the temp tables are cleaned up, but some queued ingests are still unfinished; when those ingests complete, they recreate the table without a delete policy

This was rare, and given the timing-dependent nature of the issue, it was very hard to fix.

In Conclusion

a) If there are policies on the tables already, we have a mitigation in place: the tables should get dropped within 7 days anyway.
b) If we don't have a policy, it's probably a one-off. We'd probably have to mitigate it manually in some way (run a cleanup on a schedule).
c) There is an option called writeMode that can be set to Queued in Spark (see the sketch after this list). In this mode we do not create temp tables at all; ingests are queued and the job exits. The only difference from the default mode is that consistency is eventual, i.e. the Spark job will complete, but it takes a little longer for the data to show up in the tables.
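
For example, switching the snippet from the question to queued writes is a one-option change (a sketch reusing the same placeholder variables; writeMode and the Queued value are the connector option described in point c above):

finalResult.write \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("spark.synapse.linkedService", linked_service) \
    .option("kustoDatabase", database) \
    .option("kustoTable", table) \
    .option("tableCreateOptions", "CreateIfNotExist") \
    .option("writeMode", "Queued") \
    .mode("Append") \
    .save()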

Relevant code references

Table create : https://github.com/Azure/azure-kusto-spark/blob/0f0892966402bd4d19ea4888574fbe27fc27f51e/connector/src/main/scala/com/microsoft/kusto/spark/utils/CslCommandsGenerator.scala#L54C3-L54C3

Table policy alter : https://github.com/Azure/azure-kusto-spark/blob/0f0892966402bd4d19ea4888574fbe27fc27f51e/connector/src/main/scala/com/microsoft/kusto/spark/utils/CslCommandsGenerator.scala#L179C8-L179C8

Drop table after run : https://github.com/Azure/azure-kusto-spark/blob/0f0892966402bd4d19ea4888574fbe27fc27f51e/connector/src/main/scala/com/microsoft/kusto/spark/datasink/FinalizeHelper.scala#L118C23-L118C49

jainaashish commented 1 year ago

Thanks @ag-ramachandran for the detailed response. I don't mind a little extra latency, so I will look into the writeMode option.

For the leftover temp tables, I saw there was no auto_delete policy on them, and I saw this across multiple databases, so it might not be such an edge case. I'll keep an eye out and respond if I encounter it again.