Azure / azure-kusto-spark

Apache Spark Connector for Azure Kusto
Apache License 2.0

Overwrite data option not working #371

Closed aayushsin closed 5 months ago

aayushsin commented 5 months ago

Is your feature request related to a problem? Please describe.
Please add support for the overwrite option when writing to Kusto.

Describe the solution you'd like
I want to replace (overwrite) the data in a Kusto table with a single command from Python, but the overwrite option is not yet supported for Kusto writes. I saw some tickets that suggest using the dropByExtents tag to do this, but there are no examples. Could you please add an example of the tags to use when one needs to overwrite or replace the data in a Kusto table?

Additional context
I am assuming that the tag works either by adding the new data first and then deleting the old data, or by performing both operations at the same time. We do not want a solution where the data is dropped first and added later, which might cause read issues in the meantime.

ag-ramachandran commented 5 months ago

Hello @aayushsin

Kusto targets in the Spark connector are append-only by default, so the data in a DataFrame is appended to the Kusto table. We may need help understanding your scenario. Here are two broad scenarios where we have seen customers hit similar requirements:

a) A full load happens every day, and the tables are replaced daily.

  1. Load the temp tables using the Spark job. A temp table can be named FullLoadTable_yyyyMMdd.
  2. Once the temp tables are loaded, add instructions in a new section of the notebook that swap the tables (viz. FullLoadTable_yyyyMMdd => FullLoadTable).

b) A delta load happens: newer values need to come in and existing values need to be replaced. There is no direct Spark solution for this; the best option is to use a materialized view to get the latest record:

https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/materialized-views/materialized-view-overview
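
For readers looking for something concrete, the two scenarios above can be sketched as Kusto management commands built from Python. This is only a sketch under assumptions, not the connector's API: the thread does not show the actual swap instructions, so the use of `.rename tables` for the swap in scenario (a) is an assumption, as are the helper function names and the table, view, and column names (`LatestRows`, `DeltaTable`, `Id`, `UpdatedAt`). The helpers only build command strings; running them against a cluster would need a Kusto client and is left out here.

```python
from datetime import date


def staging_table_name(base: str, load_date: date) -> str:
    """Name of the dated temp table, following the FullLoadTable_yyyyMMdd pattern."""
    return f"{base}_{load_date:%Y%m%d}"


def swap_tables_command(live: str, staging: str) -> str:
    """Scenario (a): build a command that swaps the staging and live table names.

    Assumption: `.rename tables NewName = ExistingName, ...` is used for the
    swap, so readers never see a dropped or half-loaded table. After the swap,
    the old data sits under the staging name and can be dropped separately.
    """
    return f".rename tables {live} = {staging}, {staging} = {live}"


def latest_record_view_command(view: str, source: str, key: str, timestamp: str) -> str:
    """Scenario (b): build a materialized view that keeps the latest row per key.

    All four names are hypothetical placeholders supplied by the caller.
    """
    return (
        f".create materialized-view {view} on table {source} "
        f"{{ {source} | summarize arg_max({timestamp}, *) by {key} }}"
    )


if __name__ == "__main__":
    staging = staging_table_name("FullLoadTable", date(2024, 1, 15))
    print(swap_tables_command("FullLoadTable", staging))
    print(latest_record_view_command("LatestRows", "DeltaTable", "Id", "UpdatedAt"))
```

In scenario (a), the Spark job would append to the staging table first, and the swap command would run only after that load has finished. This matches the requirement in the question that data is added before the old data is removed.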