duckdb / duckdb_azure

Azure extension for DuckDB
MIT License

MalformedJsonError due to Databricks identity column #62

Open Johannes-Vink opened 6 months ago

Johannes-Vink commented 6 months ago

I am running DuckDB directly on Azure Databricks, just for the fun of it. My ultimate goal is to pit DuckDB against Databricks Photon, to see whether it can compete with Photon while running on Databricks.

Instead of authenticating to an Azure storage account via SAS keys etc., I've used an Azure Databricks Volume, which is seen as a local path, and that works well: image

A newly created table works. I copied the data into it 10 times and ran OPTIMIZE on it: image

This table has high Delta protocol versions and the following features enabled:

delta.checkpoint.writeStatsAsJson=false
delta.checkpoint.writeStatsAsStruct=true
delta.enableDeletionVectors=true
delta.feature.appendOnly=supported
delta.feature.checkConstraints=supported
delta.feature.deletionVectors=supported
delta.feature.invariants=supported
delta.minReaderVersion=3
delta.minWriterVersion=7

An old table, unmodified for some time, is not working: image

That table has much lower Delta protocol versions:

delta.minReaderVersion=1
delta.minWriterVersion=6

I did upgrade it to a higher version:

delta.feature.appendOnly=supported
delta.feature.changeDataFeed=supported
delta.feature.checkConstraints=supported
delta.feature.generatedColumns=supported
delta.feature.identityColumns=supported
delta.feature.invariants=supported
delta.minReaderVersion=3
delta.minWriterVersion=7

But I still get the same JSON error from DuckDB.
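For illustration, the feature lists of the working and the failing table can be diffed with a few lines of Python to narrow down which feature might be responsible. This is only a sketch: the helper name `parse_features` is hypothetical, and the property strings are copied from the listings above.

```python
def parse_features(props: str) -> set[str]:
    """Collect the names of delta.feature.* entries marked as supported."""
    prefix = "delta.feature."
    features = set()
    for item in props.split():
        key, _, value = item.partition("=")
        if key.startswith(prefix) and value == "supported":
            features.add(key[len(prefix):])
    return features


# Properties of the newly created table (readable by DuckDB).
working = parse_features(
    "delta.checkpoint.writeStatsAsJson=false "
    "delta.checkpoint.writeStatsAsStruct=true "
    "delta.enableDeletionVectors=true "
    "delta.feature.appendOnly=supported "
    "delta.feature.checkConstraints=supported "
    "delta.feature.deletionVectors=supported "
    "delta.feature.invariants=supported "
    "delta.minReaderVersion=3 delta.minWriterVersion=7"
)

# Properties of the old table after the upgrade (still failing).
failing = parse_features(
    "delta.feature.appendOnly=supported "
    "delta.feature.changeDataFeed=supported "
    "delta.feature.checkConstraints=supported "
    "delta.feature.generatedColumns=supported "
    "delta.feature.identityColumns=supported "
    "delta.feature.invariants=supported "
    "delta.minReaderVersion=3 delta.minWriterVersion=7"
)

# Features present only on the failing table are the candidates to investigate.
only_in_failing = sorted(failing - working)
print(only_in_failing)
```

The diff only lists candidates. It does not prove which feature triggers the error.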

Then I found out that the original table does have a Databricks identity column (introduced in Databricks 10.4, but I think that this feature is outside the open source Delta spec):

CREATE TABLE IF NOT EXISTS SCHEMA.TABLE (
  ID bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
)

My new table was copied without the identity column, and that works fine.

Looking into the Delta table structure, the .crc file contains the following entry:

"schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"ID\",\"type\":\"long\",\"nullable\":true,\"metadata\":{\"delta.identity.start\":1,\"delta.identity.step\":1,\"delta.identity.allowExplicitInsert\":false}}]}",

And further down:

"writerFeatures": [
  "identityColumns",
  "deletionVectors"
]
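To check other tables for the same pattern, the schemaString can be parsed and scanned for `delta.identity.*` metadata keys. This is a minimal sketch using only the standard library. The function name `identity_columns` is hypothetical, and the input is the unescaped form of the schemaString shown above.

```python
import json

# The schemaString value from the .crc entry above, with the outer JSON escaping removed.
schema_string = (
    '{"type":"struct","fields":[{"name":"ID","type":"long","nullable":true,'
    '"metadata":{"delta.identity.start":1,"delta.identity.step":1,'
    '"delta.identity.allowExplicitInsert":false}}]}'
)


def identity_columns(schema_json: str) -> list[str]:
    """Return the names of fields whose metadata carries delta.identity.* keys."""
    schema = json.loads(schema_json)
    return [
        field["name"]
        for field in schema.get("fields", [])
        if any(key.startswith("delta.identity.") for key in field.get("metadata", {}))
    ]


print(identity_columns(schema_string))
```

An empty result means that no identity metadata was found in that schema.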

So the good news:

The bad news: