I am running duckdb directly on Azure Databricks, just for the fun of it. My ultimate goal is to pit duckdb against Databricks Photon, to see whether it can compete with Photon while running on Databricks.
Instead of authenticating to an Azure storage account via SAS keys etc., I've used an Azure Databricks Volume, which is seen as a local path, and that works well:
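A minimal sketch of what such a query can look like from Python, assuming the duckdb `delta` extension and its `delta_scan` table function are used to read the table. The Volume path and the helper names (`delta_scan_query`, `count_rows`) are hypothetical and only there for illustration; they are not taken from my setup.

```python
def delta_scan_query(table_path: str) -> str:
    """Build a row-count query over a Delta table directory."""
    # Escape single quotes so the path is a valid SQL string literal.
    escaped = table_path.replace("'", "''")
    return f"SELECT count(*) AS row_count FROM delta_scan('{escaped}')"


def count_rows(table_path: str) -> int:
    """Run the query with duckdb (assumes `pip install duckdb`)."""
    import duckdb  # imported lazily so the query builder works without it

    con = duckdb.connect()       # in-memory database
    con.execute("INSTALL delta")  # assumption: the delta extension is used
    con.execute("LOAD delta")
    return con.execute(delta_scan_query(table_path)).fetchone()[0]


# Hypothetical Volume path; a Volume is mounted like a local directory.
EXAMPLE_PATH = "/Volumes/my_catalog/my_schema/my_volume/my_table"

# Usage on a cluster where the path exists:
#   print(count_rows(EXAMPLE_PATH))
```

Because the Volume shows up as an ordinary path, no storage credentials are passed to duckdb in this sketch.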
A newly created table works. I copied the data into it 10x and ran OPTIMIZE on it:
This table has very high Delta protocol versions and the following features enabled:
delta.checkpoint.writeStatsAsJson=false
delta.checkpoint.writeStatsAsStruct=true
delta.enableDeletionVectors=true
delta.feature.appendOnly=supported
delta.feature.checkConstraints=supported
delta.feature.deletionVectors=supported
delta.feature.invariants=supported
delta.minReaderVersion=3
delta.minWriterVersion=7
An old table, unmodified for some time, does not work:
But that table has much lower Delta protocol versions:
delta.minReaderVersion=1
delta.minWriterVersion=6
I upgraded it to a higher version:
delta.feature.appendOnly=supported
delta.feature.changeDataFeed=supported
delta.feature.checkConstraints=supported
delta.feature.generatedColumns=supported
delta.feature.identityColumns=supported
delta.feature.invariants=supported
delta.minReaderVersion=3
delta.minWriterVersion=7
But I still get the same JSON error from duckdb.
Then I found out that the original table has a Databricks identity column (introduced in Databricks 10.4, but I think this feature is outside the open source Delta spec):
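Since both tables now report the same reader and writer versions, one way to narrow down the culprit is to diff the enabled table features. The sketch below uses only the two property lists quoted in this post; the helper names are made up for illustration.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Turn 'key=value key=value ...' into a dict."""
    return dict(item.split("=", 1) for item in text.split())


def enabled_features(props: dict[str, str]) -> set[str]:
    """Names of all delta.feature.* entries in a property dict."""
    prefix = "delta.feature."
    return {key[len(prefix):] for key in props if key.startswith(prefix)}


# Properties of the new table that duckdb can read.
working = parse_properties(
    "delta.checkpoint.writeStatsAsJson=false delta.checkpoint.writeStatsAsStruct=true "
    "delta.enableDeletionVectors=true delta.feature.appendOnly=supported "
    "delta.feature.checkConstraints=supported delta.feature.deletionVectors=supported "
    "delta.feature.invariants=supported delta.minReaderVersion=3 delta.minWriterVersion=7"
)

# Properties of the old table after the upgrade; duckdb still fails on it.
failing = parse_properties(
    "delta.feature.appendOnly=supported delta.feature.changeDataFeed=supported "
    "delta.feature.checkConstraints=supported delta.feature.generatedColumns=supported "
    "delta.feature.identityColumns=supported delta.feature.invariants=supported "
    "delta.minReaderVersion=3 delta.minWriterVersion=7"
)

only_in_failing = sorted(enabled_features(failing) - enabled_features(working))
only_in_working = sorted(enabled_features(working) - enabled_features(failing))

print(only_in_failing)  # ['changeDataFeed', 'generatedColumns', 'identityColumns']
print(only_in_working)  # ['deletionVectors']
```

The diff only lists candidates; it does not prove which of the three features triggers the error.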
CREATE TABLE IF NOT EXISTS SCHEMA.TABLE ( ID bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1) )
My new table was copied without the identity column, and that works fine.
Looking at the Delta table structure, the crc file contains the following entries:
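The identity definition is recorded as `delta.identity.*` keys in the column metadata of the table's schema string, so a table can be checked for identity columns without going through Databricks. A minimal sketch using only the standard library and the schema string quoted in this post; the function name `identity_columns` is hypothetical.

```python
import json

# Schema string as quoted in this post (JSON-decoded once, so without
# the backslash escapes it has inside the crc file).
SCHEMA_STRING = (
    '{"type":"struct","fields":[{"name":"ID","type":"long","nullable":true,'
    '"metadata":{"delta.identity.start":1,"delta.identity.step":1,'
    '"delta.identity.allowExplicitInsert":false}}]}'
)


def identity_columns(schema_string: str) -> list[str]:
    """Return names of fields whose metadata has delta.identity.* keys."""
    schema = json.loads(schema_string)
    names = []
    for field in schema.get("fields", []):
        metadata = field.get("metadata", {})
        if any(key.startswith("delta.identity.") for key in metadata):
            names.append(field["name"])
    return names


print(identity_columns(SCHEMA_STRING))  # ['ID']
```

This only inspects top-level fields; nested struct fields would need a recursive walk.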
"schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"ID\",\"type\":\"long\",\"nullable\":true,\"metadata\":{\"delta.identity.start\":1,\"delta.identity.step\":1,\"delta.identity.allowExplicitInsert\":false}}]}",
And further down these:
"writerFeatures": [ "identityColumns", "deletionVectors" ]
So the good news:
duckdb works out of the box on Azure Databricks Volumes
duckdb works with very high Delta protocol versions, including deletion vectors (by the way, I did not test special characters in column names: spaces, {} [] () etc.)
The bad news:
Databricks has some features outside the open source spec (?) that make duckdb fail