I just noticed that the same error occurs using the RKI Covid source with the stream states_history_deaths.
Can confirm, I've also noticed this. It only occurs when using the Azure Blob Storage backend for the Databricks connector, not for S3.
One other problem that is somewhat related to this is when a source schema contains the object data type. This is an invalid data type for Databricks but is very prevalent in many Airbyte sources. These objects should actually be flattened into separate columns (I think Databricks even offers functionality for this), or, if that is not possible, converted into a JSON dump and inserted with the string data type, to be normalized later on.
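For illustration only, here is a minimal Python sketch of what such a workaround could look like at the record level. The function names flatten_record and coerce_objects_to_json are hypothetical and not part of Airbyte:

```python
import json

def flatten_record(record, parent_key="", sep="_"):
    """Recursively flatten nested objects into separate columns,
    e.g. {"address": {"city": "Berlin"}} -> {"address_city": "Berlin"}."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

def coerce_objects_to_json(record):
    """Fallback: serialize remaining object/array values to a JSON string
    so they fit into a string column and can be normalized later."""
    return {key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in record.items()}

print(flatten_record({"id": 1, "address": {"city": "Berlin", "zip": "10115"}}))
# {'id': 1, 'address_city': 'Berlin', 'address_zip': '10115'}
print(coerce_objects_to_json({"id": 1, "tags": ["a", "b"]}))
# {'id': 1, 'tags': '["a", "b"]'}
```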
I have created a PR for the Azure part of this issue: https://github.com/airbytehq/airbyte/pull/21238
Closing because https://github.com/airbytehq/airbyte/pull/21238 is closed
Environment
Current Behavior
The sync job fails with a SQL syntax error (see the attached log below).
Expected Behavior
The sync job should succeed.
Logs
airbyte_log_s3_to_databricks.txt
Steps to Reproduce
Are you willing to submit a PR?
Not sure
Additional Information
I think the problem is that the schema returned from the S3 source uses arrays as datatypes (see airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py, line 183), whereas Databricks expects plain strings (see airbyte-integrations/connectors/destination-databricks/src/main/java/io/airbyte/integrations/destination/databricks/DatabricksAzureBlobStorageStreamCopier.java, line 164). As a result, no datatype is inserted into the SQL statement, which produces the given syntax error. The source schema output from the logs:
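To illustrate the mismatch (the schema below is made up rather than taken from the actual logs, and source_schema, TYPE_MAP, and unwrap are hypothetical stand-ins, not the real connector code):

```python
# JSON Schema sources often emit "type" as an array, e.g. ["null", "integer"],
# to mark a column as nullable.
source_schema = {
    "properties": {
        "id": {"type": ["null", "integer"]},
        "name": {"type": ["null", "string"]},
    }
}

# A naive mapping keyed on single type strings, similar in spirit to what a
# destination might use when building its CREATE TABLE statement.
TYPE_MAP = {"string": "STRING", "integer": "BIGINT",
            "number": "DOUBLE", "boolean": "BOOLEAN"}

columns = []
for name, spec in source_schema["properties"].items():
    json_type = spec["type"]  # a list here, not a plain string
    sql_type = TYPE_MAP.get(json_type) if isinstance(json_type, str) else None
    # With an array type the lookup misses, so the generated DDL contains an
    # empty datatype, and Databricks rejects it with a syntax error.
    columns.append(f"{name} {sql_type or ''}".rstrip())

print(f"CREATE TABLE t ({', '.join(columns)});")
# CREATE TABLE t (id, name);  <- datatypes are silently dropped

# A possible fix: unwrap array types by discarding "null" before the lookup.
def unwrap(json_type):
    if isinstance(json_type, list):
        non_null = [t for t in json_type if t != "null"]
        return non_null[0] if non_null else "string"
    return json_type

fixed = [f"{n} {TYPE_MAP[unwrap(s['type'])]}"
         for n, s in source_schema["properties"].items()]
print(f"CREATE TABLE t ({', '.join(fixed)});")
# CREATE TABLE t (id BIGINT, name STRING);
```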