databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Character encoding changes upon seeding #332

Open jdbodyfelt opened 1 year ago

jdbodyfelt commented 1 year ago

Describe the bug

A UTF-8 encoded CSV file is seeded with dbt seed. On inspecting the loaded table, the character encoding of the string columns appears to have changed.

Steps To Reproduce

Create a UTF-8 CSV containing non-Latin characters (Arabic, Greek, etc.) and seed it with dbt seed.
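A minimal sketch for producing such a seed file. The file name `utf8_check.csv`, the column names, and the sample rows are hypothetical and chosen only for illustration; the sketch also assumes the project's seed directory is `seeds/`.

```python
import csv
from pathlib import Path

# Hypothetical sample rows: (id, language, greeting) with non-Latin text.
ROWS = [
    (1, "Arabic", "مرحبا"),
    (2, "Greek", "Γειά σου"),
    (3, "English", "Hello"),
]


def write_seed(path: Path) -> Path:
    """Write ROWS to `path` as a UTF-8 CSV (no BOM) with a header row."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "language", "greeting"])
        writer.writerows(ROWS)
    return path


def read_seed(path: Path) -> list:
    """Read the CSV back as UTF-8 and return the data rows as tuples."""
    with path.open("r", encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the header row
        return [(int(r[0]), r[1], r[2]) for r in reader]
```

Calling `write_seed(Path("seeds/utf8_check.csv"))` from the project root and then running dbt seed should be enough to compare the loaded table against the rows above. `read_seed` only confirms that the file on disk round-trips locally; it says nothing about what the adapter does with it.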

Expected behavior

I expect a CSV to be seeded exactly as it is written, ESPECIALLY its strings.

Screenshots and log output

CSV: [image]
Injection result: [image]
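Because the screenshots are not reproduced here, the following sketch may help characterise the symptom in text form. It assumes, without confirmation from this report, that the change looks like classic mojibake, where UTF-8 bytes are decoded with a single-byte codec such as Latin-1. The helper names are hypothetical.

```python
from typing import Optional


def as_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Simulate UTF-8 bytes being decoded with the wrong single-byte codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair_mojibake(garbled: str, wrong_codec: str = "latin-1") -> Optional[str]:
    """Try to undo that mis-decoding.

    Returns None when `garbled` cannot have been produced this way,
    e.g. when it already contains characters outside `wrong_codec`.
    Plain ASCII input is returned unchanged.
    """
    try:
        return garbled.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
```

If `repair_mojibake` recovers the original string from a value read back from the seeded table, that would suggest a decoding mismatch somewhere in the load path. If it returns None, the cause is more likely something else.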

System information

The output of dbt --version:

Core:
  - installed: 1.4.6
Plugins:
  - databricks: 1.4.3 

The operating system you're using:
Ubuntu 22.04.1 LTS

The output of python --version: Python 3.10.6

Additional context

It would be great to have a seeds configuration option for column encoding, e.g.

seeds:
  - name: <tableName>
    config:
      columns:
        - name: <columnName>
          dtype: <columnDatatype>
          encoding: <columnEncoding if STRING or VARCHAR>

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.