TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.
https://sqlmesh.com
Apache License 2.0
1.6k stars 142 forks source link

Seed files do not correctly display Danish characters æ, ø, å when exported to MSSQL #2528

Closed jonaswvd closed 4 months ago

jonaswvd commented 4 months ago

UTF-8 encoded csv files with æøå, are displayed as weird characters æ, ø etc. in the database when using MSSQL.

I believe the issue is due to use of the pymssql's bulk_copy() method at: https://github.com/TobikoData/sqlmesh/blob/main/sqlmesh/core/engine_adapter/mssql.py#L180

How to reproduce

-- Assumes a table exist like:
CREATE TABLE [sqlmesh__portman].[test_bulk_copy](
       [transaction_type_key] [bigint] NULL,
       [transaction_code] [varchar](max) NULL,
       [transaction_type] [varchar](max) NULL,
       [alternative_transaction_subtype] [varchar](max) NULL,
       [investment_activity_type] [varchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
import pymssql

mssql_user = "user"
mssql_password = "password"
host = "172.20.0.9"
port = 1435

conn = pymssql.connect(
    host=host,
    port=port,
    user=mssql_user,
    password=mssql_password,
    database="datafoundations_test",
    charset="UTF-8"  # charset=None neither works
)

rows = [
    (19, 'RG', 'Værdiregulering', ' ', None), 
    (21, '00', 'Køb af aktier', None, None)
]

conn.bulk_copy('sqlmesh__portman.test_bulk_copy', rows)
conn.commit()
conn.close()

A workaround

I can get bulk_copy() to work if charset="cp1252" or charset="ISO-8859-1" is set on the connection:

import pymssql

mssql_user = "user"
mssql_password = "password"
host = "172.20.0.9"
port = 1435

conn = pymssql.connect(
host=host,
port=port,
user=mssql_user,
password=mssql_password,
database="datafoundations_test",
charset="cp1252" or charset="ISO-8859-1",
)

rows = [
    (19, 'RG', 'Værdiregulering', ' ', None), 
    (21, '00', 'Køb af aktier', None, None),
]

conn.bulk_copy('sqlmesh__portman.test_bulk_copy', rows)
conn.commit()
conn.close()

For charset="UTF-8" I haven't found a solution, this might be due to bulk_copy problems: https://github.com/pymssql/pymssql/issues/725

Background

There seems to be multiple configuration options to consider:

I'm not completely sure that they are all relevant, as I haven't found any solutions with UTF-8 charset. Similarly, specifying either "cp1252" or charset="ISO-8859-1" works independently of the datatype and collation type.

Nevertheless the following concerning MSSQL datatype encoding and collation for MSSQL might be relevant: I believe SQLmesh converts columns of text to varchar columns in MSSQL. Such does not support unicode unless with an UTF-8 enabled collation: https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16#unicode-data-types

I have tested the charset="UTF-8" with all combinations of the following with no luck:

Possible solutions (https://github.com/pymssql/pymssql/issues/725):

We use the Danish_Norwegian_CI_AS on our server, so all new tables and columns will be created with non-unicode supporting varchar types by default. So a solution that works independently of collation and with varchar might be optimal.

treysp commented 4 months ago

Hello - could you try specifying the charset key in your connection configuration to one of the working character sets? https://sqlmesh.readthedocs.io/en/stable/integrations/engines/mssql/

JonasDavisFondene commented 4 months ago

Oh wow, there's a config for that! :) I had it hardcoded here in my local files for a temp workaround: https://github.com/TobikoData/sqlmesh/blob/main/sqlmesh/core/config/connection.py#L980

I can confirm that setting the charset = "cp1252" in the gateway config works as well. So I guess there's actually not a bug, and this issue can be closed.

Although, initially I would never had known that to fix the issue I just had to specify a little thing in the config, took me quite some time to figure out what was going on. So maybe some kind of warning or hint somewhere in the documentation for windows mssql users that specifying charset might be necessary?