datafold / data-diff

Compare tables within or across databases
https://docs.datafold.com
MIT License
2.93k stars 262 forks source link

[SQL Server] data diff fails with case sensitive schemas and objects #884

Closed MiConnell closed 4 months ago

MiConnell commented 5 months ago

Describe the bug when running data diff comparing sql server to redshift, I'm getting the error:

data_diff.databases.base.QueryError: ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name '<database>.information_schema.columns'. (208) (SQLExecDirectW)")

The actual object name is <database>.INFORMATION_SCHEMA.COLUMNS

updating this line to uppercase solves the problem.

info_schema_path = ["INFORMATION_SCHEMA", "COLUMNS"]

Make sure to include the following (minus sensitive information): command:

data-diff -d  <redshift_uri>   <schema>.<table> <mssql_uri>   dbo.<TABLE>   -k <KEY_COLUMN> 

and

data-diff -d  <redshift_uri>   <schema>.<table> <mssql_uri>   dbo.<TABLE>   -k <KEY_COLUMN> --case-sensitive

output:

16:58:27 INFO     [Redshift] Starting a threadpool, size=1.                                                                                                       base.py:1237
         DEBUG    Running SQL (Redshift):                                                                                                                          base.py:982
                  SET TIME ZONE 'UTC'                                                                                                                                         
         INFO     [MsSQL] Starting a threadpool, size=1.                                                                                                          base.py:1237
         DEBUG    Database 'MsSQL(default_schema='dbo', _interactive=False, is_closed=False, _dialect=Dialect(_prevent_overflow_when_concat=False),            _connect.py:300
                  thread_count=1, _init_error=None, _queue=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7fd5d78c0fd0>,                                           
                  thread_local=<_thread._local object at 0x7fd5d7558a90>, default_database='database', _args={<credentials>, 'driver': '{ODBC Driver 18 for SQL                  
                  Server}', 'TrustServerCertificate': 'yes'}, _mssql=None)' does not allow setting timezone. We recommend making sure it's set to 'UTC'.                      
         INFO     Using algorithm 'hashdiff'.                                                                                                                  __main__.py:394
         DEBUG    Running SQL (Redshift): ('schema', 'table')                                                                                       base.py:980
                  SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM information_schema.columns WHERE table_name =                      
                  'table' AND table_schema = 'schema'                                                                                                          
         DEBUG    Running SQL (MsSQL): ('dbo', 'TABLE')                                                                                                base.py:980
                  SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale, collation_name FROM                                                    
                  [database].information_schema.columns WHERE table_name = 'TABLE' AND table_schema = 'dbo'                                                
         ERROR    ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name 'database.information_schema.columns'.   __main__.py:344
                  (208) (SQLExecDirectW)")                                                                                                                                    
Traceback (most recent call last):
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/mssql.py", line 227, in _query_cursor
    return super()._query_cursor(c, sql_code)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/base.py", line 1173, in _query_cursor
    c.execute(sql_code)
pyodbc.ProgrammingError: ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name 'database.information_schema.columns'. (208) (SQLExecDirectW)")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/ingestion/venv/bin/data-diff", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/__main__.py", line 342, in main
    _data_diff(dbt_project_dir=project_dir_override, dbt_profiles_dir=profiles_dir_override, state=state, **kw)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/__main__.py", line 571, in _data_diff
    schemas = list(differ._thread_map(_get_schema, safezip(dbs, table_paths)))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/__main__.py", line 75, in _get_schema
    return db.query_table_schema(table_path)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/base.py", line 1048, in query_table_schema
    rows = self.query(self.select_table_schema(path), list, log_message=path)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/base.py", line 996, in query
    res = self._query(sql_code)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/base.py", line 1248, in _query
    return r.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/base.py", line 1254, in _query_in_worker
    return self._query_conn(self.thread_local.conn, sql_code)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/base.py", line 1188, in _query_conn
    return apply_query(callback, sql_code)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/base.py", line 211, in apply_query
    return callback(sql_code)
  File "/home/ubuntu/ingestion/venv/lib/python3.10/site-packages/data_diff/databases/mssql.py", line 229, in _query_cursor
    raise QueryError(e)
data_diff.databases.base.QueryError: ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name 'database.information_schema.columns'. (208) (SQLExecDirectW)")

If possible, please paste these as text, and not a screenshot.

Describe the environment

Ubuntu@data_diff v0.11.1

glebmezh commented 4 months ago

Hi @MiConnell,

Thank you for trying out data-diff and for taking the time to open this issue. We made a hard decision to sunset the data-diff package and won't provide further development or support. Diffing functionality will continue to be available in Datafold Cloud. We have completely rewritten the diffing engine in the cloud over the past few months and have solved the fundamental issues with the original algorithm used in the data-diff package. We also substantially improved the experience for SQL Server 2019+. Feel free to take it for a trial or contact us at support@datafold.com if you have any questions.

-Gleb