
Error loading data from MSSQL when using a compound merge key #1874

Open · mattgyton opened this issue 3 days ago

mattgyton commented 3 days ago

dlt version

1.0.0

Describe the problem

After upgrading from dlt v0.4.9 to v1.0.0, I've started getting the following error when loading data from MSSQL using a compound merge key (e.g. ("FirstID", "SecondID")):

<class 'dlt.common.schema.exceptions.UnboundColumnException'> In schema: mssql_source: The column _first_id_second_idx in table test_import did not receive any data during this load. It is marked as non-nullable merge key and it must have values. This can happen if you specify the column manually, for example using the 'merge_key', 'primary_key' or 'columns' argument but it does not exist in the data.
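That column name suggests the quoted string is being normalized as a single identifier rather than parsed as two columns. A minimal sketch of the idea (a hypothetical stand-in for a snake_case naming convention, not dlt's actual implementation):

import re

# Hypothetical normalizer mimicking a snake_case naming convention, only to
# show how one quoted string collapses into one mangled column name.
def normalize_identifier(name: str) -> str:
    # CamelCase -> snake_case
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # collapse any run of non-alphanumeric characters to a single underscore
    return re.sub(r"[^a-zA-Z0-9]+", "_", name).lower()

print(normalize_identifier('("FirstID", "SecondID")'))
# prints '_first_id_second_id_': one column name, not two merge keys

If the real convention also replaces the trailing underscore (the final x in _first_id_second_idx looks like that), it would produce exactly the column named in the error.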

I've created a basic test script that replicates the error. The same script runs fine under v0.4.9; rerunning it under v1.0.0 (with the sql_database import path updated) produces the error. I've hard-coded all the values so you can see exactly what I'm using.

Test SQL table:

USE Sandbox
GO

CREATE TABLE dbo.DltTest
(
    FirstID INT NOT NULL,
    SecondID INT NOT NULL,
    TestValue VARCHAR(100) NULL,
    UpdatedDate DATETIME NOT NULL,
    CONSTRAINT [PK_DltTest] PRIMARY KEY CLUSTERED (FirstID, SecondID)
)
GO

INSERT INTO dbo.DltTest (FirstID, SecondID, TestValue, UpdatedDate) VALUES
(1, 1, 'Test1', GETDATE()),
(1, 2, 'Test2', GETDATE()),
(1, 3, 'Test3', GETDATE()),
(2, 1, 'Test4', GETDATE()),
(2, 2, 'Test5', GETDATE()),
(3, 1, 'Test6', GETDATE())
GO

DLT script:

import dlt
from dlt.sources.credentials import ConnectionStringCredentials
from dlt.sources.sql_database import sql_table
# from sql_database import sql_table  # import path used under v0.4.9
from datetime import datetime, timezone

@dlt.source
def mssql_source(
    username=dlt.secrets.value,
    password=dlt.secrets.value,
    host=dlt.secrets.value,
    port=dlt.secrets.value,
):

    # Build the MSSQL connection string from the individual secret values
    credentials = ConnectionStringCredentials()

    credentials.drivername = "mssql+pymssql"
    credentials.username = username
    credentials.password = password
    credentials.host = host
    credentials.port = port
    credentials.database = "Sandbox"

    connection_string = credentials.to_native_representation()

    # Read DltTest incrementally on UpdatedDate using the pyarrow backend
    data = sql_table(
        credentials=connection_string,
        table="DltTest",
        schema="dbo",
        backend="pyarrow",
        chunk_size=100000,
        incremental=dlt.sources.incremental(
            "UpdatedDate", 
            initial_value=datetime(1900, 1, 1, tzinfo=timezone.utc)
        ),
    )

    data.apply_hints(
        write_disposition="merge",
        merge_key='("FirstID", "SecondID")',
    )
    yield data

pipeline = dlt.pipeline(
    pipeline_name="DltCompoundKeyTest",
    destination="snowflake",
    dataset_name="DLT_TEST",
)
pipeline.run(mssql_source())
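As far as I can tell, merge_key also accepts a sequence of column names, which keeps the two columns distinct. A sketch of what I would have expected to write (untested whether this restores the v0.4.9 behavior):

data.apply_hints(
    write_disposition="merge",
    merge_key=["FirstID", "SecondID"],  # list form: each column normalized on its own
)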

Expected behavior

Data is loaded incrementally from MSSQL into Snowflake using the compound merge key

Steps to reproduce

See scripts above

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

sql_database (MSSQL)

dlt destination

Snowflake

Other deployment details

No response

Additional information

No response