bug: Data type of MONEY not handled properly

BuzzCutNorman commented 2 years ago

Hi Team, We are trying to explore meltano as EL solution, As our first EL we are trying to dump data from MSSQL to S3-Parquet replication, tried both FULL_TALBLE and INCREMENTAL but getting same error in both scenarios as below. Log level is DEBUG and we are unable to identity the root cause. Can you please help and let us know if we are doing anything wrong. --variant buzzcutnorman

column type is money in Db

022-12-06T07:40:31.042315Z [info     ] jsonschema.exceptions.ValidationError: 5000.0 is not of type 'string', 'null' cmd_type=elb consumer=True name=target-s3-parquet producer=False stdio=stderr string_id=target-s3-parquet
2022-12-06T07:40:31.042432Z [info     ]                                cmd_type=elb consumer=True name=target-s3-parquet producer=False stdio=stderr string_id=target-s3-parquet
2022-12-06T07:40:31.042547Z [info     ] Failed validating 'type' in schema['properties']['Relamt']: cmd_type=elb consumer=True name=target-s3-parquet producer=False stdio=stderr string_id=target-s3-parquet
2022-12-06T07:40:31.042658Z [info     ]     {'type': ['string', 'null']} cmd_type=elb consumer=True name=target-s3-parquet producer=False stdio=stderr string_id=target-s3-parquet
2022-12-06T07:40:31.042775Z [info     ]                                cmd_type=elb consumer=True name=target-s3-parquet producer=False stdio=stderr string_id=target-s3-parquet
2022-12-06T07:40:31.042885Z [info     ] On instance['Relamt']:         cmd_type=elb consumer=True name=target-s3-parquet producer=False stdio=stderr string_id=target-s3-parquet
2022-12-06T07:40:31.042994Z [info     ]     5000.0                     cmd_type=elb consumer=True name=target-s3-parquet producer=False stdio=stderr string_id=target-s3-parquet
2022-12-06T07:40:31.043123Z [info     ] {"type": "RECORD", "stream": "dbo-Ledger1", "record": {"Bnkname": "dsfg", "Brnname": "", "dfg": "C", "Ddno": "", "Dddt": "2016-12-23T14:00:15.273000+00:00", "Reldt": "2016-12-23T00:00:00+00:00", "Relamt": 5000.0, "Refno": "            ", "Receiptno": 0, "Vtyp": 9, "Vno": "2345", "Lno": 2, "Drcr": "C", "Booktype": "01", "Micrno": 0, "Slipno": 0, "Slipdate": "1900-01-01T00:00:00+00:00", "Chequeinname": "Test Check", "Chqprinted": 0, "Clear_mode": " ", "L1_SNo": 108}, "time_extracted": "2022-12-06T07:40:31.002550+00:00"} cmd_type=elb consumer=False name=tap-mssql producer=True stdio=stdout string_id=tap-mssql (edited)

BuzzCutNorman commented 2 years ago

I created a reproduction script to create a small test table to test against. here is the script I used.

use [testdata]
go
/*********************************
Create the simple test table
in an MSSQL database
*********************************/
DROP TABLE IF EXISTS [MoneyTrouble];
CREATE TABLE MoneyTrouble (
    Id int IDENTITY(1,1) PRIMARY KEY,
    Item varchar(255) NOT NULL,
    Realmt MONEY,
);
DROP TABLE IF EXISTS [SmallMoneyTrouble];
CREATE TABLE SmallMoneyTrouble (
    Id int IDENTITY(1,1) PRIMARY KEY,
    Item varchar(255) NOT NULL,
    Realmt MONEY,
);
go
/*********************************
Insert some test data into the
test table
*********************************/
INSERT INTO [testdata].[dbo].[MoneyTrouble]
           ([Item]
           ,[Realmt])
     VALUES
           ('Park Ticket', 500.00)
           ,('Churro', 20.00)
           ,('Lunch', 134.48)
           ,('Character Signature', NULL)
;
INSERT INTO [testdata].[dbo].[SmallMoneyTrouble]
           ([Item]
           ,[Realmt])
     VALUES
           ('Water', 5.89)
           ,('Sticker', 2.35)
           ,('Candy', .75)
           ,('Taking a break', NULL)
;
go
/********************************
Query the new test tables
********************************/
select * from MoneyTrouble;
select * from SmallMoneyTrouble;

BuzzCutNorman commented 2 years ago

After looking at this I found that the function singer-sdk.typing.to_jsonschema_type() is correctly choosing a jasonscheme data type of string as a default. Unfortunately, the money amount does not have quotes around it in the output messages since sqlalchmeny sees it as a numeric.

{"Id": 1, "Item": "Park Ticket", "Realmt": 500.0000} .

I agree with @cwegener suggestion via slack to type this as a number JSON type. I found that number doesn't have a concept of scale. This means anything ending in zero may get stripped an example would be 1.00 or 1.10 will become 1.0 and 1.1. I also saw an instance in which a target stripped off all the decimals and rounded up so 0.75 became 1.

Despite my findings during testing, I still like MONEY and SMALLMONEY turning into a JSON number since this matches up with how SQLAlchemy deals with MONEY and SMALLMONEY data types.

cwegener commented 2 years ago

Would a fixed scale using "multipleOf": 0.01 (or "multipleOf": 0.0001) work? That's what Rob is doing as well. https://github.com/wintersrd/pipelinewise-tap-mssql/blob/e61e8687a552c8a6c67946b1c90f44745da90cb9/tests/test_tap_mssql.py#L96-L99

On a larger scale, a general decimal precision functionality for number types could probably be provided by the SDK.

cwegener commented 2 years ago

On second thought, I don't think multipleOf actually solves for the specific problem you described.

BuzzCutNorman commented 2 years ago

Would a fixed scale using "multipleOf": 0.01 (or "multipleOf": 0.0001) work? That's what Rob is doing as well.

I came across that in a StackOverflow example and gave it a try. The test amount of 134.48 kept blowing it up with a not a multiple of 0.0001 vaidation error. It looked like the same error Harshit mention they had gotten when trying the wintersrd tap-mssql so I abandoned adding it.

cwegener commented 2 years ago

Would a fixed scale using "multipleOf": 0.01 (or "multipleOf": 0.0001) work? That's what Rob is doing as well.

I came across that in a StackOverflow example and gave it a try. The test amount of 134.48 kept blowing it up with a not a multiple of 0.0001 vaidation error. It looked like the same error Harshit mention they had gotten when trying the wintersrd tap-mssql so I abandoned adding it.

Yup. I think the more appropriate topic is this one: https://github.com/json-schema-org/json-schema-vocabularies/issues/45

I completely glanced over the fact that multipleOf only has the purpose of checking that the data already has the right scale, which is not the problem you had.

BuzzCutNorman commented 2 years ago

On second thought, I don't think multipleOf actually solves for the specific problem you described.

We lose some things in the process of going from SQL -> Python -> JSON -> Python -> SQL . In this instance we lose the precision and scale information. I think I traced the target side back to this line in the singer-sdk.typing.to_sql_type() function.

if _jsonschema_type_check(jsonschema_type, ("number",)):
        return cast(sqlalchemy.types.TypeEngine, sqlalchemy.types.DECIMAL())

If we had the precision and scale in the discovery schema, we could add it back in.

BuzzCutNorman / tap-mssql

bug: Data type of MONEY not handled properly #8