MeltanoLabs / Singer-Working-Group

Working group for ongoing development and iteration of the Singer Spec, the de-facto protocol for open source data connectors. Please use "Issues" to create discussion items - or use "Discussions" for general questions.
Apache License 2.0
13 stars 4 forks source link

Handling Decimal / Numeric Data - Singer.Decimal Standards #32

Open s7clarke10 opened 1 year ago

s7clarke10 commented 1 year ago

There has been some discussion around how to handle Floats, Decimal and Numeric data.

We were finding that decimal data was ending up as floats in target-snowflake from various database taps. The taps were opinionated using the MultipleOf etc, but the results were in the wrong data type and we were finding there was truncation / rounding. Part of the issue was to do with the Python Language not supporting large numeric data leading to truncation.

We have learned about the singer.decimal format which doesn't take an opinion on the output format, leaving the target to determine how to treat the string representation of the numeric data. The problem with this is you loose a lot of automation, like the automated creation of tables or addition of columns with the correct datatype.

The problem is there needs to be further information provided to allow the correct creation of datatypes by the target as part of the schema message i.e. information about the precision and scale if it is available at source.

We have elected to add some AdditionalProperties - providing the Scale and Precision to help with decimal data. See the slack message below.

https://app.slack.com/client/TFG99TU9K/search/search-eyJkIjoiZGVjaW1hbCUyMGRhdGEiLCJxIjoiVTAyQjMxUUJTVUIiLCJyIjoiZGVjaW1hbCUyMGRhdGEifQ==/thread/C013EKWA2Q1-1650990076.383639

While we have taken an opinionate view on how to implement this, it would be great if there was some standards around how to navigate the challenges of emitting numeric data without data loss. We would also be interested in views around our approach to dealing with this issue.

mjsqu commented 1 year ago

Just to add to @s7clarke10's post, as a colleague working on the same platforms. The way we have our stack configured, singer.decimal treats values at each stage in the following ways: