datamill-co / target-redshift

A Singer.io Target for Redshift
MIT License

Support for Redshift types like NUMERIC(20, 0) #39

Open joeschmid opened 4 years ago

joeschmid commented 4 years ago

Thanks for the work on this project! We're just trying out Singer for moving data from MySQL to Redshift. In MySQL we have a column type of bigint(18) unsigned. Some values in this column don't fit in Redshift's bigint column type and we get errors like Overflow (Long valid range -9223372036854775808 to 9223372036854775807)

Typically we declare a Redshift column as NUMERIC(20, 0) to hold these values. Is there a way to tell target-redshift to use that type for a particular Redshift column?
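To make the overflow concrete, here is a minimal sketch of the ranges involved. It assumes the MySQL column can hold the full unsigned 64-bit range (in MySQL the `(18)` in `bigint(18)` is a display width, not a limit on the stored value); the constant and function names are illustrative only and are not part of target-redshift.

```python
# Signed 64-bit range, which is what Redshift's BIGINT can store.
SIGNED_BIGINT_MIN = -(2 ** 63)
SIGNED_BIGINT_MAX = 2 ** 63 - 1

# Largest value an unsigned 64-bit integer (MySQL BIGINT UNSIGNED) can hold.
UNSIGNED_BIGINT_MAX = 2 ** 64 - 1

# Largest magnitude NUMERIC(20, 0) can hold: 20 digits, none after the point.
NUMERIC_20_0_MAX = 10 ** 20 - 1


def fits_signed_bigint(value: int) -> bool:
    """Return True if value fits in a signed 64-bit BIGINT."""
    return SIGNED_BIGINT_MIN <= value <= SIGNED_BIGINT_MAX


def fits_numeric_20_0(value: int) -> bool:
    """Return True if value fits in NUMERIC(20, 0)."""
    return abs(value) <= NUMERIC_20_0_MAX


if __name__ == "__main__":
    print(fits_signed_bigint(UNSIGNED_BIGINT_MAX))  # overflows BIGINT
    print(fits_numeric_20_0(UNSIGNED_BIGINT_MAX))   # fits NUMERIC(20, 0)
```

The largest unsigned 64-bit value has 20 decimal digits, which is why a precision of 20 with a scale of 0 is enough to hold it.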

AlexanderMann commented 4 years ago

@joeschmid thanks for the kind words! We're always looking to make Target-Redshift better, so we really appreciate questions like this.

There is currently no supported way to do what you're asking. There have been conversations in the past about building up tooling to detect data widths so that we can leverage tighter constraints inside Redshift and avoid penalties for things like TEXT columns everywhere, instead of VARCHAR(20), etc.

There is some work coming down the pipe which will make a number of these improvements simpler in the future, but what the "future" here means is pretty up in the air.

Given this, I don't think the most expedient way for you to resolve your issue is to wait for this feature.

I'd be happy to help walk you through what changes I would expect you'd need to make to get things working if that's useful to you?

joeschmid commented 4 years ago

@AlexanderMann thanks very much for the update and explanation. That all makes sense. If you wouldn't mind walking through the changes to get this scenario working I'd appreciate it. (And maybe any others who come across similar issues would see the explanation here and it would help them out.)

AlexanderMann commented 4 years ago

@joeschmid no problem. So I will start by saying that the way to "get this working" is to fork this repo, and start trying to get what you're after working. I'm also not sure if it'll "work" or end up being a 🐰 🕳

Worth noting, Stitch also doesn't "support" this: https://www.stitchdata.com/docs/destinations/redshift/#data-limits

Integer range: -9223372036854775808 to 9223372036854775807. Integer values outside of this range will be rejected and logged in the _sdc_rejected table.

Easiest Option

Make all integers NUMERIC(20, 0)

Pros

Prolly be straightforward and simple.

Cons

Column widths will balloon for all integers. Redshift (last I checked) uses the full width for a column for all values in the column, whereas PostgreSQL uses the width of the data in the row to consume memory.

Changes

In these lines, you're just going to make a mapping from JSONSchema's integer type to Redshift's NUMERIC(20, 0): https://github.com/datamill-co/target-redshift/blob/master/target_redshift/redshift.py#L97-L118

For more examples of what that'd look like, check in here: https://github.com/datamill-co/target-postgres/blob/master/target_postgres/postgres.py#L806-L870
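As a rough illustration of the kind of mapping described above, here is a standalone sketch. It is not the code at the linked lines: the function name `json_schema_to_redshift_type`, the `SQL_TYPES` table, and the choices for non-integer types are all assumptions made for the example. The only point it is meant to show is JSON Schema `integer` mapping to `NUMERIC(20, 0)` (precision 20, scale 0) instead of `BIGINT`.

```python
# Hypothetical mapping from JSON Schema primitive types to Redshift column
# types. Only the "integer" entry reflects the change discussed in this
# thread; the other entries are placeholders for illustration.
SQL_TYPES = {
    "integer": "NUMERIC(20, 0)",  # instead of BIGINT, to hold unsigned 64-bit values
    "number": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "string": "VARCHAR(65535)",
}


def json_schema_to_redshift_type(schema: dict) -> str:
    """Translate a simple JSON Schema fragment into a Redshift column type.

    Accepts {"type": "integer"} as well as {"type": ["null", "integer"]}.
    A schema that does not allow "null" gets a NOT NULL constraint.
    """
    json_type = schema.get("type")
    types = [json_type] if isinstance(json_type, str) else list(json_type or [])

    nullable = "null" in types
    concrete = [t for t in types if t != "null"]
    if len(concrete) != 1 or concrete[0] not in SQL_TYPES:
        raise ValueError(f"Unsupported JSON Schema type: {json_type!r}")

    sql_type = SQL_TYPES[concrete[0]]
    return sql_type if nullable else sql_type + " NOT NULL"
```

For example, `json_schema_to_redshift_type({"type": ["null", "integer"]})` returns `NUMERIC(20, 0)`. In a fork, the equivalent change would go wherever the target already translates JSON Schema types to SQL types, and the reverse mapping (SQL type back to JSON Schema) would likely need the matching update.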

awm33 commented 4 years ago

@joeschmid I'm not sure if you resolved this, but a hack (for you and for anyone looking at this issue) would be to create a view where that column is a text/string type, then use a SQL transform to parse that into a custom numeric type after replication.
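A minimal sketch of that workaround, assuming the view is created on the MySQL side and the transform runs in Redshift after the load. The helper names, the `_replication` and `_typed` suffixes, and the `orders` / `big_id` / `id` identifiers in the usage below are all hypothetical; the helpers only build SQL strings and do not connect to any database.

```python
def source_view_sql(table: str, column: str, other_columns: list) -> str:
    """Build MySQL DDL for a view exposing `column` as text.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    cols = ", ".join(other_columns + [f"CAST({column} AS CHAR) AS {column}"])
    return f"CREATE VIEW {table}_replication AS SELECT {cols} FROM {table}"


def redshift_transform_sql(table: str, column: str, other_columns: list) -> str:
    """Build Redshift SQL that parses the replicated text back into a number.

    Assumes the replicated table in Redshift is named `<table>_replication`.
    """
    cols = ", ".join(
        other_columns + [f"CAST({column} AS NUMERIC(20, 0)) AS {column}"]
    )
    return (
        f"CREATE TABLE {table}_typed AS "
        f"SELECT {cols} FROM {table}_replication"
    )


if __name__ == "__main__":
    print(source_view_sql("orders", "big_id", ["id"]))
    print(redshift_transform_sql("orders", "big_id", ["id"]))
```

The tap would then replicate the view rather than the base table, so the oversized values travel as strings and never pass through a BIGINT column.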