databricks / spark-redshift

Redshift data source for Apache Spark

Dealing with StringType to varchar(256) limit in python #312

Open karanveerm opened 7 years ago

karanveerm commented 7 years ago

Hi, I have seen #118 and have also read the documentation, which says:

Note: Due to limitations in Spark, metadata modification is unsupported in 
the Python, SQL, and R language APIs.

While I understand this limitation, I'm wondering how a Python user should deal with DataFrames that have text columns with values exceeding 256 characters.

I'm trying to save a DataFrame to Redshift where a column of type string has entries as long as 1000 characters; any hacks / workarounds would be appreciated :)

JoshRosen commented 7 years ago

There's probably some horrible hack to call into the Java version of the API through Py4J and then wrap the result back into a Python schema or DataFrame.
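
A rough, untested sketch of that Py4J route, assuming Spark 2.x internals (`_jdf`, `sql_ctx`) and spark-redshift's `maxlength` metadata key; the `with_maxlength` helper is purely illustrative, not part of any API:

```python
# Rough sketch: build JVM-side column metadata carrying "maxlength",
# re-alias the Java Column with it, and wrap the Java DataFrame back into Python.
from pyspark.sql import DataFrame

def with_maxlength(df, col_name, length):
    jvm = df.sql_ctx._sc._jvm
    # org.apache.spark.sql.types.Metadata built from a JSON string
    j_meta = jvm.org.apache.spark.sql.types.Metadata.fromJson(
        '{"maxlength": %d}' % length)
    # Column.as(alias, metadata); `as` is a Python keyword, hence getattr
    j_col = getattr(df._jdf.col(col_name), "as")(col_name, j_meta)
    return DataFrame(df._jdf.withColumn(col_name, j_col), df.sql_ctx)

# usage: df = with_maxlength(df, "description", 1000)
```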

karanveerm commented 7 years ago

Hm, okay, I'll try to see if I can find a hack that works using Py4J for now.

Do you think it might make sense for the spark-redshift library to support this via the parameter map? Perhaps something as simple as varchar_length (which would be 256 by default) that would apply to all string columns?
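
If that were adopted, usage might look something like this (purely hypothetical; `varchar_length` is the option proposed above, not something spark-redshift implements):

```python
# Hypothetical: the proposed varchar_length option, applied to all string
# columns. Connection settings are placeholders.
(df.write
   .format("com.databricks.spark.redshift")
   .option("url", jdbc_url)
   .option("dbtable", "my_table")
   .option("tempdir", "s3a://my-bucket/tmp")
   .option("varchar_length", 1024)   # proposed default: 256
   .save())
```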

schnie commented 7 years ago

@karanveerm did you ever find a workaround to this?

ThomRogers commented 7 years ago

I have the same problem as @karanveerm. Has anybody solved this yet?

EbadAE commented 7 years ago

Same problem here. Any working solution would be appreciated.

Wendtaments commented 6 years ago

Same problem...

pallavi commented 5 years ago

spark-redshift now lets you do this in version runtime-3.0. Here's an example: https://gist.github.com/pallavi/f83a45308ba8387f6b227c28aa209077
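
For context, this appears to rely on PySpark's `Column.alias` accepting a `metadata` dict (added in Spark 2.2, which Databricks Runtime 3.0 ships). A minimal sketch with placeholder column and connection names follows; the linked gist is the authoritative example:

```python
# Sketch: attach spark-redshift's "maxlength" column metadata from Python.
# Assumes Spark 2.2+; column and connection names are placeholders.
from pyspark.sql.functions import col

df = df.withColumn(
    "description",
    col("description").alias("description", metadata={"maxlength": 1024}))

(df.write
   .format("com.databricks.spark.redshift")
   .option("url", jdbc_url)                  # placeholder JDBC URL + credentials
   .option("dbtable", "my_table")
   .option("tempdir", "s3a://my-bucket/tmp")
   .mode("error")
   .save())
```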

ar-ms commented 4 years ago

I tried using the createTableColumnTypes option; it seems to work with Postgres but not with spark-redshift, which appears to ignore the option. The solution provided by @pallavi works perfectly.
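
For comparison, a sketch of `createTableColumnTypes` on Spark's built-in JDBC writer (a standard option since Spark 2.2 that works against Postgres; per the comment above, spark-redshift appears to ignore it). Connection details are placeholders:

```python
# createTableColumnTypes on the generic JDBC writer; reportedly ignored by
# spark-redshift. URL, credentials, and table name are placeholders.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://host:5432/mydb")
   .option("user", "username")
   .option("password", "password")
   .option("dbtable", "my_table")
   .option("createTableColumnTypes", "description VARCHAR(1024)")
   .mode("overwrite")
   .save())
```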