I tried the parquetizer with data from Japan: http://download.geofabrik.de/asia/japan.html For all regions, the array values in the tags are output as binaries instead of strings. Transforming the nested dataframes is very tricky.
Is there a way to supply a working Python / Scala schema for creating dataframes from the parquet files, or to fix the issue at the source?
I think the binary array is okay; it saves memory, and you can easily convert the values to strings using rdd.map or a DataFrame UDF in Spark. For example:
// Convert each (binary key, binary value) pair in the tags collection to strings.
val tagsStringMap = tags.map { x =>
  (new String(x._1), new String(x._2))
}
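For the DataFrame UDF route, a minimal sketch could look like the following. It assumes the tags column is an array of structs with binary "key" and "value" fields, and that `df` is the DataFrame read from one of the generated parquet files; adjust the names to your actual schema.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Sketch only: decode binary tag keys/values to UTF-8 strings.
// Field names "key"/"value" and the DataFrame `df` are assumptions.
val decodeTags = udf { tags: Seq[Row] =>
  tags.map { t =>
    (new String(t.getAs[Array[Byte]]("key"), "UTF-8"),
     new String(t.getAs[Array[Byte]]("value"), "UTF-8"))
  }
}

val decoded = df.withColumn("tags", decodeTags(col("tags")))
decoded.printSchema()  // tags should now be an array of string pairs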
Hi @elgrangrifon,
If you are using Spark, then the easiest way to avoid this inconvenience is to use the following configuration:
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
Let us know if this works for you. Changing the actual type would introduce a change that is not backward compatible.
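For reference, a minimal sketch of how this might look when reading one of the generated files (the file name is illustrative, and the `sqlContext` usage assumes a Spark 2.x-style setup):

// Set the flag before reading so Parquet BINARY columns come back as strings.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Illustrative file name; the parquetizer writes separate node/way/relation files.
val ways = sqlContext.read.parquet("japan-latest.osm.pbf.way.parquet")
ways.printSchema()  // tag keys and values should now appear as string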