adrianulbona / osm-parquetizer

A converter for the OSM PBFs to Parquet files
http://adrianulbona.github.io/2016/12/18/osm-parquetizer.html
Apache License 2.0

Tag Array outputs to Binary #9

Closed: elgrangrifon closed this issue 4 years ago

elgrangrifon commented 4 years ago

I tried the parquetizer with data from Japan: http://download.geofabrik.de/asia/japan.html. For all regions, the array values in the tags are output as binaries instead of strings. Transforming the nested dataframes is very tricky.

Is there a way to supply a working Python / Scala schema for creating dataframes from the Parquet files, or to fix the issue at the source?

ericsun95 commented 4 years ago

I think a binary array is okay: it saves memory, and you can easily convert the values to strings using rdd.map or a DataFrame UDF in Spark. For example:

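// Decode each (key, value) pair of byte arrays, assuming UTF-8 encoded text.
val tagsStringMap = tags.map { case (key, value) =>
  (new String(key, "UTF-8"), new String(value, "UTF-8"))
}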
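
To make the conversion concrete without a Spark dependency, here is a minimal sketch. It assumes each tag arrives as a pair of UTF-8 encoded byte arrays; `TagDecoding` and `decodeTags` are hypothetical names used for illustration only and are not part of osm-parquetizer.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object TagDecoding {
  // Hypothetical helper: turns (key, value) byte-array pairs into a String map.
  // Assumption: both keys and values are UTF-8 encoded text.
  def decodeTags(tags: Seq[(Array[Byte], Array[Byte])]): Map[String, String] =
    tags.map { case (key, value) =>
      (new String(key, UTF_8), new String(value, UTF_8))
    }.toMap
}
```

The same function body could be wrapped in an rdd.map or registered as a UDF; passing the charset explicitly avoids depending on the platform default encoding, which matters for non-ASCII tag values such as Japanese place names.
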

adrianulbona commented 4 years ago

Hi @elgrangrifon,

If you are using Spark, then the easiest way to avoid this inconvenience is to use the following configuration: sqlContext.setConf("spark.sql.parquet.binaryAsString", "true").
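
For newer Spark versions that use SparkSession rather than sqlContext, the same setting can be supplied when the session is built. This is only a sketch under assumptions: the file path and the object name `ReadOsmParquet` are hypothetical placeholders, and it needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object ReadOsmParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("osm-parquet-binary-as-string")
      .master("local[*]")
      // Ask Spark to interpret Parquet binary columns as strings.
      .config("spark.sql.parquet.binaryAsString", "true")
      .getOrCreate()

    // Hypothetical path: replace with a file produced by osm-parquetizer.
    val nodes = spark.read.parquet("japan-latest.osm.pbf.node.parquet")

    // Inspect the schema to check whether the tag fields now show up as strings.
    nodes.printSchema()

    spark.stop()
  }
}
```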

Let us know if this works for you. Changing the actual type in the output would be a backward-incompatible change.