apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

Garbled text when reading Chinese from a shapefile #1480

Closed yancy-hong closed 4 months ago

yancy-hong commented 4 months ago

Expected behavior

The Chinese characters are read correctly.

Actual behavior

The Chinese characters come out garbled.

Steps to reproduce the problem

Partial code (Java):

import org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader;
import org.apache.sedona.core.spatialRDD.SpatialRDD;
import org.apache.sedona.spark.SedonaContext;
import org.apache.sedona.sql.utils.Adapter;
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.locationtech.jts.geom.Geometry;

public static void main(String[] args) {
    SparkSession sedona = SedonaContext.builder()
            .master("local")
            .appName("test")
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .config("spark.kryo.registrator", SedonaVizKryoRegistrator.class.getName())
            .config("spark.driver.extraJavaOptions", "-Dsedona.global.charset=utf8")
            .config("spark.executor.extraJavaOptions", "-Dsedona.global.charset=utf8")
            .getOrCreate();
    SedonaContext.create(sedona);
    // Read the shapefile directory into a SpatialRDD, then convert it to a DataFrame.
    SpatialRDD<Geometry> rdd = ShapefileReader.readToGeometryRDD(
            JavaSparkContext.fromSparkContext(sedona.sparkContext()),
            "F:\\test_file\\test_shp");
    Dataset<Row> df = Adapter.toDf(rdd, sedona);
    df.show(1);
}

I followed the example from the official website ('API Docs -> Sedona with Apache Spark -> Vector data -> Constructor -> Read ESRI Shapefile -> SparkSQL example') to read a shapefile. The file displays Chinese characters properly in GIS software, but when Sedona reads the data, the Chinese portions come out garbled, whether printed to the console or written to a file.

Additionally, printSchema() shows every column as string, even though my shapefile contains other data types. Because everything is coerced to string, the decimal numbers end up rendered in scientific notation. Did I do something wrong, or is there another cause? Please advise, thank you!

Settings

Sedona version = 1.5.1

Apache Spark version = 3.5.0

Apache Flink version = N/A

API type = Java

Scala version = 2.13

JRE version = 1.8

Python version = N/A

Environment = Standalone (local)

yancy-hong commented 4 months ago

I found a solution to the garbled text in issue #190. Adding the line:

System.setProperty("sedona.global.charset","utf8"); 

solved the problem, but the configuration recommended on the official website:

spark.driver.extraJavaOptions -Dsedona.global.charset=utf8 
spark.executor.extraJavaOptions -Dsedona.global.charset=utf8 

did not work.
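
For anyone hitting the same thing in local mode, here is a minimal sketch of the working order (the property must be set before ShapefileReader parses the DBF file; everything else is just the setup from my code above):

// Local mode only: driver and executors share one JVM, so the
// DBF parser sees the property set here.
System.setProperty("sedona.global.charset", "utf8");

SparkSession sedona = SedonaContext.create(
        SedonaContext.builder().master("local").appName("test").getOrCreate());
SpatialRDD<Geometry> rdd = ShapefileReader.readToGeometryRDD(
        JavaSparkContext.fromSparkContext(sedona.sparkContext()),
        "F:\\test_file\\test_shp");
Adapter.toDf(rdd, sedona).show(1);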

Kontinuation commented 4 months ago

spark.driver.extraJavaOptions works when submitting the Spark application with spark-submit or through PySpark (https://github.com/apache/sedona/issues/1345); it alters the Java system properties of the newly spawned Spark driver process. If you are running the main function yourself in local mode, System.setProperty is the proper way to set Java system properties.

Although System.setProperty works in this local setup, it does not work when submitting the Spark application to a cluster. The DBF files are parsed by the executors, and calling System.setProperty on the driver won't alter the Java system properties of the executor JVMs.
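
For cluster deployments, the charset has to reach every executor JVM at launch time, for example via spark-submit (the main class and jar names below are placeholders):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dsedona.global.charset=utf8" \
  --conf "spark.executor.extraJavaOptions=-Dsedona.global.charset=utf8" \
  --class com.example.ReadShapefile \
  your-app.jar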

The columns of the DataFrame converted from a SpatialRDD are all strings; this is a shortcoming of how we parse DBF files and hold user data in SpatialRDD. The attributes in DBF files are all converted to strings, and we don't keep track of their original data types in SpatialRDD. A more proper way to support Shapefiles would be a Shapefile reader based on Spark DataSourceV2, which would load Shapefiles directly as DataFrames.
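
To illustrate the idea, a sketch of what such a reader could look like from the user's side (the "shapefile" format name and the charset option are assumptions about a possible API, not something that exists in 1.5.1):

// Hypothetical DataSourceV2 usage: load Shapefiles directly as a DataFrame,
// preserving DBF field types instead of coercing everything to string.
Dataset<Row> df = sedona.read()
        .format("shapefile")           // assumed format name
        .option("charset", "utf8")     // assumed per-read charset option
        .load("F:\\test_file\\test_shp");
df.printSchema();  // numeric DBF fields would keep numeric Spark types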

yancy-hong commented 4 months ago

Thanks!