I found a solution to the garbled-text problem described in issue #190. Adding this line solved it:

```java
System.setProperty("sedona.global.charset", "utf8");
```

However, the configuration mentioned on the official website did not work:

```
spark.driver.extraJavaOptions   -Dsedona.global.charset=utf8
spark.executor.extraJavaOptions -Dsedona.global.charset=utf8
```
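For context, a minimal sketch of where I placed that call (everything else is my usual setup, elided here):

```java
public static void main(String[] args) {
    // Must run before the shapefile is parsed: in local mode the DBF
    // decoder runs in this same JVM and reads this system property
    // when converting attribute bytes to strings.
    System.setProperty("sedona.global.charset", "utf8");
    // ... create the SparkSession and read the shapefile as usual ...
}
```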
spark.driver.extraJavaOptions works when the Spark application is submitted with spark-submit or launched through PySpark (https://github.com/apache/sedona/issues/1345); it alters the Java system properties of the newly spawned Spark driver process. If you are running the main function in local mode all by yourself, System.setProperty is the proper way to set Java system properties.
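For example, a typical spark-submit invocation applying the option to both the driver and the executors would look like this (the class and jar names below are placeholders):

```
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dsedona.global.charset=utf8" \
  --conf "spark.executor.extraJavaOptions=-Dsedona.global.charset=utf8" \
  --class com.example.ShapefileApp \
  shapefile-app.jar
```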
Although System.setProperty works in this local setup, it does not work when the application is submitted to a cluster: the DBF files are parsed by the executors, and calling System.setProperty on the driver won't alter the Java system properties of the executors.
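As a sketch of one possible workaround, assuming client mode (where the executor JVMs have not started yet when the session is built), the executor-side option can be set programmatically while the driver is handled with System.setProperty; the app name below is a placeholder:

```java
import org.apache.spark.sql.SparkSession;

public class CharsetSetup {
    public static SparkSession buildSession() {
        // The driver JVM is already running, so set its property directly.
        System.setProperty("sedona.global.charset", "utf8");

        // Executors launch after the session is created, so this option can
        // still reach them when set here rather than via spark-submit.
        return SparkSession.builder()
                .appName("sedona-charset-example")  // placeholder name
                .config("spark.executor.extraJavaOptions",
                        "-Dsedona.global.charset=utf8")
                .getOrCreate();
    }
}
```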
The columns of the DataFrame converted from a SpatialRDD are all strings. This is a shortcoming of how we parse DBF files and hold user data in SpatialRDD: every attribute in a DBF file is converted to a string, and we don't keep track of the original data types. A more proper way to support Shapefiles would be a Shapefile reader based on Spark DataSourceV2, which would load Shapefiles directly as DataFrames.
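Until such a reader exists, the string columns can be cast back manually after the conversion. A sketch, assuming df is the DataFrame produced by Adapter.toDf and using hypothetical attribute names:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// "population" and "area" are hypothetical attribute names; cast each
// string column back to the type it had in the original DBF schema.
Dataset<Row> typed = df
        .withColumn("population", col("population").cast("long"))
        .withColumn("area", col("area").cast("double"));
typed.printSchema();
```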
Thanks!
Expected behavior
Chinese characters are read and displayed correctly.
Actual behavior
The Chinese text comes out garbled.
Steps to reproduce the problem
Partial code (Java):
I followed the example from the official website ('API Docs -> Sedona with Apache Spark -> Vector data -> Constructor -> Read ESRI Shapefile -> SparkSQL example') to read a shp file. The file displays Chinese characters correctly in GIS software, but when Sedona reads the data, the Chinese parts come out garbled, whether printed to the console or written to a file. Additionally, printSchema() shows every column type as string, even though my shp file contains other data types as well; because everything is a string, the decimal numbers end up displayed in scientific notation. Did I perform the operation incorrectly, or is there another reason? Please advise, thank you!
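For reference, a minimal sketch of the flow I followed, with the input path replaced by a placeholder:

```java
import org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader;
import org.apache.sedona.core.spatialRDD.SpatialRDD;
import org.apache.sedona.spark.SedonaContext;
import org.apache.sedona.sql.utils.Adapter;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.locationtech.jts.geom.Geometry;

public class ReadShapefile {
    public static void main(String[] args) {
        // Without this line the Chinese attribute values come out garbled.
        System.setProperty("sedona.global.charset", "utf8");

        SparkSession spark = SedonaContext.create(
                SedonaContext.builder().master("local[*]").getOrCreate());
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // "/path/to/shapefile_dir" is a placeholder for the real input path.
        SpatialRDD<Geometry> rdd =
                ShapefileReader.readToGeometryRDD(jsc, "/path/to/shapefile_dir");
        Dataset<Row> df = Adapter.toDf(rdd, spark);
        df.printSchema();  // every attribute column shows up as string
        df.show();
    }
}
```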
Settings
Sedona version = 1.5.1
Apache Spark version = 3.5.0
Apache Flink version = N/A
API type = Java
Scala version = 2.13
JRE version = 1.8
Python version = N/A
Environment = Standalone, local