apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

Babylon distributed image compile error #80

Closed geoHeil closed 7 years ago

geoHeil commented 7 years ago

I get the following compile error when Babylon distributed images are used:

overloaded method value SaveAsFile with alternatives:
[error]   (x$1: java.util.List[String],x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean <and>
[error]   (x$1: java.awt.image.BufferedImage,x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean <and>
[error]   (x$1: org.apache.spark.api.java.JavaPairRDD,x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean
[error]  cannot be applied to (org.apache.spark.api.java.JavaPairRDD[Integer,String], String, org.datasyslab.babylon.utils.ImageType)
[error]     imageGenerator.SaveAsFile(visualizationOperator.distributedVectorImage, outputPath, ImageType.SVG)

By contrast, the following, which simply uses the vector image, compiles fine:

def buildScatterPlot(outputPath: String, spatialRDD: SpatialRDD): Boolean = {
    val envelope = spatialRDD.boundaryEnvelope
    val s = spatialRDD.getRawSpatialRDD.rdd.sparkContext
    val visualizationOperator = new ScatterPlot(7000, 4900, envelope, false, -1, -1, true, true)
    visualizationOperator.CustomizeColor(255, 255, 255, 255, Color.GREEN, true)
    visualizationOperator.Visualize(s, spatialRDD)
    import org.datasyslab.babylon.utils.ImageType
    imageGenerator.SaveAsFile(visualizationOperator.vectorImage, outputPath, ImageType.SVG)
  }

@jiayuasu could you please explain whether distributedVectorImage is required when the boolean option for parallel image rendering / filtering is selected?

geoHeil commented 7 years ago

I also get null pointer exceptions when using the regular rasterImage as outlined here: https://gist.github.com/geoHeil/dbef714e2254956840832ebaabf12a07

jiayuasu commented 7 years ago

@geoHeil

new ScatterPlot(7000, 4900, envelope, false, -1, -1, false, true)

Since you set the last parameter, "generateVectorImage", to true, you have to use vectorImage and store it in SVG format.

imageGenerator.SaveAsFile(visualizationOperator.vectorImage, outputPath, ImageType.SVG)

In addition, for generating a raster image you don't need a super high resolution; 1000*600 will be good enough.

jiayuasu commented 7 years ago

@geoHeil I know Babylon APIs are complicated. Trying to figure out a better API structure.

geoHeil commented 7 years ago

I will try vectors then, but my initial attempts failed there as well. Can you also comment on the compile error reported above when using the distributed version?

Is it correct to assume that, in order for distributed rendering to work, I need to set the appropriate boolean flags to true as well? I.e., it is not enough to request distributed rendering from the visualization operator.

geoHeil commented 7 years ago

to clarify

def parallelFilterRenderStitch(outputPath: String, spatialRDD: SpatialRDD): Boolean = {
    val s = spatialRDD.getRawSpatialRDD.rdd.sparkContext
    val visualizationOperator = new HeatMap(1000, 600, spatialRDD.boundaryEnvelope, false, 2, -1, -1, false, true)
    visualizationOperator.Visualize(s, spatialRDD)
    visualizationOperator.stitchImagePartitions
    imageGenerator.SaveAsFile(visualizationOperator.distributedRasterImage, outputPath, ImageType.PNG)
  }

fails with the following compile error:

overloaded method value SaveAsFile with alternatives:
[error]   (x$1: java.util.List[String],x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean <and>
[error]   (x$1: java.awt.image.BufferedImage,x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean <and>
[error]   (x$1: org.apache.spark.api.java.JavaPairRDD,x$2: String,x$3: org.datasyslab.babylon.utils.ImageType)Boolean
[error]  cannot be applied to (org.apache.spark.api.java.JavaPairRDD[Integer,org.datasyslab.babylon.core.ImageSerializableWrapper], String, org.datasyslab.babylon.utils.ImageType)
[error]     imageGenerator.SaveAsFile(visualizationOperator.distributedRasterImage, outputPath, ImageType.PNG)

Could you please explain the last couple of parameters (-1, -1, false, false)? Shouldn't the number of partitions be inferred automatically? Is it correct / mandatory to set the last ones to true (assuming the compile error is fixed) in order to perform distributed rendering for a speed increase?

jiayuasu commented 7 years ago

@geoHeil, the partitions on X and Y in parallel filtering and parallel rendering should all be > 0 (e.g., 2, 2) if you set the corresponding two booleans to true.
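The rule above can be written down as a small pre-flight check. This is only a sketch: RenderConfig and validate are hypothetical names, not part of the Babylon API. It assumes that -1 means "unset" (as in the constructor calls earlier in the thread) and that either parallel flag on its own already requires positive partition counts.

```scala
// Hypothetical helper, not part of Babylon: checks that the partition counts
// are positive whenever parallel filtering or parallel rendering is requested.
final case class RenderConfig(
    partitionX: Int,         // partitions along X, -1 when unused (assumption)
    partitionY: Int,         // partitions along Y, -1 when unused (assumption)
    parallelFilter: Boolean,
    parallelRender: Boolean)

object RenderConfig {
  /** Returns Right(config) if the settings are consistent, else Left(reason). */
  def validate(c: RenderConfig): Either[String, RenderConfig] = {
    // Assumption: either flag alone is enough to require positive partitions.
    val parallel = c.parallelFilter || c.parallelRender
    if (parallel && (c.partitionX <= 0 || c.partitionY <= 0))
      Left(s"parallel mode needs partitionX and partitionY > 0, got (${c.partitionX}, ${c.partitionY})")
    else
      Right(c)
  }
}
```

Under these assumptions, RenderConfig.validate(RenderConfig(-1, -1, false, true)) yields a Left, while RenderConfig(2, 2, true, true) passes.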

This will generate a distributed raster image. You can use the Spark image generator to store it on HDFS/S3 or other Spark-friendly storage. You can also use NativeJavaImageGenerator to store the distributed raster image on the local file system; it will generate a bunch of image tiles. If you choose to stitch the tiles, the stitch function will generate a single raster image and store it as rasterImage.
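To illustrate what stitching tiles means, here is a minimal, self-contained sketch using only java.awt. It is not Babylon's implementation: TileStitcher, the (column, row) key layout, and the assumption that all tiles have the same size are made up for the example.

```scala
import java.awt.image.BufferedImage

// Hypothetical sketch, not Babylon's stitch function: pastes equally sized
// tiles into one image. Each tile is keyed by its (column, row) position.
object TileStitcher {
  def stitch(tiles: Map[(Int, Int), BufferedImage],
             tileWidth: Int, tileHeight: Int,
             columns: Int, rows: Int): BufferedImage = {
    val full = new BufferedImage(tileWidth * columns, tileHeight * rows,
                                 BufferedImage.TYPE_INT_ARGB)
    val g = full.createGraphics()
    try {
      // Draw every tile at the pixel offset implied by its grid position.
      for (((col, row), tile) <- tiles)
        g.drawImage(tile, col * tileWidth, row * tileHeight, null)
    } finally g.dispose()
    full
  }
}
```

Grid positions with no tile simply stay transparent in the result.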

I suggest you

  1. Start with my Babylon runnable demo. You can clone the GeoSpark repository and run that Java file directly. Later on you can switch to Scala.
  2. Then try RasterImage/distributed RasterImage first. Use the simplest Scatter Plot and store the images on the local file system.
  3. Then try the HeatMap raster image and the Choropleth Map raster image. Please use PNG format in all cases; I have noticed that GIF sometimes does not work well.
  4. Finally, try the vector image format and store the result on the local file system. Note that vector images are currently only available for Scatter Plot and Choropleth Map.

In addition, the difference between the Spark image generator and the Java image generator is:

  1. The former can store a distributed rasterImageRDD/vectorImageRDD to Spark-friendly storage such as S3/HDFS/the local file system in binary format. This is scalable, but you may not be able to read the persisted images outside Spark because they are in Spark binary format.
  2. The latter can only store distributed or single raster/vector images on your local file system, but you can view them with a regular image viewer.

geoHeil commented 7 years ago

Thanks. Regarding partitions, is there any recommended size, or is it like regular Spark, i.e. 2-3 times the number of CPUs?

geoHeil commented 7 years ago

@jiayuasu, I created the babylon example in scala here as well: https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/visualization/VisualizationGeosparkLocalRaster.scala

With the visualization implementation https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/visualization/Vis.scala

As you will see https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/visualization/Vis.scala#L42-L61 the compile error overloaded method value SaveAsFile with alternatives is still there.

Unfortunately, your sample data from src/test/resources/ triggers an IllegalArgumentException: Points of LinearRing do not form a closed linestring. To reproduce, simply run:

git clone https://github.com/geoHeil/geoSparkScalaSample.git
cd geoSparkScalaSample
sbt run
# when prompted for multiple main classes, select 2

I will try to ask some Scala experts regarding the compile issue. Maybe you could have a look at the input files.

geoHeil commented 7 years ago

@jiayuasu: Unfortunately, I think there is a bug in the latest GeoSpark version that causes "Points of LinearRing do not form a closed linestring" for a dataset which worked fine with older versions.

geoHeil commented 7 years ago

@jiayuasu regarding the original problem, see http://stackoverflow.com/questions/43626048/convert-java-to-scala-code-change-of-method-signatures/43630473#43630473: you are using raw types. Do you have any plans to use regular generics as suggested in the answer?

jiayuasu commented 7 years ago

@geoHeil, thanks for the great information! Regarding the two bugs you've mentioned: 1. How can I reproduce the first one (do not form a closed linestring)? It is weird, and the GeoSpark unit tests probably don't cover it. 2. I intentionally made JavaPairRDD not type-safe for some reasons, but now I realize that was not wise.

I will release a patch to fix the second bug very soon. Also please tell me how to reproduce the "LinearRing do not closed" bug.

To summarize, this Babylon bug happens when users use the Babylon API to visualize a distributed image RDD in Scala.

geoHeil commented 7 years ago

@jiayuasu thanks. I will create a separate issue for the line string problem. https://github.com/DataSystemsLab/GeoSpark/issues/83

jiayuasu commented 7 years ago

Hi @geoHeil ,

This issue should have been solved in Babylon 0.1.2-snapshot. I have deprecated all old image generators and added a new "BabylonImageGenerator" to replace all old generator APIs. The new "BabylonImageGenerator" has new APIs which are easier to understand.

Please refer to the latest Babylon Java example and try it out in Scala. I think you just need to replace the old imageGenerator part.

geoHeil commented 7 years ago

Hi @jiayuasu, thanks for the quick response. The new API is much nicer. It would be great, though, if you allowed existing output to be overwritten automatically for distributed images, similar to df.write.mode(SaveMode.Overwrite).parquet(path).

However, I still get an IllegalArgumentException: image == null! when saving a distributed raster image to the local file system:


val vDistributedRaster = new ScatterPlot(1000, 600, USMainLandBoundary, false, 2, 2, true, false)
vDistributedRaster.CustomizeColor(255, 255, 255, 255, Color.GREEN, true)
vDistributedRaster.Visualize(spark.sparkContext, spatialRDD)
val imageGenerator = new BabylonImageGenerator()
imageGenerator.SaveRasterImageAsLocalFile(vDistributedRaster.distributedRasterImage, scatterPlotOutputPath + "distributedRaster", ImageType.PNG)

java.lang.IllegalArgumentException: image == null!                              
  at javax.imageio.ImageTypeSpecifier.createFromRenderedImage(ImageTypeSpecifier.java:925)
  at javax.imageio.ImageIO.getWriter(ImageIO.java:1592)
  at javax.imageio.ImageIO.write(ImageIO.java:1520)
  at org.datasyslab.babylon.extension.imageGenerator.BabylonImageGenerator.SaveRasterImageAsLocalFile(BabylonImageGenerator.java:35)
  at org.datasyslab.babylon.core.AbstractImageGenerator.SaveRasterImageAsLocalFile(AbstractImageGenerator.java:59)
  ... 42 elided
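
The trace shows javax.imageio.ImageIO.write being handed a null image, which it rejects with an IllegalArgumentException. Whatever the reason for the missing image, a defensive wrapper along the following lines would turn that into a readable error. This is a sketch only: SafeImageWriter and writePng are hypothetical names, not part of Babylon.

```scala
import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

// Hypothetical wrapper, not part of Babylon: refuses to write a missing image
// and reports the reason instead of letting ImageIO throw.
object SafeImageWriter {
  def writePng(image: BufferedImage, path: String): Either[String, File] =
    Option(image) match {
      case None =>
        Left("image is null: no single raster image was produced")
      case Some(img) =>
        val target = new File(path)
        // ImageIO.write returns false when no suitable writer is found.
        if (ImageIO.write(img, "png", target)) Right(target)
        else Left(s"no PNG writer available for $path")
    }
}
```

Calling SafeImageWriter.writePng(null, "out.png") returns a Left with the message rather than throwing.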

geoHeil commented 7 years ago

I could track down the problem: new HeatMap(7000, 4900, envelope, false, 1, 2, 2, true, true) should render a distributed image but fails with the image == null! exception from above, whereas new HeatMap(7000, 4900, envelope, false, 2) works just fine.

Some more context is provided here. The commented-out lines are the ones that trigger the exception.

def buildHeatMap(outputPath: String, spatialRDD: SpatialRDD, envelope: Envelope): Boolean = {
  val s = spatialRDD.getRawSpatialRDD.rdd.sparkContext
  // TODO: strange overhead for distributed image rendering. No task is scheduled for 2 min before something happens.
  // val visualizationOperator = new HeatMap(7000, 4900, envelope, false, 1, 2, 2, true, true)
  val visualizationOperator = new HeatMap(7000, 4900, envelope, false, 2)
  visualizationOperator.Visualize(s, spatialRDD)
  // val imageGenerator = new BabylonImageGenerator
  imageGenerator.SaveRasterImageAsLocalFile(visualizationOperator.rasterImage, outputPath, ImageType.PNG)
  // imageGenerator.SaveRasterImageAsLocalFile(visualizationOperator.distributedRasterImage, outputPath, ImageType.PNG)
}

geoHeil commented 7 years ago

Still, it seems to work in https://github.com/DataSystemsLab/GeoSpark/blob/master/babylon/src/main/scala/org/datasyslab/geospark/showcase/ScalaExample.scala, and I am not sure what is different. As it must be an issue on my side, I will close the issue.

jiayuasu commented 7 years ago

@geoHeil OK, thanks. I will investigate more and optimize Babylon performance.