apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

issues loading geoJSON #97

Closed dcalacci closed 7 years ago

dcalacci commented 7 years ago

Hi, thank you for building such a useful tool!

I'm trying hard to use geospark in a simple project. I want to do a spatial join on some point data (which I can load fine from a set of CSVs) and US Census Block Group data.

I am having trouble loading the polygon data from the block groups into geospark.

I first use ogr2ogr to convert the available block group shapefiles. The original shp files were downloaded from here.

  1. convert to geojson:

    ogr2ogr -f "GEOJSON" bg_geo.json cb_2016_25_bg_500k.shp
  2. try to load using geospark:

    import org.datasyslab.geospark.enums.FileDataSplitter
    import org.datasyslab.geospark.spatialRDD.PolygonRDD
    import org.apache.spark.storage.StorageLevel
    val storageLevel = new StorageLevel()
    val acs_rdd = new PolygonRDD(sc, "s3://mixing/raw/acs_bg_geo.json", FileDataSplitter.GEOJSON, false, 5, storageLevel);

Expected Behavior: The GeoJSON is loaded into the polygon RDD properly.

Actual Behavior: I get the following error. I am able to load this GeoJSON in other tools. I'm pretty new to Scala/Spark/GeoSpark, so perhaps I'm making a simple mistake? Any help is appreciated!

java.lang.RuntimeException: com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: {; line: 1, column: 0])
 at [Source: {; line: 1, column: 3]
    at org.wololo.geojson.GeoJSONFactory.create(GeoJSONFactory.java:31)
    at org.wololo.jts2geojson.GeoJSONReader.read(GeoJSONReader.java:16)
    at org.datasyslab.geospark.formatMapper.PolygonFormatMapper.call(PolygonFormatMapper.java:102)
    at org.datasyslab.geospark.formatMapper.PolygonFormatMapper.call(PolygonFormatMapper.java:31)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:363)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: {; line: 1, column: 0])
 at [Source: {; line: 1, column: 3]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:470)
    at com.fasterxml.jackson.core.base.ParserBase._handleEOF(ParserBase.java:501)
    at com.fasterxml.jackson.core.base.ParserBase._eofAsNextChar(ParserBase.java:509)
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2018)
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextFieldName(ReaderBasedJsonParser.java:743)
    at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:208)
    at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:69)
    at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3736)
    at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2271)
    at org.wololo.geojson.GeoJSONFactory.create(GeoJSONFactory.java:21)
    ... 21 more
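Editor's note: the stack trace is consistent with a loader that parses the file line by line. ogr2ogr writes pretty-printed GeoJSON, so the first line of the file is just the opening brace, and a per-line JSON parse fails exactly as the trace shows ("Unexpected end-of-input ... [Source: {"). A minimal illustration in plain Python (not GeoSpark itself):

```python
import json

# A pretty-printed FeatureCollection, as ogr2ogr emits by default:
# the object spans many lines.
pretty = '{\n  "type": "FeatureCollection",\n  "features": []\n}'

# A line-oriented loader only ever sees one line at a time.
first_line = pretty.splitlines()[0]  # just "{"
try:
    json.loads(first_line)
except json.JSONDecodeError as exc:
    # Same failure mode as the Jackson error above:
    # the object opened with "{" but the input ended.
    print("parse failed:", exc)
```

Parsing the whole file at once would succeed; parsing any single line of it cannot.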
jiayuasu commented 7 years ago

@dcalacci I think the reason is that GeoSpark only supports single-line GeoJSON: each GeoJSON spatial object must be written on a single line.

Here are three possible solutions:

(1) Convert your multiline GeoJSON to single line GeoJSON

(2) Use ogr2ogr to convert your shp file to "PostgreSQL" format. I believe that format is single-line (probably WKT). Then you can easily use GeoSpark to load it. If the format is not standard WKT, you can write a customized format mapper to process it (Example). All of these steps should take you less than an hour.

(3) If you are not in a rush, you can wait one week for our new GeoSpark patch on Shapefile. Our new patch is almost completed and will be released early next week. It will fully support loading shapefile (shx+dbf) from local disk, S3 and HDFS.
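Editor's note: option (1) amounts to rewriting the FeatureCollection as one compact feature per line (JSON Lines). A hedged sketch in plain Python using only the standard library; the filenames are hypothetical, and this is an illustration of the transformation, not part of GeoSpark:

```python
import json

def flatten_geojson(src_path: str, dst_path: str) -> int:
    """Rewrite a (possibly pretty-printed) FeatureCollection so that
    each feature occupies exactly one compact line. Returns the
    number of features written."""
    with open(src_path) as src:
        collection = json.load(src)  # whole-file parse, layout-agnostic
    features = collection["features"]
    with open(dst_path, "w") as dst:
        for feature in features:
            # separators=(",", ":") strips whitespace so the feature
            # stays on one line regardless of how the input was laid out
            dst.write(json.dumps(feature, separators=(",", ":")) + "\n")
    return len(features)
```

This is the same effect as the jq one-liner used later in this thread, just spelled out step by step.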

Please feel free to drop me messages if you face any issues.

Thanks, Jia

dcalacci commented 7 years ago

Thanks for the quick response! Is it documented anywhere that the GeoJSON must be single-line?

Some feedback on the documentation: I've found it pretty confusing to figure out which data format to use.

Also, hooray to (3)! Exciting. Thanks for all your work.

dcalacci commented 7 years ago

@jiayuasu thanks again for the response.

I'm still having trouble using GeoJSON. I'm not sure how to export in the PostgreSQL format without dumping it into a database, which isn't really what I want. I can export to a CSV where one field is a WKT string, but I can't figure out how to get GeoSpark to load that file.

To convert to single-line JSON, I used jq:

cat acs_geo.json | jq -c '.features | .[]' > acs_geo.jsonl

A single line of that JSON Lines file looks like:

{"type": "Feature", "properties": {...}, "geometry": {"type": "MultiPolygon","coordinates":[[[[...]]]]}}

I expected this to work, but no dice. Any tips? I understand this might be an idiosyncrasy of the GeoJSON parser, which you didn't write.

Thanks!

jiayuasu commented 7 years ago

@dcalacci

There is a bug in the GeoSpark GeoJSON loader: it makes GeoSpark fail to load GeoJSON strings that contain "type": "Feature".

This has been resolved in GeoSpark 0.7.1-SNAPSHOT. Please change your dependency and try it out.
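Editor's note: picking up a snapshot build typically means pointing the build at a snapshot repository. The sbt fragment below is a sketch from memory, not taken from this thread; verify the exact coordinates and repository URL against the GeoSpark documentation of that era before relying on them.

```scala
// build.sbt -- assumed coordinates; check the GeoSpark docs to confirm
resolvers += "Sonatype Snapshots" at
  "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies += "org.datasyslab" % "geospark" % "0.7.1-SNAPSHOT"
```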

Now GeoSpark should be able to load

{"type": "Feature", "properties": {...}, "geometry": {"type": "MultiPolygon","coordinates":[[[[...]]]]}}
dcalacci commented 7 years ago

Thank you! It loaded properly! :100:

Thanks for the help and for being so responsive. Looking forward to the next release -- this library should help me a ton with my research.