Esri / gis-tools-for-hadoop

The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.
http://esri.github.io/gis-tools-for-hadoop/
Apache License 2.0
521 stars 254 forks

Enclosed InputFormats do not work #83

Open doublebyte1 opened 5 years ago

doublebyte1 commented 5 years ago

I am following the instructions in this tutorial, and I am able to create a table using the UnenclosedEsriJsonInputFormat.

However, I would like to use the enclosed format.

I have tried these two serdes:

      CREATE TABLE taxi_agg(area BINARY, count DOUBLE)
      ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
      STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

      CREATE TABLE taxi_agg(area BINARY, count DOUBLE)
      ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.GeoJsonSerDe'
      STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedGeoJsonInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Although I am able to create the table and insert data, a select always returns an empty result:

      select ST_AsGeoJSON(area), count from taxi_agg;

Changing EnclosedEsriJsonInputFormat to UnenclosedEsriJsonInputFormat, or EnclosedGeoJsonInputFormat to UnenclosedGeoJsonInputFormat, gives correct results.

Not sure if I am doing something wrong, or if there is a problem with the Enclosed Serde.

Version: 2.0.0

randallwhitman commented 5 years ago

Thanks for reporting this. I assume "version 2.0.0" refers to Spatial Framework for Hadoop. Please let us know the versions of Hive and Hadoop.

doublebyte1 commented 5 years ago

@randallwhitman Hadoop 2.8.5, Hive 2.3.4

randallwhitman commented 5 years ago

Thanks for the details. We do not have Hive-2.3.4 (or Hadoop-2.8.5) installed, and unfortunately the testing framework does not yet make it easy to paste a sample query into a test (see Esri/spatial-framework-for-hadoop#163). Maybe it will reproduce with another version of Hive or with SparkSql.

doublebyte1 commented 5 years ago

I can confirm that both issues reproduce on Hadoop 2.8.3 and Hive 2.3.2.

randallwhitman commented 5 years ago

I took a look at reading Enclosed Esri JSON, using 15 points from the JSON-MR mini-sample, and Hive-2.3.5 read the table data OK.

create external table test15eej(rowid int, shape binary)
row format serde 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
stored as inputformat 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://hdfs:8020/path/to/test15_eej';

hive> select rowid, ST_AsText(shape) from test15eej;
1505    POINT (15 5)
535     POINT (5 35)
2323    POINT (23 23)
3222    POINT (32 22)
3728    POINT (37 28)
2233    POINT (22 33)
2838    POINT (28 38)
3434    POINT (34 34)
6219    POINT (62 19)
7114    POINT (71 14)
7525    POINT (75 25)
6535    POINT (65 35)
5549    POINT (55 49)
6545    POINT (65 45)
4566    POINT (45 66)

I guess that tests only reading, not writing.

randallwhitman commented 4 years ago

Finally reproduced the reported issue.

create external table test15eej(rowid int, shape binary)
row format serde 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
stored as inputformat 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'file:///tmp/test15eej';

hive> select rowid, ST_AsText(shape) from write15eej;
OK
Time taken: 0.154 seconds

The output file was in fact unenclosed, as shown by cat /tmp/write15eej/000000_0:

{"attributes":{"rowid":1505},"geometry":{"x":15,"y":5}}
{"attributes":{"rowid":535},"geometry":{"x":5,"y":35}}
{"attributes":{"rowid":2323},"geometry":{"x":23,"y":23}}
{"attributes":{"rowid":3222},"geometry":{"x":32,"y":22}}
{"attributes":{"rowid":3728},"geometry":{"x":37,"y":28}}
{"attributes":{"rowid":2233},"geometry":{"x":22,"y":33}}
{"attributes":{"rowid":2838},"geometry":{"x":28,"y":38}}
{"attributes":{"rowid":3434},"geometry":{"x":34,"y":34}}
{"attributes":{"rowid":6219},"geometry":{"x":62,"y":19}}
{"attributes":{"rowid":7114},"geometry":{"x":71,"y":14}}
{"attributes":{"rowid":7525},"geometry":{"x":75,"y":25}}
{"attributes":{"rowid":6535},"geometry":{"x":65,"y":35}}
{"attributes":{"rowid":5549},"geometry":{"x":55,"y":49}}
{"attributes":{"rowid":6545},"geometry":{"x":65,"y":45}}
{"attributes":{"rowid":4566},"geometry":{"x":45,"y":66}}

Reading the same directory through the Unenclosed input format returns the rows:

create external table alt15uej(rowid int, shape binary)
row format serde 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
stored as inputformat 'com.esri.json.hadoop.UnenclosedEsriJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'file:///tmp/write15eej';

hive> select rowid, ST_AsText(shape) from alt15uej limit 2;
OK
1505    POINT (15 5)
535     POINT (5 35)
Time taken: 0.146 seconds, Fetched: 2 row(s)

With larger data, the output would be expected to span multiple files. In that case, it is not clear how the output as a whole could be enclosed at all. Perhaps each file of the collection could be written in Enclosed format?
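
To make the per-file idea concrete, here is a minimal Python sketch of wrapping one unenclosed output file into a single enclosed document. It assumes the Enclosed Esri JSON layout is one top-level object with a "features" array holding the same records that the unenclosed file stores one per line. The function name enclose_records is hypothetical and is not part of the Spatial Framework for Hadoop; this only illustrates the two layouts and is not a proposed fix.

```python
import json


def enclose_records(unenclosed_text):
    """Wrap one-JSON-record-per-line text in a single {"features": [...]} document.

    Assumption: the enclosed layout is a top-level object with a "features" array.
    """
    features = [
        json.loads(line)
        for line in unenclosed_text.splitlines()
        if line.strip()  # skip blank lines
    ]
    return json.dumps({"features": features}, separators=(",", ":"))


# Two records copied from the unenclosed output file shown above.
part_file = (
    '{"attributes":{"rowid":1505},"geometry":{"x":15,"y":5}}\n'
    '{"attributes":{"rowid":535},"geometry":{"x":5,"y":35}}\n'
)

enclosed = enclose_records(part_file)
print(enclosed)
```

Applied separately to each output file, this would give a collection of independently enclosed files, which is the arrangement the question above raises.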