Esri / gis-tools-for-hadoop

The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.
http://esri.github.io/gis-tools-for-hadoop/
Apache License 2.0

Converting taxi_agg table to JSON #62

Closed MjHow912 closed 4 years ago

MjHow912 commented 7 years ago

After running the Taxi demo in Hive, I want to convert the results table back to JSON so that I can export it to a shared location where my users can access it and open it in ArcMap. I cannot use the copy tool in ArcMap because my cluster is kerberized, which opens up more trouble. It's a better process for me to convert the output back to JSON. Has anyone discovered how to do so?

randallwhitman commented 7 years ago

Use the EsriJsonSerDe - see the trip-discovery sample (but use the Spatial-Framework-v2 names - the sample needs update for that).

MjHow912 commented 7 years ago

So if the table was created and saved in Hive as:

```sql
CREATE TABLE taxi_agg (area BINARY, count DOUBLE)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

To get back to JSON format for exporting, I found this code and altered it a bit:

```sql
ALTER TABLE taxi_agg SET FILEFORMAT
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
OUTPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe';
```

Error: `FAILED: SemanticException Cannot find class 'com.esri.hadoop.hive.serde.EsriJsonSerDe'`

I also tried:

```sql
ALTER TABLE test15json SET FILEFORMAT
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
OUTPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe';
```

Error:

```
MismatchedTokenException(226!=233)
    at org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecognizer.java:617)
    at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
    at org.apache.hadoop.hive.ql.parse.HiveParser.fileFormat(HiveParser.java:15604)
    at org.apache.hadoop.hive.ql.parse.HiveParser.alterStatementSuffixFileFormat(HiveParser.java:12890)
    at org.apache.hadoop.hive.ql.parse.HiveParser.alterTblPartitionStatementSuffix(HiveParser.java:8406)
    at org.apache.hadoop.hive.ql.parse.HiveParser.alterTableStatementSuffix(HiveParser.java:8159)
    at org.apache.hadoop.hive.ql.parse.HiveParser.alterStatement(HiveParser.java:7433)
    at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2702)
    at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1665)
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1122)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:437)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:320)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1219)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1260)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1156)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1146)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:217)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:169)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:380)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:740)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:685)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
FAILED: ParseException line 4:1 mismatched input 'ROW' expecting SERDE near ''com.esri.json.hadoop.UnenclosedJsonInputFormat'' in file format specification
```

Any suggestions on what to try?

randallwhitman commented 7 years ago

Get Spatial Framework v2

MjHow912 commented 6 years ago

@randallwhitman Installed, and table altered. I still run into the problem of exporting it, though. WebHDFS is not going to work because of authenticating to a kerberized cluster. If I don't use the ArcMap tool that copies from HDFS and turns it into JSON, can it be done manually? Argh.

randallwhitman commented 6 years ago

If you have access, `scp` and/or the `hdfs dfs` CLI, etc.
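The manual copy suggested above can be sketched out. This is a hypothetical helper, not part of the Esri tools; the warehouse path and local directory are made-up examples, and on a kerberized cluster you would need a valid ticket from `kinit` before the `hdfs` client call succeeds:

```python
import subprocess

def hdfs_get_command(hdfs_path, local_path):
    # Builds the equivalent of: hdfs dfs -get <hdfs_path> <local_path>
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]

# Hypothetical table location; adjust for your cluster's warehouse layout.
cmd = hdfs_get_command("/user/hive/warehouse/taxi_agg", "./taxi_agg_export")
print(" ".join(cmd))

# Run only on a machine with the hdfs client installed and a Kerberos ticket:
# subprocess.run(cmd, check=True)
```

From an edge node, `scp` can then move the fetched files to the shared location.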

MjHow912 commented 6 years ago

But that's an invalid file; it's not JSON. The conversion to JSON is still missing. In the example it's done in the GIS tool. When moving manually, where/how is the conversion done?

randallwhitman commented 6 years ago

Remind me, conversion of exactly what to what?

MjHow912 commented 6 years ago

https://github.com/Esri/gis-tools-for-hadoop/wiki/Aggregating-CSV-Data-%28Spatial-Binning%29

Attempting to move the taxi_agg table into ArcMap, the steps after 12 are only geared towards transporting and transforming via the GISTool.

randallwhitman commented 6 years ago

Can you spell out what needs to be converted, from what format, to what format?

MjHow912 commented 6 years ago

The SerDe allows you to take the sample data that is in JSON format and convert it into something that is readable in Hive. Once you are finished working with the data, the taxi_agg table is still in a format that is only readable by Hive. In order for it to be viewed in ArcMap, it has to go through a conversion process: 1. copy the Hive table from HDFS -> temporary JSON file; 2. JSON -> feature class.

The GIS tools toolkit that ArcMap has does the first conversion for you. If you're unable to connect to Hive from ArcMap, there needs to be a manual way of converting the Hive table to JSON. If you export the Hive table, the file is in .deflate format and does not work in ArcMap because it doesn't recognize the file type.

randallwhitman commented 6 years ago

Take a look at the trip-discovery sample as mentioned 11/17. The output is converted to JSON, under the section entitled "Import into ArcMap and visualize" in the associated blog post. Update the recipes for v2.

MjHow912 commented 6 years ago

@randallwhitman I believe there is a disconnect in what I'm attempting to explain to you. Once you run:

```sql
CREATE EXTERNAL TABLE trip_origin_json (
  totnum INT, samedest INT, pct DOUBLE,
  destlhs DOUBLE, destbot DOUBLE, destrhs DOUBLE, desttop DOUBLE,
  shape BINARY)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/rwhitman/trip-origin-json';
```

yes, you technically have a JSON table, but it's compressed. Unfortunately, that file format is .deflate, not .json. In every example shown, the only way to go from a .deflate file to a .json file is via the GIS toolkit in ArcMap. I'm clueless as to how to do this part manually. How do I go from .deflate to .json without the toolkit? At this point I may need to run some sort of decompression code. Do you know what commands are run in the "Copy From HDFS" tool that takes trip-origin-json to trips-cor-01-tgcs.json?

randallwhitman commented 6 years ago

I don't remember the JSON being compressed. Maybe that is a change in a newer version of Hive?

Try something like `hdfs dfs -cat path/to/file | zcat`.

> Do you know what the commands are that are ran in the "Copy from HDFS" [tool]

The source is open ..
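One caveat with the `zcat` suggestion above: Hadoop's default codec writes zlib-format streams to the `.deflate` files, and `zcat`, which expects gzip headers, may refuse them. Python's `zlib` module reads zlib-format data directly. A minimal sketch, assuming the part files are zlib-format and using in-memory data to stand in for one:

```python
import zlib

def inflate_deflate_file(path_in, path_out):
    """Decompress a zlib-format .deflate part file back to plain text."""
    with open(path_in, "rb") as f:
        data = zlib.decompress(f.read())
    with open(path_out, "wb") as f:
        f.write(data)

# Round-trip demo: one line of unenclosed Esri JSON, compressed the way
# the codec would write it, then inflated back.
original = b'{"attributes":{"count":3},"geometry":{"x":1.0,"y":2.0}}\n'
compressed = zlib.compress(original)
assert zlib.decompress(compressed) == original
```

If the files turn out to be raw deflate instead, `zlib.decompress(data, -15)` handles that variant.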

GISDev01 commented 6 years ago

@MjHow912 If I understand your question correctly, starting with Step 6 on this page: https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-the-results-of-a-Hive-query-into-ArcGIS

The ArcToolbox GeoProcessing (GP) tool 'Copy From HDFS' is just running this code: https://github.com/Esri/geoprocessing-tools-for-hadoop/blob/master/HadoopTools.pyt#L193
The main business logic will be found here: https://github.com/Esri/geoprocessing-tools-for-hadoop/blob/master/webhdfs/webhdfs.py#L126

The GP tool 'JSON To Features' is just running this code: https://github.com/Esri/geoprocessing-tools-for-hadoop/blob/master/HadoopTools.pyt#L369
The main business logic (for unenclosed JSON) is found here: https://github.com/Esri/geoprocessing-tools-for-hadoop/blob/master/JSONUtil.py#L179
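For anyone who cannot run the GP tool at all, the unenclosed-to-enclosed step it performs can be approximated by hand. This is a hypothetical helper, not code from the repo; the geometry type and spatial reference values below are assumptions to adjust for your own table:

```python
import json

def enclose(lines, geometry_type, wkid=4326):
    """Wrap unenclosed Esri JSON (one feature object per line, as tables
    stored with UnenclosedJsonInputFormat emit) into an enclosed
    FeatureSet-style document."""
    features = [json.loads(line) for line in lines if line.strip()]
    return {
        "geometryType": geometry_type,
        "spatialReference": {"wkid": wkid},
        "features": features,
    }

# One made-up feature standing in for a line of a Hive output file:
unenclosed = ['{"attributes":{"count":7},"geometry":{"x":-73.99,"y":40.75}}']
doc = enclose(unenclosed, geometry_type="esriGeometryPoint")
print(json.dumps(doc))
```

Writing the result to a `.json` file gives something ArcMap-side tooling can ingest, assuming the geometry type and spatial reference match the data.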

After you take a look at that code, please post any questions or errors you run into with your kerberized cluster. (I didn't see any errors caused by your cluster being kerberized in the original issue post. If you can post the actual error you get when running the GP tool, we could probably trace down the problem and fix it in the repo, and then the GP tool would work for others with kerberized clusters in the future.)