Esri / gis-tools-for-hadoop

The GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.
http://esri.github.io/gis-tools-for-hadoop/
Apache License 2.0

Job Not Completing #48

Closed mkeller3 closed 8 years ago

mkeller3 commented 8 years ago

I am trying to count how many points are within each zip code in the United States and it takes over 21 hours to complete.

If I use a very basic JSON such as states, it runs within 3 minutes, but any time I try to use polygons that are far more complex, the job takes substantially longer.

Is there a reason for this?

randallwhitman commented 8 years ago

Is that in Hive? For better performance of point-in-polygon aggregation, try a custom map-reduce application with a quadtree index on the polygons, as in the sample.
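(The sample Randall refers to builds the index with `com.esri.core.geometry.QuadTree` from the Esri Geometry API. As a dependency-free illustration of the filter-then-refine idea, here is a minimal sketch: a cheap bounding-box filter rejects most polygons, and the exact ray-casting test runs only on the survivors. Class and method names are hypothetical, and a real index answers the filter step for many polygons at once instead of scanning each one.)

```java
// Hypothetical sketch of filter-then-refine point-in-polygon testing.
// A quadtree index generalizes the envelope filter: it returns only the
// polygons whose bounding boxes could contain the query point.
public class PointInPolygonSketch {

    // Cheap filter: does the polygon's bounding box contain the point?
    public static boolean envelopeMightContain(double[][] poly, double x, double y) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] v : poly) {
            minX = Math.min(minX, v[0]); maxX = Math.max(maxX, v[0]);
            minY = Math.min(minY, v[1]); maxY = Math.max(maxY, v[1]);
        }
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }

    // Exact test, run only on candidates that pass the filter:
    // standard ray-casting point-in-polygon.
    public static boolean contains(double[][] poly, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = poly.length - 1; i < poly.length; j = i++) {
            double xi = poly[i][0], yi = poly[i][1];
            double xj = poly[j][0], yj = poly[j][1];
            if ((yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

This also explains why zip codes are so much slower than states: without an index, every point is tested against every polygon, and zip-code polygons are both far more numerous and far more detailed than state outlines.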

mkeller3 commented 8 years ago

It is in Hive.

randallwhitman commented 8 years ago

Right - if the performance of the Hive query is not adequate, try custom MapReduce.

mkeller3 commented 8 years ago

Randall,

Running the custom MapReduce does not work either. I run into the error:

Error: GC overhead limit exceeded

randallwhitman commented 8 years ago

Can you give the JVM more memory?
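(On a Hadoop 2.x / YARN cluster, task heap is usually raised via the `mapreduce.*.java.opts` properties, with the container sizes raised to match. Property names vary by Hadoop version, the jar and driver names below are placeholders, and passing `-D` on the command line assumes the driver uses `ToolRunner`; treat the values as illustrative.)

```shell
# Illustrative heap/container sizes for map and reduce tasks
# (Hadoop 2.x property names; jar and driver names are hypothetical).
hadoop jar point-in-polygon.jar MyDriver \
  -D mapreduce.map.memory.mb=4096 \
  -D mapreduce.map.java.opts=-Xmx3276m \
  -D mapreduce.reduce.memory.mb=4096 \
  -D mapreduce.reduce.java.opts=-Xmx3276m \
  input/points output/counts
```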

mkeller3 commented 8 years ago

That did not help.

mkeller3 commented 8 years ago

Could this be due to the fact that only one mapper and one reducer are being used on such a large polygon dataset?

randallwhitman commented 8 years ago

Yes, single mapper and reducer would certainly cause performance and/or scalability issues.

mkeller3 commented 8 years ago

Is there any way to force more mappers and reducers to be used?

randallwhitman commented 8 years ago

In custom map-reduce, yes, you can set the number of reducers; the number of mappers is determined by the input splits.
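(With a `ToolRunner`-based driver, the reducer count can be set from the command line, or with `Job.setNumReduceTasks` in the driver code; more mappers can be coaxed out by lowering the maximum split size. Property names are Hadoop 2.x, and the jar/driver names are placeholders.)

```shell
# 16 reducers instead of the default, and ~64 MB max splits for more
# mappers (hypothetical jar and driver names).
hadoop jar point-in-polygon.jar MyDriver \
  -D mapreduce.job.reduces=16 \
  -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
  input/points output/counts
```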