Esri / spatial-framework-for-hadoop

The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
Apache License 2.0
367 stars 159 forks source link

Convert Hive UDFs to GenericUDF #54

Closed climbage closed 10 years ago

climbage commented 10 years ago

UDFs that extend GenericUDF have more control over how the objects passed to them are handled. Here are a few of the benefits of switching to generic - I'll use ST_Contains as an example.

Simplified expressions

ST_Contains takes two arguments and checks to see if the first geometry contains the geometry argument.

ST_Contains(ST_GeomFromText('POLYGON (( ... ))'), ST_Point(longitude, latitude))

Using GenericUDF we can support different data types and simplify the previous expression to

ST_Contains('POLYGON (( ... ))', ST_Point(longitude, latitude))

Optimizing for constants

In the last expression, 'POLYGON (( ... ))' is a literal constant. We can use this property to optimize at least two different aspects of the operator.

  1. only create the OGCGeometry once per instance of the GenericUDF class
  2. if the first argument to ST_Contains is constant, we can accelerate the corresponding OGCGeometry.

Eliminate unneeded (de)serialization between UDFs

This isn't necessarily related to GenericUDF conversion, but can happen at the same time.

Example:

ST_Contains('POLYGON (...)', ST_Point(lon,lat))

In this case, ST_Point serializes the OGCPoint to bytes and then ST_Contains deserializes those bytes back to an OGCPoint. This is expected behavior when values are passed across nodes, but in the case of this example, each call to ST_Contains almost certainly happens on the same node as the corresponding call to ST_Point. This is not a huge performance hit for points, but for larger polygons it will be.

ddkaiser commented 10 years ago

I don't remember the exact discussion but I thought the vote against GenericUDF initially was due to some sort of performance penalty (perhaps Java reflection being in play?) so from a purely optimal performance case of any unknown problem, the GenericUDF pattern was slower? (I don't have any notes or anything but that seemed like something we had discussed.)

Having said that, even without the simplified expressions aiding a query author by simplifying language or implementation... I would guess the "Optimizing for constants" argument you presented sounds like a perfect reason to implement. Accelerating and caching a constructed geometry would likely outperform any penalty due to reflection, especially as the record count (number of times the accelerated geometry is re-used) increases.

climbage commented 10 years ago

I believe it was due to time constraints and ease of development, but you could be right. I would think UDF suffers more from reflection than GenericUDF.

Either way, limited testing shows a ~40% decrease in cumulative CPU time with ST_Contains extends GenericUDF on 14 million points.

I'll have it in a branch later today/this weekend.

climbage commented 10 years ago

Using a larger, more complex polygon as the first argument to generic ST_Contains yields a ~300% decrease in cumulative CPU time. 130 seconds vs 410 seconds.

climbage commented 10 years ago

More performance numbers using the source in the genericudf branch.

Running the sample query on a dataset with 14 million points takes 13 minutes before optimization and just under a minute after optimization.

SELECT counties.name, count(*) cnt FROM counties
JOIN gold.faa
WHERE ST_Contains(counties.boundaryshape, ST_Point(faa.longitude, faa.latitude))
GROUP BY counties.name
ORDER BY cnt desc;

image