climbage closed this issue 10 years ago
I don't remember the exact discussion, but I thought the initial vote against GenericUDF was due to some sort of performance penalty (perhaps Java reflection being in play?), so that in a pure best-case performance comparison the GenericUDF pattern was slower. (I don't have any notes, but that seemed like something we had discussed.)
Having said that, even setting aside how the simplified expressions help a query author... I would guess the "Optimizing for constants" argument you presented is a perfect reason to implement it. Accelerating and caching a constructed geometry would likely outweigh any penalty due to reflection, especially as the record count (the number of times the accelerated geometry is re-used) increases.
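To make the caching argument concrete, here is a minimal sketch of the pattern. It deliberately does not use the Hive or Esri APIs: `Box`, `parse`, and `ContainsUdf` are hypothetical stand-ins, and the `initialize`/`evaluate` split only mimics the shape of a generic UDF. In Hive itself I would expect the constant check to happen in `GenericUDF.initialize` by looking for a `ConstantObjectInspector`, but treat that mapping as an assumption, not as what the branch does.

```java
/** Sketch only: stand-in types, not the Hive or Esri geometry API. */
class ConstantCachingSketch {

    /** Hypothetical stand-in for a constructed (and accelerated) geometry. */
    static final class Box {
        final double minX, minY, maxX, maxY;

        Box(double minX, double minY, double maxX, double maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        boolean contains(double x, double y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    /** Counts parses so the effect of caching is observable. */
    static int parseCount = 0;

    /** Hypothetical parser for "minX minY maxX maxY"; stands in for building a geometry from text. */
    static Box parse(String text) {
        parseCount++;
        String[] p = text.trim().split("\\s+");
        return new Box(Double.parseDouble(p[0]), Double.parseDouble(p[1]),
                       Double.parseDouble(p[2]), Double.parseDouble(p[3]));
    }

    /** Mimics the initialize-once / evaluate-per-row shape of a generic UDF. */
    static final class ContainsUdf {
        private Box cached; // non-null only when the first argument is a constant

        /** Called once per instance; pass null when the first argument varies by row. */
        void initialize(String constantFirstArg) {
            cached = (constantFirstArg == null) ? null : parse(constantFirstArg);
        }

        /** Called once per row; re-parses only when nothing was cached. */
        boolean evaluate(String firstArg, double x, double y) {
            Box box = (cached != null) ? cached : parse(firstArg);
            return box.contains(x, y);
        }
    }

    public static void main(String[] args) {
        ContainsUdf udf = new ContainsUdf();
        udf.initialize("0 0 10 10");
        int hits = 0;
        for (int i = 0; i < 1000; i++) {
            if (udf.evaluate("0 0 10 10", i % 20, 5)) {
                hits++;
            }
        }
        System.out.println("hits=" + hits + ", parses=" + parseCount);
    }
}
```

With a constant first argument the parse happens once regardless of row count; calling `initialize(null)` instead falls back to one parse per row, which is the cost the caching is meant to avoid.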
I believe it was due to time constraints and ease of development, but you could be right. I would think `UDF` suffers more from reflection than `GenericUDF`.
Either way, limited testing shows a ~40% decrease in cumulative CPU time with `ST_Contains extends GenericUDF` on 14 million points.
I'll have it in a branch later today/this weekend.
Using a larger, more complex polygon as the first argument to generic `ST_Contains` cuts cumulative CPU time by roughly 68% (about a 3x speedup): 130 seconds vs 410 seconds.
More performance numbers using the source in the genericudf branch.
Running the sample query on a dataset with 14 million points takes 13 minutes before optimization and just under a minute after optimization.
SELECT counties.name, count(*) cnt FROM counties
JOIN gold.faa
WHERE ST_Contains(counties.boundaryshape, ST_Point(faa.longitude, faa.latitude))
GROUP BY counties.name
ORDER BY cnt desc;
UDFs that extend `GenericUDF` have more control over how the objects passed to them are handled. Here are a few of the benefits of switching to generic - I'll use `ST_Contains` as an example.

Simplified expressions

`ST_Contains` takes two arguments and checks whether the first geometry contains the second geometry argument.

Using `GenericUDF` we can support different data types and simplify the previous expression.

Optimizing for constants
In the last expression, `'POLYGON (( ... ))'` is a literal constant. We can use this property to optimize at least two different aspects of the operator.

- Construct the `OGCGeometry` once per instance of the `GenericUDF` class
- When the first argument to `ST_Contains` is constant, we can accelerate the corresponding `OGCGeometry`.

Eliminate unneeded (de)serialization between UDFs
This isn't necessarily related to GenericUDF conversion, but can happen at the same time.
Example:
In this case, `ST_Point` serializes the `OGCPoint` to bytes and then `ST_Contains` deserializes those bytes back to an `OGCPoint`. This is expected behavior when values are passed across nodes, but in this example each call to `ST_Contains` almost certainly happens on the same node as the corresponding call to `ST_Point`. This is not a huge performance hit for points, but for larger polygons it will be.
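One possible way to avoid the round trip, sketched under heavy assumptions: have the producing UDF hand back a value that keeps the live object and only serializes when bytes are actually requested. Nothing here is Hive or Esri API; `Point`, `PointValue`, and the 16-byte encoding are hypothetical, and whether Hive's object inspectors would allow passing such a value between UDFs is not something this sketch establishes.

```java
import java.nio.ByteBuffer;

/** Sketch only: lazy (de)serialization with stand-in types, not the Hive or Esri API. */
class LazyBytesSketch {

    /** Counters so skipped work is observable. */
    static int serializeCount = 0;
    static int deserializeCount = 0;

    /** Hypothetical stand-in for an OGC point. */
    static final class Point {
        final double x, y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Holds a live object, its bytes, or both; converts only on demand. */
    static final class PointValue {
        private Point point;
        private byte[] bytes;

        static PointValue ofPoint(Point p) {
            PointValue v = new PointValue();
            v.point = p;
            return v;
        }

        static PointValue ofBytes(byte[] b) {
            PointValue v = new PointValue();
            v.bytes = b;
            return v;
        }

        /** Same-process consumers take this path and never touch bytes. */
        Point asPoint() {
            if (point == null) {
                deserializeCount++;
                ByteBuffer buf = ByteBuffer.wrap(bytes);
                double x = buf.getDouble();
                double y = buf.getDouble();
                point = new Point(x, y);
            }
            return point;
        }

        /** Only needed when the value really has to leave the process. */
        byte[] asBytes() {
            if (bytes == null) {
                serializeCount++;
                bytes = ByteBuffer.allocate(16).putDouble(point.x).putDouble(point.y).array();
            }
            return bytes;
        }
    }

    public static void main(String[] args) {
        PointValue produced = PointValue.ofPoint(new Point(1.5, -2.0));
        Point consumed = produced.asPoint(); // no serialization on the same node
        System.out.println(consumed.x + "," + consumed.y
                + " serialize=" + serializeCount + " deserialize=" + deserializeCount);
    }
}
```

The trade-off is that the value type becomes part of the contract between UDFs, so every consumer has to understand it; the byte path stays available for the cross-node case.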