Factual / geo

Clojure library for working with geohashes, polygons, and other world geometry
Eclipse Public License 1.0
301 stars 17 forks source link

Allow transform-geom to use externally created coordinate transforms #43

Closed willcohen closed 5 years ago

willcohen commented 6 years ago

Current functionality of transform-geom either allows for 1) providing a single target CRS or 2) forcing a transformation between two CRSs, but either way this invokes the creation of a CoordinateTransform every time, which is (by far) the most computationally expensive operation. In cases where there's a bunch of geometries to be transformed in the exact same way (all the objects coming out of a shapefile or other geojson, for example), it'd be much more efficient to be able to pass a single transformation and apply that one transform to each. This PR changes transform-geom's two-argument arity to work with either a target CRS or a CoordinateTransform.

Creating a single transform takes multiple milliseconds.

(bench
  (crs/create-transform 4326 23031))
Evaluation count : 8640 in 60 samples of 144 calls.
             Execution time mean : 7.293338 ms
    Execution time std-deviation : 179.947793 µs
   Execution time lower quantile : 7.018824 ms ( 2.5%)
   Execution time upper quantile : 7.565371 ms (97.5%)
                   Overhead used : 9.212071 ns

Found 1 outliers in 60 samples (1.6667 %)
    low-severe   1 (1.6667 %)
 Variance from outliers : 12.5728 % Variance is moderately inflated by outliers

Applying an existing transform to one point takes less than 1 microsecond:

(let [t (crs/create-transform 4326 23031)]
  (bench (transform-geom (point 3.8142776 51.285914) t)))
Evaluation count : 125519880 in 60 samples of 2091998 calls.
             Execution time mean : 479.327737 ns
    Execution time std-deviation : 11.537641 ns
   Execution time lower quantile : 464.750123 ns ( 2.5%)
   Execution time upper quantile : 492.211072 ns (97.5%)
                   Overhead used : 9.212071 ns

Found 1 outliers in 60 samples (1.6667 %)
    low-severe   1 (1.6667 %)
 Variance from outliers : 11.0549 % Variance is moderately inflated by outliers

Applying an existing transform to a multipolygon with multiple coordinates, still takes something on the order of 1 microsecond per coordinate:

(let [t (crs/create-transform 4326 23031)]
  (bench (transform-geom (multi-polygon-wkt [[[10 10, 80 10, 80 80, 10 80, 10 10],
                                              [20 20, 20 30, 30 30, 30 20, 20 20],
                                              [40 20, 40 30, 50 30, 50 20, 40 20]]]) t)))
Evaluation count : 2888040 in 60 samples of 48134 calls.
             Execution time mean : 21.402541 µs
    Execution time std-deviation : 587.682842 ns
   Execution time lower quantile : 20.578319 µs ( 2.5%)
   Execution time upper quantile : 22.384869 µs (97.5%)
                   Overhead used : 9.212071 ns

Found 1 outliers in 60 samples (1.6667 %)
    low-severe   1 (1.6667 %)
 Variance from outliers : 14.2229 % Variance is moderately inflated by outliers
willcohen commented 5 years ago

@worace any thoughts on this PR?

willcohen commented 5 years ago

Makes sense. I think I'd need to add Integer as well since transform-geom lets us use EPSG codes as integers as well. It shouldn't be too complicated but this might mean re-working the logic of transform-geom to deal with the int/string split a little earlier on. I'll try to look at this over the next few days.

I'm not actually sure what the overhead of the instance? is relative to these actual operations are, so I have no idea what the performance improvement will end up being, but it doesn't seem like it'd hurt anything in any case.

willcohen commented 5 years ago

@worace I've been working through this a few different ways and am still not sure if it'll end up having enough of a performance increase to be worth the added complexity, but I still like the idea. Either way, I think it should hold off until after #44, since this PR would actually make crs's create-transform much more commonly used from a public API perspective, and that's the only part of the existing API that actually changes with that dependency change.

worace commented 5 years ago

@willcohen got it. well thanks to looking into it. I'm happy with this way if the other approach is going to be a huge pain. I'll check out that other PR first though.

willcohen commented 5 years ago

@worace There's now a Transformable protocol in the crs namespace. Certainly open to changing the name. The first upside to the protocol is that transformations can work from origin-target where either is a Long, Integer, String, and now CoordinateReferenceSystem, which is an improvement.

There's also a more consistent way of trying to deduce the SRID for any of those 4 objects, which should increase the number of cases where the SRID can be correctly set at the end. (See the added number of cases for get-srid in t_jts. There's currently now both a jts/get-srid, which works on Geometry, and a crs/get-srid, which works on Transformable. The overlap doesn't seem problematic to me, but let me know if you'd prefer a different name.)

I also rebased this pre-emptively on #44, assuming that will land first.

The one remaining consideration is that this still currently has an instance? and satisfies? check on transform-geom. This is because the two-arity has a little ambiguity. I think we want to enable two things -- first, if t is a Transform itself, just apply that; second, if t is a Transformable target, then we should try to transform the geom from its current CRS to the new one. The only way to avoid that comparison is to try to add CoordinateTransform to our the new Transformable protocol, but that feels a little strange to me. Either way, the previous logic of all of the branches in transform-geom now lives in create-transform.

willcohen commented 5 years ago

Unscientific benchmark updates:

With this new protocol, the creation of transforms and application to geoms is similar but now actually a little slower -- I'm not sure, but this overhead may be due to the fact that there's now more functionality available in what can be passed to that transformation and how well it interprets and sets the target SRID via crs/get-srid, which seems like an acceptable tradeoff to me. If you have suggestions for how to make this more efficient, I'd be happy to give it a shot!

Before protocol: 8ms

(bench
  (crs/create-transform 4326 23031))
Evaluation count : 7440 in 60 samples of 124 calls.
             Execution time mean : 8.088982 ms
    Execution time std-deviation : 122.779458 µs
   Execution time lower quantile : 7.890932 ms ( 2.5%)
   Execution time upper quantile : 8.374483 ms (97.5%)
                   Overhead used : 7.551538 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers

After protocol: 8.5ms

(bench
  (crs/create-transform 4326 23031))
Evaluation count : 7080 in 60 samples of 118 calls.
             Execution time mean : 8.557325 ms
    Execution time std-deviation : 168.217638 µs
   Execution time lower quantile : 8.285679 ms ( 2.5%)
   Execution time upper quantile : 8.885056 ms (97.5%)
                   Overhead used : 7.790776 ns

Before protocol: 475 ns

(let [t (crs/create-transform 4326 23031)]
  (bench (transform-geom (point 3.8142776 51.285914) t)))
Evaluation count : 125127960 in 60 samples of 2085466 calls.
             Execution time mean : 479.239794 ns
    Execution time std-deviation : 7.021314 ns
   Execution time lower quantile : 468.222229 ns ( 2.5%)
   Execution time upper quantile : 492.111577 ns (97.5%)
                   Overhead used : 7.551538 ns

After protocol: 495 ns

(let [t (crs/create-transform 4326 23031)]
  (bench (transform-geom (point 3.8142776 51.285914) t)))
Evaluation count : 123639060 in 60 samples of 2060651 calls.
             Execution time mean : 495.077556 ns
    Execution time std-deviation : 9.900904 ns
   Execution time lower quantile : 475.391011 ns ( 2.5%)
   Execution time upper quantile : 509.360320 ns (97.5%)
                   Overhead used : 7.790776 ns

Found 1 outliers in 60 samples (1.6667 %)
    low-severe   1 (1.6667 %)
 Variance from outliers : 7.8812 % Variance is slightly inflated by outliers

Before protocol: 20 microseconds

(let [t (crs/create-transform 4326 23031)]
  (bench (transform-geom (multi-polygon-wkt [[[10 10, 80 10, 80 80, 10 80, 10 10],
                                              [20 20, 20 30, 30 30, 30 20, 20 20],
                                              [40 20, 40 30, 50 30, 50 20, 40 20]]]) t)))
Evaluation count : 2957760 in 60 samples of 49296 calls.
             Execution time mean : 19.682421 µs
    Execution time std-deviation : 445.235498 ns
   Execution time lower quantile : 19.016432 µs ( 2.5%)
   Execution time upper quantile : 20.577034 µs (97.5%)
                   Overhead used : 7.551538 ns

Found 1 outliers in 60 samples (1.6667 %)
    low-severe   1 (1.6667 %)
 Variance from outliers : 10.9740 % Variance is moderately inflated by outliers

After protocol: 21 microseconds

(let [t (crs/create-transform 4326 23031)]
  (bench (transform-geom (multi-polygon-wkt [[[10 10, 80 10, 80 80, 10 80, 10 10],
                                              [20 20, 20 30, 30 30, 30 20, 20 20],
                                              [40 20, 40 30, 50 30, 50 20, 40 20]]]) t)))
Evaluation count : 2998200 in 60 samples of 49970 calls.
             Execution time mean : 20.966107 µs
    Execution time std-deviation : 377.768999 ns
   Execution time lower quantile : 20.049906 µs ( 2.5%)
   Execution time upper quantile : 21.608664 µs (97.5%)
                   Overhead used : 7.790776 ns

Found 3 outliers in 60 samples (5.0000 %)
    low-severe   3 (5.0000 %)
 Variance from outliers : 7.7763 % Variance is slightly inflated by outliers
willcohen commented 5 years ago

@worace I think this is looking better now -- I'd forgotten that proj4j can use .getSourceCRS to try to pull CRS identifiers out of CoordinateTransforms, which means that the logic for checking what SRID to set can be made a little simpler.

This also means that the structure for tf and transform-geom could be streamlined -- the source/target arity version of create-transform can be reduced to calling the function with a generated transform, and tf can always try to figure out the SRID even though it only accepts a CoordinateTransform.

It also means that jts/set-srid should be able to accept not just an integer as its coordinate argument, but any kind of Transformable where crs/get-srid can be used.

worace commented 5 years ago

Looks great to me. Nice work with all the protocol stuff. I am not as concerned with the perf aspects as with the structure and avoiding lots of instance? calls for different use cases. But thanks for looking into the perf stuff anyway.